[Question title]: The 'html2text' module not working when using with 'urllib.request' module
[Posted]: 2020-10-05 11:16:10
[Question]:

I want to get all the text of a web page, so I tried using the html2text module together with the urllib.request module:

import urllib.request 
import html2text
request_url = urllib.request.urlopen('https://dev.to/justdevasur/let-s-perform-google-search-with-python-2gpi') 
u=request_url.read()
print(html2text.html2text(u))
print('Done')

But I get the following error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\rauna\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.8_qbz5n2kfra8p0\LocalCache\local-packages\Python38\site-packages\html2text\__init__.py", line 947, in html2text
    return h.handle(html)
  File "C:\Users\rauna\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.8_qbz5n2kfra8p0\LocalCache\local-packages\Python38\site-packages\html2text\__init__.py", line 142, in handle
    self.feed(data)
  File "C:\Users\rauna\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.8_qbz5n2kfra8p0\LocalCache\local-packages\Python38\site-packages\html2text\__init__.py", line 138, in feed
    data = data.replace("</' + 'script>", "</ignore>")
TypeError: a bytes-like object is required, not 'str'

[Comments]:

    Tags: python python-3.x web-scraping urllib3


    [Answer 1]:

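    The error message is easy to misread. read() returns bytes, but html2text expects a str: inside feed() it calls data.replace(...) with str arguments, and bytes.replace rejects those with exactly this TypeError. So decode the response body before passing it to html2text:

    import urllib.request
    import html2text
    url = 'https://dev.to/justdevasur/let-s-perform-google-search-with-python-2gpi'
    with urllib.request.urlopen(url) as response:
        # read() returns bytes; decode to str (UTF-8 assumed) before calling html2text
        html = response.read().decode('utf-8')
    print(html2text.html2text(html))
    print('Done')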
    

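    However, this request returned a 403 when I tried it, and html2text also seemed to have Python 3 compatibility problems. See, for example, this question.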
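
For reference, the TypeError in the traceback can be reproduced without any network access or third-party package. The sketch below uses a made-up byte string in place of the real page body; it only illustrates the bytes versus str mismatch and is not html2text's own code.

```python
# Sample bytes standing in for what response.read() returns (not the real page).
raw = b"<p>Hello</p>"

try:
    # bytes.replace called with str arguments: the same kind of call that
    # fails inside html2text when it is handed bytes.
    raw.replace("<p>", "")
except TypeError as exc:
    error_message = str(exc)

# Decoding first yields a str, and str.replace with str arguments works.
text = raw.decode("utf-8")
cleaned = text.replace("<p>", "").replace("</p>", "")

print(error_message)
print(cleaned)
```

The first print shows the same message as the traceback, which is why the message reads backwards at first glance: it is bytes.replace complaining about its str arguments.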

    Instead, I would suggest a different approach, for example:

    import requests
    from bs4 import BeautifulSoup

    url = 'https://dev.to/justdevasur/let-s-perform-google-search-with-python-2gpi'
    headers = {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
        # "br" left out: requests decodes Brotli only when a brotli package is installed
        "Accept-Encoding": "gzip, deflate",
        "Accept-Language": "en-GB,en;q=0.5",
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:81.0) Gecko/20100101 Firefox/81.0",
    }

    # Pass headers by keyword; the second positional argument of requests.get is params
    response = requests.get(url, headers=headers)
    heading = BeautifulSoup(response.text, "html.parser").find("h1")
    print(heading.get_text(strip=True))
    

    Prints: Let's perform Google search with python
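
The example above prints only the h1 heading, while the question asks for all the text of the page. With BeautifulSoup, calling get_text() on the whole soup is one way to get that. As a rough standard-library-only sketch of the same idea, the code below collects text with html.parser and fetches pages with urllib.request. The names TextExtractor, html_to_text and fetch_html are hypothetical helpers, the "Mozilla/5.0" User-Agent is an assumption that may or may not avoid the 403, and the sample HTML is made up for illustration.

```python
import urllib.request
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        stripped = data.strip()
        if stripped and self._skip_depth == 0:
            self.chunks.append(stripped)


def html_to_text(html):
    """Return the text of an HTML string, one text node per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


def fetch_html(url, user_agent="Mozilla/5.0"):
    """Download a page and decode it to str (User-Agent value is an assumption)."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request) as response:
        # Use the charset from the Content-Type header, falling back to UTF-8.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


# Offline demonstration on a made-up document.
sample = (
    "<html><head><style>p{}</style></head><body>"
    "<h1>Title</h1><p>Some <b>bold</b> text.</p>"
    "<script>var x = 1;</script></body></html>"
)
print(html_to_text(sample))
```

For a real page, html_to_text(fetch_html(url)) would chain the two steps. Unlike html2text, this produces plain lines of text rather than Markdown.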

    [Comments]:
