The 'html2text' module does not work when used with the 'urllib.request' module

【Question title】: The 'html2text' module does not work when used with the 'urllib.request' module
【Posted】: 2020-10-05 11:16:10
【Question】:

I want to get all the text of a web page, so I tried using the html2text module together with the urllib.request module:

import urllib.request 
import html2text
request_url = urllib.request.urlopen('https://dev.to/justdevasur/let-s-perform-google-search-with-python-2gpi') 
u=request_url.read()
print(html2text.html2text(u))
print('Done')

But I get the following error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\rauna\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.8_qbz5n2kfra8p0\LocalCache\local-packages\Python38\site-packages\html2text\__init__.py", line 947, in html2text
    return h.handle(html)
  File "C:\Users\rauna\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.8_qbz5n2kfra8p0\LocalCache\local-packages\Python38\site-packages\html2text\__init__.py", line 142, in handle
    self.feed(data)
  File "C:\Users\rauna\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.8_qbz5n2kfra8p0\LocalCache\local-packages\Python38\site-packages\html2text\__init__.py", line 138, in feed
    data = data.replace("</' + 'script>", "</ignore>")
TypeError: a bytes-like object is required, not 'str'

Tags: python python-3.x web-scraping urllib3

【Solution 1】:

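The error message is easy to misread. urlopen(...).read() returns bytes, and html2text then calls data.replace(...) on it with str arguments (see the last frame of the traceback), which is what raises the TypeError. html2text expects a str, so decode the response body before passing it in:

import urllib.request
import html2text

request_url = urllib.request.urlopen('https://dev.to/justdevasur/let-s-perform-google-search-with-python-2gpi')
# read() returns bytes; decode to str (assuming the page is UTF-8 encoded)
html = request_url.read().decode('utf-8')
print(html2text.html2text(html))
print('Done')

However, this request can still be rejected with HTTP 403, and html2text also appears to have had compatibility problems with Python 3. See, for example, this question.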
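The TypeError in the traceback can be reproduced without any network access or third-party package. The following is a minimal sketch using only built-in types; the sample HTML string is made up for illustration, and the replace call only imitates the one shown in the traceback rather than reproducing html2text's internals.

```python
# Sample data standing in for what urlopen(...).read() returns (bytes).
raw = b"<p>Hello</p></script>"

# Calling bytes.replace() with str arguments fails, as in the traceback.
try:
    raw.replace("</script>", "</ignore>")
    message = ""
except TypeError as exc:
    message = str(exc)

# Decoding to str first makes the same call succeed.
text = raw.decode("utf-8")
fixed = text.replace("</script>", "</ignore>")

print(message)
print(fixed)
```

The same reasoning applies to the question's code: the value passed to html2text.html2text should be a str, not the raw bytes from read().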

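I would suggest a different approach, for example requests together with BeautifulSoup (this example extracts only the page's h1 heading):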

import requests
from bs4 import BeautifulSoup

# Browser-like headers; "br" is left out of Accept-Encoding because requests
# only decodes Brotli when an optional brotli package is installed.
headers = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "Accept-Encoding": "gzip, deflate",
    "Accept-Language": "en-GB,en;q=0.5",
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:81.0) Gecko/20100101 Firefox/81.0",
}

url = 'https://dev.to/justdevasur/let-s-perform-google-search-with-python-2gpi'
# headers must be passed by keyword; the second positional argument is params.
html = requests.get(url, headers=headers).text
heading = BeautifulSoup(html, "html.parser").find("h1")
print(heading.get_text(strip=True))

Prints: Let's perform Google search with python
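If you would rather stay with urllib.request, a rough sketch of the same idea is to send an explicit User-Agent header and decode the body before handing it to any text converter. The helper names build_request and fetch_html and the USER_AGENT value are hypothetical, chosen for illustration; whether a given site accepts this header is an assumption, not something verified here.

```python
import urllib.request

# Hypothetical User-Agent value; any browser-like string may work.
USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64; rv:81.0) Gecko/20100101 Firefox/81.0"


def build_request(url: str) -> urllib.request.Request:
    """Create a request that carries an explicit User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


def fetch_html(url: str) -> str:
    """Download a page and return its body as str rather than bytes."""
    with urllib.request.urlopen(build_request(url)) as response:
        # Use the charset from the Content-Type header, falling back to UTF-8.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

The str returned by fetch_html(url) could then be passed to html2text.html2text(...) or to BeautifulSoup, since both work on decoded text.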
