Python requests is not extracting all elements

【Question title】: Python requests is not extracting all elements
【Posted】: 2018-11-15 09:38:40
【Question】:

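I am trying to extract the TR data from the following page: http://www.datasheetcatalog.com/catalog/p1342320.shtml

I am using requests and BeautifulSoup. However, I am not getting all the rows (only 12 rows instead of 22 in the second table). Does anyone have an explanation for this, given that the rows are present when I print response.content?

Here is the code I am using: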

from bs4 import BeautifulSoup
import requests

session = requests.Session()

url = 'http://www.datasheetcatalog.com/catalog/p1342320.shtml'
response = session.get(url)

soup = BeautifulSoup(response.content, "lxml")

trs = soup.find_all('table')[8].find_all('tr')
print(len(trs))

【Comments】:

I get 22 as the output of print(len(tr2))... What is your desired output?
Strange! ...I get 12 instead of 22
@Andersson Which Python version are you using?
I am using Python 3.6
Yes, still getting 22
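
The comments above do not settle why the same code gives 12 on one machine and 22 on another. One possible factor (an assumption, not something the thread confirms) is that the two environments have different versions of the parsing libraries installed. A minimal sketch for listing them, where the helper name installed_versions is hypothetical:

```python
import sys
from importlib import metadata


def installed_versions(packages=("beautifulsoup4", "lxml", "html5lib")):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The package is not installed in this environment.
            versions[name] = None
    return versions


print("Python", sys.version.split()[0])
for name, version in installed_versions().items():
    print(name, version or "not installed")
```

Comparing this output between the two machines shows whether the environments actually match before looking for other causes.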

Tags: python-3.x beautifulsoup python-requests

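【Solution 1】:

After inspecting the HTML page closely, I found that BeautifulSoup (with the lxml parser) stops parsing once it hits the comments in the markup. The solution is to change the parser from "lxml" to "html5lib":

soup = BeautifulSoup(response.content, "html5lib")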
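
Since the row count depends on the parser, it can help to run the same markup through every parser that is installed and compare the results. The following is a rough sketch under a few assumptions: the helper name count_rows_per_parser and the sample fragment are made up for illustration, and the helper counts every tr in the document rather than only the rows of the ninth table.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def count_rows_per_parser(html, parsers=("html.parser", "lxml", "html5lib")):
    """Return {parser name: number of <tr> elements}, skipping missing parsers."""
    counts = {}
    for parser in parsers:
        try:
            soup = BeautifulSoup(html, parser)
        except FeatureNotFound:
            # lxml and html5lib are optional third-party packages.
            continue
        counts[parser] = len(soup.find_all("tr"))
    return counts


# Hypothetical well-formed fragment; on valid markup the parsers should agree.
sample = "<table><tr><td>a</td></tr><tr><td>b</td></tr></table>"
print(count_rows_per_parser(sample))
```

Passing response.content from the question instead of sample would show which parsers lose rows on the real page; a disagreement between parsers is a strong hint that the markup itself is malformed.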

【Comments】:

@ewwink Thanks for your contribution

【Solution 2】:

The HTML is invalid, which breaks BeautifulSoup. The following repairs the markup before parsing:

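....
import re  # needed for re.sub below; place it with the other imports

# Repair the malformed '<table <' opening sequence.
html_doc = response.text.replace('<table <', '<')
# Remove the numbered pseudo-comments of the form '<!-- N --!>'.
html_doc = re.sub(r'<\!--\s+\d+\s+--\!>', '', html_doc)
# Strip <font> tags while keeping their contents.
html_doc = re.sub(r'</?font.*?>', '', html_doc)
soup = BeautifulSoup(html_doc, "html.parser")

trs = soup.find_all('table')[8].find_all('tr')
print(len(trs))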

Note: using lxml returns 7 instead of 22
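
To see what the three repairs in this answer do, they can be tried on a small string without fetching the page. This is only an illustrative sketch: the function name clean_markup and the sample fragment are hypothetical and not taken from the real page.

```python
import re


def clean_markup(html_doc: str) -> str:
    """Apply the three repairs from the answer to raw HTML text."""
    # Collapse the broken opening sequence '<table <' to a single '<'.
    html_doc = html_doc.replace('<table <', '<')
    # Drop numbered pseudo-comments such as '<!-- 14 --!>'.
    html_doc = re.sub(r'<!--\s+\d+\s+--!>', '', html_doc)
    # Strip opening and closing <font> tags, keeping their contents.
    html_doc = re.sub(r'</?font.*?>', '', html_doc)
    return html_doc


# Hypothetical fragment, made up for illustration.
sample = '<table <table><tr><td><font color="red">A</font><!-- 14 --!></td></tr></table>'
cleaned = clean_markup(sample)
print(cleaned)  # -> <table><tr><td>A</td></tr></table>
```

Wrapping the substitutions in one function keeps the cleanup in a single place, so the result can be handed to whichever parser is being tested.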

【Comments】:

Simply changing the parser from lxml to html5lib did the trick! Thanks