Unable to extract text from an HTML page in Python

【Question title】: Unable to extract text from an HTML page in Python
【Posted】: 2016-12-20 14:16:55
【Question】:

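I am very new to web scraping. I read about BeautifulSoup and tried to use it, but I cannot extract the text of the elements with the class name "company-desc-and-sort-container". I cannot even extract the title from the HTML page. Here is the code I tried: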

from BeautifulSoup import BeautifulSoup
import requests

url= 'http://fortune.com/best-companies/'    
r = requests.get(url)

soup = BeautifulSoup(r.text)

#print soup.prettify()[0:1000]
print soup.find_all("title")

letters = soup.find_all("div", class_="company-desc-and-sort-container")

I get the following error:

 print soup.find_all("title")
TypeError: 'NoneType' object is not callable

【Comments】:

What version of BeautifulSoup are you using?

Tags: python beautifulsoup html-parsing

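【Solution 1】:

You are using BeautifulSoup version 3, which is not only no longer maintained but also has no find_all() method (the version 3 spelling is findAll()). Because dot notation on a BeautifulSoup object is a shortcut for find(), soup.find_all makes BeautifulSoup look for an element whose tag name is "find_all". No such element exists, so the lookup returns None. Python then evaluates None("title"), which raises: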

TypeError: 'NoneType' object is not callable
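The mechanism can be reproduced without BeautifulSoup at all. The sketch below is a hypothetical stand-in (the TagLike class and its dictionary of children are made up for illustration and are not BeautifulSoup's actual implementation); it only assumes that unknown attributes are forwarded to find(), as described above.

```python
class TagLike:
    """Hypothetical stand-in for a parsed document; not BeautifulSoup code."""

    def __init__(self, children):
        # Maps a child tag name to some object representing that child.
        self._children = children

    def find(self, name):
        # Returns the matching child, or None when there is no such child.
        return self._children.get(name)

    def __getattr__(self, name):
        # Python calls this only when normal attribute lookup fails,
        # so soup.title works, and so does any misspelled method name.
        return self.find(name)


soup = TagLike({"title": "<title>Best Companies</title>"})

print(soup.title)     # the stored child
print(soup.find_all)  # None: there is no child named "find_all"

error_message = None
try:
    soup.find_all("title")  # effectively None("title")
except TypeError as exc:
    error_message = str(exc)
    print(error_message)
```

Because the missing method name is silently treated as a tag lookup, the failure shows up as a TypeError on the call rather than as an AttributeError on the name.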

Upgrade to BeautifulSoup version 4. Replace:

from BeautifulSoup import BeautifulSoup

with:

from bs4 import BeautifulSoup

Make sure the beautifulsoup4 package is installed:

pip install --upgrade beautifulsoup4
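With version 4 installed, the script from the question could look roughly like the sketch below. To keep it runnable offline, it parses a small made-up HTML string (SAMPLE_HTML is an assumption, not the real Fortune page) instead of calling requests.get(); the explicit "html.parser" argument selects the parser bundled with Python.

```python
from bs4 import BeautifulSoup

# Made-up stand-in for the downloaded page; the real markup may differ.
SAMPLE_HTML = """
<html>
  <head><title>Best Companies</title></head>
  <body>
    <div class="company-desc-and-sort-container">First description</div>
    <div class="company-desc-and-sort-container">Second description</div>
  </body>
</html>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find_all() exists in version 4 and returns a list of matching tags.
titles = soup.find_all("title")
print(titles)

# class_ (with a trailing underscore) filters on the CSS class attribute.
containers = soup.find_all("div", class_="company-desc-and-sort-container")
descriptions = [div.get_text(strip=True) for div in containers]
print(descriptions)
```

To run this against the live site, pass r.text from the question's requests.get(url) call in place of SAMPLE_HTML. Whether the downloaded HTML actually contains elements with that class is not something this thread confirms.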

【Comments】:

【Solution 2】:

soup.find_all("title")

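This is the call that fails: soup.find_all evaluates to None, and calling None raises the error. Note that when find_all() is available, it returns a list of every matching tag (an empty list if nothing matches), not a single tag. If you only want the title, use find() instead, which returns the first title tag, or None if there is none.

Also, does the HTML page actually have a title tag? Search for it, and print it only if one was found.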
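The check suggested here can be sketched as follows, assuming BeautifulSoup 4 is installed. The page_title helper and the two inline HTML strings are hypothetical and exist only for illustration.

```python
from bs4 import BeautifulSoup


def page_title(html):
    """Return the text of the first <title> tag, or None if there is none."""
    soup = BeautifulSoup(html, "html.parser")
    title_tag = soup.find("title")  # first match, or None
    if title_tag is None:
        return None
    return title_tag.get_text(strip=True)


with_title = page_title("<html><head><title>Example page</title></head></html>")
without_title = page_title("<html><body><p>No title here</p></body></html>")

# Print the title only when one was found.
for result in (with_title, without_title):
    if result is not None:
        print(result)
    else:
        print("no <title> tag found")
```

Testing the result of find() against None before using it avoids an AttributeError on pages that have no title.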

【Comments】: