How to get objects from a div with BeautifulSoup in Python? Answers

【Question title】: How to get objects from div with BeautifulSoup in Python?
【Posted】: 2015-04-06 18:04:10
【Question】:

I'm not very familiar with BeautifulSoup. I have HTML like the following (this is only part of it):

<div class="central-featured-lang lang1" lang="en">
<a class="link-box" href="//en.wikibooks.org/">
<strong>English</strong><br>
<em>Open-content textbooks</em><br>
<small>51 000+ pages</small></a>
</div>

This is the output I should get (and the same for the other languages):

English: 51 000+ pages.

I tried something like this:

for item in soup.find_all('div'):
    print(item.get('class'))

But this doesn't work. Can you help me, or at least point me toward a solution?

【Comments】:

Tags: python html parsing beautifulsoup

【Solution 1】:

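item.get() returns the value of an attribute, not the text contained in the element.

You can use the Element.string attribute to get the text contained directly in an element, or the Element.get_text() method to get all of the contained text (recursively).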
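As a minimal sketch of how the three accessors differ, here is a trimmed version of the markup from the question. The variable names are only illustrative, and the built-in 'html.parser' is assumed as the parser:

```python
from bs4 import BeautifulSoup

# Trimmed copy of the question's markup (the <em> line is left out for brevity).
html = (
    '<div class="central-featured-lang lang1" lang="en">'
    '<a class="link-box" href="//en.wikibooks.org/">'
    '<strong>English</strong><br>'
    '<small>51 000+ pages</small></a>'
    '</div>'
)
soup = BeautifulSoup(html, 'html.parser')
div = soup.find('div')

# .get() looks up an attribute on the tag itself.
class_attr = div.get('class')   # list of class names
lang_attr = div.get('lang')     # the lang attribute value

# .string is only set when the tag wraps a single string
# (directly, or through a chain of single children).
strong_text = div.strong.string
div_string = div.string         # None: the nested <a> has several children

# .get_text() joins every nested string, recursively.
all_text = div.get_text(separator=' ', strip=True)

print(class_attr, lang_attr, strong_text, div_string, all_text)
```

This is why printing item.get('class') in the question shows the class names rather than the language or the page count.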

Here, I'd search for div elements that have a lang attribute, then use the contained elements to find the strings:

for item in soup.find_all('div', lang=True):
    # Skip divs that lack a <strong> or <small> descendant.
    if not (item.strong and item.small):
        continue
    language = item.strong.string
    pages = item.small.string
    print('{}: {}'.format(language, pages))

Demo:

>>> from bs4 import BeautifulSoup
>>> sample = '''\
... <div class="central-featured-lang lang1" lang="en">
... <a class="link-box" href="//en.wikibooks.org/">
... <strong>English</strong><br>
... <em>Open-content textbooks</em><br>
... <small>51 000+ pages</small></a>
... </div>
... '''
>>> soup = BeautifulSoup(sample, 'html.parser')
>>> for item in soup.find_all('div', lang=True):
...     if not (item.strong and item.small):
...         continue
...     language = item.strong.string
...     pages = item.small.string
...     print('{}: {}'.format(language, pages))
... 
English: 51 000+ pages

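【Comments】:

Thanks for your answer. Unfortunately, this code throws AttributeError: 'NoneType' object has no attribute 'string' for both language and pages.
@szkodnik: Then your HTML has <div lang=".."> elements elsewhere that don't have <strong> or <small> child elements. You can guard against that; I'll update.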
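
To illustrate the point in the comment above: accessing a child tag that doesn't exist (for example item.strong) gives None, and calling .string on None raises the AttributeError. A minimal sketch, assuming a made-up second div (lang="de") with no <strong> or <small> children, and the built-in 'html.parser':

```python
from bs4 import BeautifulSoup

# The second div is hypothetical; it stands in for the "elsewhere" elements
# that have a lang attribute but no <strong>/<small> children.
html = '''
<div lang="en"><a href="//en.wikibooks.org/"><strong>English</strong><br>
<small>51 000+ pages</small></a></div>
<div lang="de">no strong or small here</div>
'''
soup = BeautifulSoup(html, 'html.parser')

divs = soup.find_all('div', lang=True)
results = []
for item in divs:
    # item.strong / item.small are None when no such descendant exists,
    # so check before touching .string.
    if item.strong is None or item.small is None:
        continue
    results.append('{}: {}'.format(item.strong.string, item.small.string))

print(results)
```

Without the check, the loop would fail on the second div, which matches the error reported above.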