Python - Using bs4 for nested html tags

【Question title】: Python - Using bs4 for nested html tags
【Posted】: 2017-08-11 02:35:07
【Question】:

I need to print the words USA and Canada from this HTML code:

<div class="txt-block">
    <h4 class="inline">Country:</h4>
    <a href="/search/title?country_of_origin=us&amp;ref_=tt_dt_dt" itemprop="url">USA</a>
    <span class="ghost">|</span>
    <a href="/search/title?country_of_origin=ca&amp;ref_=tt_dt_dt" itemprop="url">Canada</a>
</div>

How can I get these words with bs4? I googled it but did not find anything useful.

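【Comments on the question】:

What exactly have you tried, and what is the problem with it? If the answer looks like it is probably "nothing", then go away and change that.
What have you tried so far? What is your bs4 code?

Tags: python html python-3.x beautifulsoup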

【Solution 1】:

If this is all the HTML you have, you can call get_text on each a tag. Try this:

from bs4 import BeautifulSoup
html="""<div class="txt-block">
    <h4 class="inline">Country:</h4>
        <a href="/search/title?country_of_origin=us&amp;ref_=tt_dt_dt" itemprop="url">USA</a>
              <span class="ghost">|</span>
        <a href="/search/title?country_of_origin=ca&amp;ref_=tt_dt_dt" itemprop="url">Canada</a>
    </div>"""
soup = BeautifulSoup(html, 'html.parser')
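# Collect the text of every <a> tag; print it so the result is also visible when run as a script.
print([atag.get_text() for atag in soup.find_all('a')])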
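
On a full page, `soup.find_all('a')` would also return every unrelated link. The following is a minimal sketch of one way to restrict the search to the Country block, assuming the block can be identified by its `h4` heading text. The extra "Home" link and the shortened `href` values are made up for illustration and are not part of the question.

```python
from bs4 import BeautifulSoup

# Hypothetical page fragment: the Country block from the question plus one
# unrelated link, to show why scoping the search matters.
html = """
<a href="/home">Home</a>
<div class="txt-block">
    <h4 class="inline">Country:</h4>
    <a href="/search/title?country_of_origin=us" itemprop="url">USA</a>
    <span class="ghost">|</span>
    <a href="/search/title?country_of_origin=ca" itemprop="url">Canada</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Find the <h4> whose text is exactly "Country:", then step up to its parent <div>.
heading = soup.find("h4", string="Country:")
block = heading.parent

# Only the <a> tags nested inside that <div> are searched.
countries = [a.get_text(strip=True) for a in block.find_all("a")]
print(countries)
```

Because `find_all` is called on `block` rather than on `soup`, the "Home" link outside the `div` is not included.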

【Solution 2】:

To get the text, the following code will work:

from bs4 import BeautifulSoup
html_string = """<div class="txt-block">
    <h4 class="inline">Country:</h4>
        <a href="/search/title?country_of_origin=us&amp;ref_=tt_dt_dt" itemprop="url">USA</a>
              <span class="ghost">|</span>
        <a href="/search/title?country_of_origin=ca&amp;ref_=tt_dt_dt" itemprop="url">Canada</a>
    </div>"""

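# Name the parser explicitly; without it bs4 picks one itself and emits a warning.
soup = BeautifulSoup(html_string, 'html.parser')
# Match only <a> tags that carry itemprop="url".
print([node.string for node in soup.find_all('a', attrs={"itemprop": "url"})])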

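The code above prints the following (under Python 2 each string is shown with a u prefix, as in [u'USA', u'Canada']):

['USA', 'Canada']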

See the BeautifulSoup documentation for details. It is straightforward and easy to use.
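
This answer reads `.string`, while Solution 1 calls `get_text()`. For the HTML in the question both give the same result, but they differ once a link contains a nested tag. The sketch below illustrates that difference on made-up markup in which the second link wraps part of its text in `<b>`; that markup is an assumption for demonstration only.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second link contains a nested <b> tag.
html = '<a itemprop="url">USA</a><a itemprop="url"><b>Can</b>ada</a>'
soup = BeautifulSoup(html, "html.parser")
links = soup.find_all("a", attrs={"itemprop": "url"})

# .string is None when a tag has more than one child node.
strings = [a.string for a in links]
# get_text() concatenates all text found anywhere inside the tag.
texts = [a.get_text() for a in links]

print(strings)
print(texts)
```

If the links may contain further markup, `get_text()` is the safer choice of the two.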

You can also do this with lxml, which is often much faster than BeautifulSoup (this answer puts it at about an order of magnitude).

from lxml import html
html_string = """<div class="txt-block">
    <h4 class="inline">Country:</h4>
        <a href="/search/title?country_of_origin=us&amp;ref_=tt_dt_dt" itemprop="url">USA</a>
              <span class="ghost">|</span>
        <a href="/search/title?country_of_origin=ca&amp;ref_=tt_dt_dt" itemprop="url">Canada</a>
    </div>"""

root = html.fromstring(html_string)
print(root.xpath('//a[@itemprop="url"]//text()'))

This also prints:

['USA', 'Canada']

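【Solution 3】:

The findAll method (the older camelCase alias of find_all) is enough to extract the country names on its own. Here is the solution code in Python 3: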

from bs4 import BeautifulSoup
html ="""
<div class="txt-block">
    <h4 class="inline">Country:</h4>
    <a href="/search/title?country_of_origin=us&amp;ref_=tt_dt_dt" itemprop="url">USA</a>
    <span class="ghost">|</span>
    <a href="/search/title?country_of_origin=ca&amp;ref_=tt_dt_dt" itemprop="url">Canada</a>
</div>
"""
soup = BeautifulSoup(html,"html.parser")
for i in soup.findAll("a"):
    print(i.text)

Running the code above gives the desired result:

USA
Canada
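
As a variation on the loop above, the same links can be picked with a CSS selector through `select`, which expresses the nesting (an `a` inside `div.txt-block`) in a single string. This is a minimal sketch, assuming a bs4 version with CSS selector support installed. The shortened `href` values are placeholders.

```python
from bs4 import BeautifulSoup

html = """
<div class="txt-block">
    <h4 class="inline">Country:</h4>
    <a href="/search/title?country_of_origin=us" itemprop="url">USA</a>
    <span class="ghost">|</span>
    <a href="/search/title?country_of_origin=ca" itemprop="url">Canada</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# "div.txt-block a" matches every <a> nested anywhere inside the div;
# the attribute filter keeps only links marked itemprop="url".
links = soup.select('div.txt-block a[itemprop="url"]')
names = [link.get_text(strip=True) for link in links]

# Join the names into one line instead of printing them one by one.
joined = ", ".join(names)
print(joined)
```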
