BeautifulSoup: exclude content within certain tag(s) (answers)

【Question title】: BeautifulSoup exclude content within a certain tag(s)
【Posted】: 2015-02-20 23:55:35
【Question】:

I have the following expression to find the text in a paragraph:

soup.find("td", { "id" : "overview-top" }).find("p", { "itemprop" : "description" }).text

How can I exclude all text inside the <a> tag? Something like "in <p> but not in <a>"?


Tags: python html beautifulsoup html-parsing lxml

【Solution 1】:

Find all text nodes inside the p tag and join them, keeping only those whose parent is not an a tag:

p = soup.find("td", {"id": "overview-top"}).find("p", {"itemprop": "description"})

# Join every text node under <p> whose direct parent is not an <a> tag
print(''.join(text for text in p.find_all(text=True)
              if text.parent.name != "a"))

Demo (note that "link text" is not printed):

>>> from bs4 import BeautifulSoup
>>> 
>>> data = """
... <td id="overview-top">
...     <p itemprop="description">
...         text1
...         <a href="google.com">link text</a>
...         text2
...     </p>
... </td>
... """
>>> soup = BeautifulSoup(data, "html.parser")
>>> p = soup.find("td", {"id": "overview-top"}).find("p", {"itemprop": "description"})
>>> print(p.text)

        text1
        link text
        text2
>>>
>>> print(''.join(text for text in p.find_all(text=True) if text.parent.name != "a"))

        text1

        text2
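
One caveat with the parent check above: it only looks at the direct parent, so text nested deeper inside the link (for example `<a><b>link text</b></a>`) would still be included. A minimal sketch of an ancestor-aware variant follows. The helper names `has_excluded_ancestor` and `text_excluding` and the sample markup are hypothetical, not part of the original answer, and the `string=True` argument assumes a reasonably recent Beautiful Soup 4 release (older versions spell it `text=True`).

```python
from bs4 import BeautifulSoup


def has_excluded_ancestor(node, root, excluded):
    """Return True if `node` sits inside an excluded tag somewhere below `root`."""
    for parent in node.parents:
        if parent is root:
            # Reached the element we started from; nothing excluded in between.
            return False
        if parent.name in excluded:
            return True
    return False


def text_excluding(element, excluded=("a",)):
    """Join the text nodes under `element`, skipping those inside excluded tags."""
    return "".join(
        node
        for node in element.find_all(string=True)
        if not has_excluded_ancestor(node, element, excluded)
    )


# Hypothetical sample markup: the link text is wrapped in an extra <b> tag.
html = '<p itemprop="description">text1 <a href="google.com"><b>link text</b></a> text2</p>'
soup = BeautifulSoup(html, "html.parser")
p = soup.find("p", {"itemprop": "description"})

result = text_excluding(p, excluded={"a"})
print(" ".join(result.split()))  # collapse the leftover whitespace
```

Passing several names, such as `excluded={"a", "span"}`, should cover the "certain tag(s)" case from the question title in the same way.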


【Solution 2】:

Using lxml:

import lxml.html as LH

data = """
<td id="overview-top">
    <p itemprop="description">
        text1
        <a href="google.com">link text</a>
        text2
    </p>
</td>
"""

root = LH.fromstring(data)
print(''.join(root.xpath(
    '//td[@id="overview-top"]//p[@itemprop="description"]/text()')))

yields

        text1

        text2

To also get the text of the child tags of <p>, use a double forward slash, //text(), instead of a single one:

print(''.join(root.xpath(
    '//td[@id="overview-top"]//p[@itemprop="description"]//text()')))

yields

        text1
        link text
        text2
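
The two XPath forms above are all-or-nothing: `/text()` keeps only the direct text children of <p>, while `//text()` keeps every descendant text node. If the goal is to keep text from other child tags but still drop anything inside a link, one possible middle ground is an `ancestor::` predicate. The snippet below is a sketch under that assumption; the sample markup (with an extra `<b>` and `<i>`) is hypothetical and not from the original answer.

```python
import lxml.html as LH

# Hypothetical sample: text inside <i> should stay, text inside <a> should go.
data = (
    '<div><p itemprop="description">text1 '
    '<a href="google.com"><b>link text</b></a> '
    '<i>text2</i></p></div>'
)

root = LH.fromstring(data)

# Select every text node under the <p> that has no <a> ancestor.
texts = root.xpath('//p[@itemprop="description"]//text()[not(ancestor::a)]')

# Collapse runs of whitespace left behind by the removed link.
cleaned = " ".join("".join(texts).split())
print(cleaned)
```

Note that `ancestor::a` looks all the way up the tree, so this form would also drop everything if the <p> itself happened to sit inside a link.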
