Using page text to select `html` element using `Beautiful Soup`

【Question title】: Using page text to select `html` element using `Beautiful Soup`
【Posted】: 2014-11-27 09:34:21
【Question】:

I have a page that contains several repeated <div...><h4>...<p>... blocks, for example:

html = '''
<div class="proletariat">
<h4>sickle</h4>
<p>Ignore this text</p>
</div>
<div class="proletariat">
<h4>hammer</h4>
<p>This is the text we want</p>
</div>
'''

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html.parser")  # name the parser explicitly to avoid the "no parser specified" warning

If I write print(soup.select('div[class^="proletariat"] > h4 ~ p')), I get:

[<p>Ignore this text</p>, <p>This is the text we want</p>]

How can I specify that I only want the text of the p that is preceded by <h4>hammer</h4>?

Thanks

Tags: python html css-selectors beautifulsoup

【Solution 1】:

html = '''
<div class="proletariat">
<h4>sickle</h4>
<p>Ignore this text</p>
</div>
<div class="proletariat">
<h4>hammer</h4>
<p>This is the text we want</p>
</div>
'''
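import re
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html.parser")  # name the parser explicitly

# Find the <h4> whose text matches "hammer" (string= replaces the older text= argument).
# Its first next_sibling is the newline text node; the sibling after that is the <p>.
print(soup.find("h4", string=re.compile('hammer')).next_sibling.next_sibling.text)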
This is the text we want

【Discussion】:

No worries, you're welcome. You will need to adapt it a little, but next_sibling is what you need.
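
The adjustment mentioned in that comment comes from whitespace: the newline between </h4> and <p> is itself a text node, so a single next_sibling on the <h4> lands on that newline rather than on the <p>. A minimal sketch of one way to sidestep this is find_next_sibling("p"), which skips text nodes and returns the next matching tag. The helper name text_after_heading is hypothetical and is used here only for illustration.

```python
from bs4 import BeautifulSoup

html = '''
<div class="proletariat">
<h4>sickle</h4>
<p>Ignore this text</p>
</div>
<div class="proletariat">
<h4>hammer</h4>
<p>This is the text we want</p>
</div>
'''

soup = BeautifulSoup(html, "html.parser")


def text_after_heading(soup, heading_text):
    """Hypothetical helper: text of the first <p> sibling that follows
    the <h4> whose string is exactly heading_text, or None if absent."""
    h4 = soup.find("h4", string=heading_text)
    if h4 is None:
        return None
    p = h4.find_next_sibling("p")  # skips the whitespace text node
    return p.text if p is not None else None


h4 = soup.find("h4", string="hammer")
print(repr(str(h4.next_sibling)))          # the newline text node, not the <p>
print(text_after_heading(soup, "hammer"))  # text of the <p> after <h4>hammer</h4>
```

Passing a plain string to string= asks for an exact match on the heading text, whereas the re.compile('hammer') pattern in the answer also matches headings that merely contain "hammer".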

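【Solution 2】:

A :contains() pseudo-class would help here, but select() did not support it when this answer was written.

With that in mind, you can combine select() with find_next_sibling():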

print(next(h4.find_next_sibling('p').text
           for h4 in soup.select('div[class^="proletariat"] > h4')
           if h4.text == "hammer"))
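
For readers on a newer installation: later Beautiful Soup releases delegate select() to the soupsieve package, which accepts a non-standard :-soup-contains() pseudo-class. The sketch below assumes soupsieve 2.1 or newer is installed; this is an assumption about the environment, not something the original answer relied on. It shows the text filter and the sibling step expressed in a single selector.

```python
from bs4 import BeautifulSoup

html = '''
<div class="proletariat">
<h4>sickle</h4>
<p>Ignore this text</p>
</div>
<div class="proletariat">
<h4>hammer</h4>
<p>This is the text we want</p>
</div>
'''

soup = BeautifulSoup(html, "html.parser")

# Assumption: soupsieve >= 2.1, which provides :-soup-contains().
# "+" selects the <p> that immediately follows the matching <h4>;
# whitespace text nodes between the two tags are ignored by the combinator.
p = soup.select_one('div.proletariat > h4:-soup-contains("hammer") + p')
result = p.text if p is not None else None
print(result)
```

Note that :-soup-contains() performs a substring match, so it would also select a heading such as "sledgehammer". The generator expression above, which compares h4.text == "hammer", is the stricter of the two.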
