python - How to get all the tags before a certain text in a webpage with beautifulsoup? - Answers

【Question title】: python - How to get all the tags before a certain text in a webpage with beautifulsoup?
【Posted】: 2017-11-26 15:39:40
【Question description】:

My page has a lot of <p> tags. I want to collect all the <p> tags that appear before a certain unique text in the page. How can I do this?

<p>p1</p>
<p>p2</p>
<p>p3</p>
<span class="zls" id=".B1.D9.87.D8.A7.DB.8C_.D9.88.D8.A"> certain unique text </span>
<p>p4</p>
<p>p5</p>

So I want to get the list [p1, p2, p3], but I don't want p4 and p5.

【Comments】:

Tags: python html parsing beautifulsoup web-crawler

【Solution 1】:

You can pass a function to find_all that selects a "p" tag only if none of its preceding sibling tags contains the specific text, for example:

from bs4 import BeautifulSoup

html = '''
<p>p1</p>
<p>p2</p> 
<p>p3</p>
<span class="zls" id=".B1.D9.87.D8.A7.DB.8C_.D9.88.D8.A"> certain unique text </span>
<p>p4</p>
<p>p5</p>
'''
soup = BeautifulSoup(html, 'html.parser')

def select_tags(tag, text='certain unique text'):
    # Keep a <p> tag only if none of its preceding sibling tags contains the marker text.
    return tag.name == 'p' and all(text not in t.text for t in tag.find_previous_siblings())

print(soup.find_all(select_tags))

Output:

[<p>p1</p>, <p>p2</p>, <p>p3</p>]
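The filter above only looks at siblings, and it rescans the preceding siblings for every candidate tag. As a rough alternative sketch (not part of the original answer), you could locate the marker text once and walk backwards from it with find_all_previous. The helper name p_tags_before is hypothetical, and this version assumes the marker text is not itself inside a <p> tag, since an enclosing <p> would otherwise be included in the result.

```python
from bs4 import BeautifulSoup

html = '''
<p>p1</p>
<p>p2</p>
<p>p3</p>
<span class="zls" id=".B1.D9.87.D8.A7.DB.8C_.D9.88.D8.A"> certain unique text </span>
<p>p4</p>
<p>p5</p>
'''
soup = BeautifulSoup(html, 'html.parser')


def p_tags_before(soup, text):
    """Hypothetical helper: return every <p> tag that starts before `text`."""
    # Find the first text node that contains the marker text.
    marker = soup.find(string=lambda s: s is not None and text in s)
    if marker is None:
        return []
    # find_all_previous() yields matches nearest-first, so reverse the
    # result to get the tags back in document order.
    return list(reversed(marker.find_all_previous('p')))


print([p.get_text() for p in p_tags_before(soup, 'certain unique text')])
```

Unlike the sibling-based filter, this also picks up <p> tags nested at other levels of the tree, which may or may not be what you want.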

【Comments】:

【Solution 2】:

In addition to what t.m.adam has already shown, you can also get the text of the p tags that appear before the element with class zls, like this:

from bs4 import BeautifulSoup

html_content = '''
<t>p0</t>
<y>p00</y> 
<p>p1</p>
<p>p2</p> 
<p>p3</p>
<span class="zls" id=".B1.D9.87.D8.A7.DB.8C_.D9.88.D8.A"> certain unique text </span>
<p>p4</p>
<p>p5</p>
'''
soup = BeautifulSoup(html_content, 'lxml')

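# find_previous_siblings() walks backwards from the marker,
# so the collected texts come out in reverse document order.
for items in soup.select(".zls"):
    tag_items = [item.text for item in items.find_previous_siblings() if item.name == "p"]
    print(tag_items)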

Output:

['p3', 'p2', 'p1']
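The list comes back nearest-first because find_previous_siblings walks backwards from the marker. If you need the [p1, p2, p3] order from the question, one possible sketch is to reverse the result. The variable names marker and texts_in_order are illustrative, and html.parser is used here only so the example does not depend on lxml being installed.

```python
from bs4 import BeautifulSoup

html_content = '''
<t>p0</t>
<y>p00</y>
<p>p1</p>
<p>p2</p>
<p>p3</p>
<span class="zls" id=".B1.D9.87.D8.A7.DB.8C_.D9.88.D8.A"> certain unique text </span>
<p>p4</p>
<p>p5</p>
'''
soup = BeautifulSoup(html_content, 'html.parser')

# Take the first element with class "zls" as the marker.
marker = soup.select_one('.zls')

texts_in_order = []
if marker is not None:
    # Passing 'p' restricts the siblings to <p> tags; reversed() restores
    # document order.
    texts_in_order = [p.get_text() for p in reversed(marker.find_previous_siblings('p'))]

print(texts_in_order)
```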

【Comments】: