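【Posted】: 2016-12-04 16:47:55
【Question】:
I am trying to extract quotes from the 2012 Obama-Romney presidential debates. The problem is that the site is poorly organized, so the structure looks like this: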
<span class="displaytext">
<p>
<i>OBAMA</i>Obama's first quotes
</p>
<p>More quotes from Obama</p>
<p>Some more Obama quotes</p>
<p>
<i>Moderator</i>Moderator's quotes
</p>
<p>Some more quotes</p>
<p>
<i>ROMNEY</i>Romney's quotes
</p>
<p>More quotes from Romney</p>
<p>Some more Romney quotes</p>
</span>
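Is there a way to select a <p> whose first child is an <i> with the text OBAMA, together with all of the <p> siblings that follow it, stopping at the next <p> whose first child is an <i> whose text is not OBAMA?
Here is what I have tried so far, but it only grabs the first <p> and ignores its siblings: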
from bs4 import BeautifulSoup

input = '''<span class="displaytext">
<p>
<i>OBAMA</i>Obama's first quotes
</p>
<p>More quotes from Obama</p>
<p>Some more Obama quotes</p>
<p>
<i>Moderator</i>Moderator's quotes
</p>
<p>Some more quotes</p>
<p>
<i>ROMNEY</i>Romney's quotes
</p>
<p>More quotes from Romney</p>
<p>Some more Romney quotes</p>
</span>'''
soup = BeautifulSoup(input, "html.parser")
debate_text = soup.find("span", {"class": "displaytext"})
president_quotes = debate_text.find_all("i", string="OBAMA")
for i in president_quotes:
    # next_siblings only walks the nodes that share a parent <p> with this <i>
    siblings = i.next_siblings
    for sibling in siblings:
        print(sibling)
This prints only Obama's first quotes.
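One possible way to get the selection described above, offered as a sketch rather than a definitive answer: walk the <p> elements that are direct children of the span in order, remember the most recent speaker label, and keep a paragraph only while that label matches. The helper name `quotes_by_speaker` and the variable `html` are made up for illustration. The sketch assumes that any <i> found inside a <p> is a speaker label, which holds for the sample markup but may not hold for the real page.

```python
from bs4 import BeautifulSoup

html = '''<span class="displaytext">
<p>
<i>OBAMA</i>Obama's first quotes
</p>
<p>More quotes from Obama</p>
<p>Some more Obama quotes</p>
<p>
<i>Moderator</i>Moderator's quotes
</p>
<p>Some more quotes</p>
<p>
<i>ROMNEY</i>Romney's quotes
</p>
<p>More quotes from Romney</p>
<p>Some more Romney quotes</p>
</span>'''


def quotes_by_speaker(markup, speaker):
    """Return the text of each <p> attributed to `speaker` (hypothetical helper)."""
    soup = BeautifulSoup(markup, "html.parser")
    container = soup.find("span", class_="displaytext")
    current = None  # label of the speaker whose block we are currently inside
    quotes = []
    # recursive=False restricts the search to direct <p> children of the span
    for p in container.find_all("p", recursive=False):
        label = p.find("i")
        if label is not None:
            # Assumption: an <i> inside a <p> always marks a change of speaker
            current = label.get_text(strip=True)
            label.extract()  # drop the label so only the quote text remains
        if current == speaker:
            quotes.append(p.get_text(strip=True))
    return quotes


obama_quotes = quotes_by_speaker(html, "OBAMA")
print(obama_quotes)
```

With the sample markup this should collect the three Obama paragraphs and stop at the Moderator label. Passing "ROMNEY" instead collects the last three paragraphs in the same way.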
【Comments】:
Tags: python python-3.x web-scraping beautifulsoup