【Posted】: 2020-01-22 20:17:01
【Problem description】:
I am trying to scrape a website. My sample HTML looks like this:
<div class="ism-true"><!-- message -->
<div id="post_message_5437898" data-spx-slot="1">
OK, although it's been several weeks since I installed the
<div><label>Quote:</label></div>
<div class="panel alt2" style="border:1px inset">
<div>
Originally Posted by <strong>DeltaNu1142</strong>
</div>
<div style="font-style:italic">The very first thing I did </div>
</div>
</div>When I got my grille back from the paint shop, I went to work on the
</div>
<!-- / message --></div>
<div class="ism-true"><!-- message -->
<div id="post_message_5125716">
<div style="margin:1rem; margin-top:0.3rem;">
<div><label>Quote:</label></div>
<div class="panel alt2" style="border:1px inset">
<div>
Originally Posted by <strong>HCFX2013</strong>
</div>
<div style="font-style:italic">I must be the minority that absolutely can't .</div>
</div>
</div>Hello World.
</div>
<!-- / message --></div>
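I want only the text that is inside the post message div but not inside the "panel alt2" class. The position of that class within div id="post_message_..." keeps changing. How can I ignore the text inside the "panel alt2" class?
My code: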
text = []
for item in soup.findAll('div', attrs={"class": "ism-true"}):
    result = [item.get_text(strip=True, separator=" ")]
    div = item.find('div', class_="panel alt2")
    if div:
        result[0] = ' '.join(result[0].split(div.text.split()[-1])[1:])
        text.append(result[0])
    else:
        text.append(result)
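The code above only gives me the text when "panel alt2" is the first class inside the div. If the position of the class changes, it no longer works and throws "list index out of range". Can you help me ignore these classes? The expected result is: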
[OK, although it's been several weeks. When I got my grille back from the paint shop, I went to work on the],[Hello world]
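One possible approach, sketched here rather than offered as the definitive fix: instead of splitting the combined text on the last word of the quote, remove every "panel alt2" element from the parsed tree first and only then call get_text(), so the position of the quote inside the post no longer matters. The names SAMPLE_HTML and extract_post_texts are hypothetical, the sample is a shortened copy of the HTML above, and dropping the "Quote:" label is an assumption based on the expected result.

```python
from bs4 import BeautifulSoup

# Shortened copy of the two sample posts from the question.
SAMPLE_HTML = """
<div class="ism-true"><!-- message -->
<div id="post_message_5437898" data-spx-slot="1">
OK, although it's been several weeks since I installed the
<div><label>Quote:</label></div>
<div class="panel alt2" style="border:1px inset">
<div>
Originally Posted by <strong>DeltaNu1142</strong>
</div>
<div style="font-style:italic">The very first thing I did </div>
</div>
</div>When I got my grille back from the paint shop, I went to work on the
</div>
<!-- / message --></div>
<div class="ism-true"><!-- message -->
<div id="post_message_5125716">
<div style="margin:1rem; margin-top:0.3rem;">
<div><label>Quote:</label></div>
<div class="panel alt2" style="border:1px inset">
<div>
Originally Posted by <strong>HCFX2013</strong>
</div>
<div style="font-style:italic">I must be the minority that absolutely can't .</div>
</div>
</div>Hello World.
</div>
<!-- / message --></div>
"""


def extract_post_texts(html):
    """Return the text of each post with quoted panels removed (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    texts = []
    for item in soup.select("div.ism-true"):
        # Remove quote panels one at a time; re-querying after each removal
        # also copes with a quote nested inside another quote.
        panel = item.select_one("div.panel.alt2")
        while panel is not None:
            panel.decompose()
            panel = item.select_one("div.panel.alt2")
        # Assumption: the "Quote:" label is not wanted in the output either.
        for label in item.find_all("label"):
            if label.get_text(strip=True) == "Quote:":
                label.decompose()
        texts.append(item.get_text(separator=" ", strip=True))
    return texts


if __name__ == "__main__":
    for post_text in extract_post_texts(SAMPLE_HTML):
        print(post_text)
```

Because the quote is deleted from the tree before any text is read, no string splitting or list indexing is involved, so the "list index out of range" error cannot come from this step. Note that decompose() modifies the soup in place; parse the HTML again if the quoted text is needed later.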
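【Comments】:
-
I think your HTML is malformed: "Hello world" cannot be reached because it is surrounded by closing tags.
-
I have edited my HTML.
-
What exactly do you want to get from that site?
-
@anonymous13 See my edit.
Tags: python web-scraping beautifulsoup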