【Question title】: Skip a div class within a div when web scraping
【Posted】: 2020-01-22 20:17:01
【Question】:

I am trying to scrape a website, and my sample html looks like this:

<div class="ism-true"><!-- message -->
                    <div id="post_message_5437898" data-spx-slot="1">

                        OK, although it's been several weeks since I installed the 

    <div><label>Quote:</label></div>
    <div class="panel alt2" style="border:1px inset">

        <div>
            Originally Posted by <strong>DeltaNu1142</strong>
        </div>
        <div style="font-style:italic">The very first thing I did </div>

    </div>
</div>When I got my grille back from the paint shop, I went to work on the
                    </div>
                    <!-- / message --></div>

<div class="ism-true"><!-- message -->
                    <div id="post_message_5125716">

                        <div style="margin:1rem; margin-top:0.3rem;">
    <div><label>Quote:</label></div>
    <div class="panel alt2" style="border:1px inset">

        <div>
            Originally Posted by <strong>HCFX2013</strong>
        </div>
        <div style="font-style:italic">I must be the minority that absolutely can't .</div>

    </div>
</div>Hello World.
                    </div>
                    <!-- / message --></div>

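I want only the text that is inside the post message div but not inside the "panel alt2" class. The position of that class within "div id="post_message_" keeps changing. How can I ignore the text within the panel alt2 class?

My code: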

text = []
for item in soup.findAll('div',attrs={"class":"ism-true"}):
    result = [item.get_text(strip=True, separator=" ")]
    div = item.find('div', class_="panel alt2")
    if div :
        result[0] = ' '.join(result[0].split(div.text.split()[-1])[1:])
        text.append(result[0])
    else:
        text.append(result)

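The code above only gives me the text when "panel alt2" is the first class inside the div. If the position of the class changes, it stops working and throws a "list index out of range" error. Can you help me ignore these classes? The expected result is: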

[OK, although it's been several weeks. When I got my grille back from the paint shop, I went to work on the],[Hello world]

Sample site (https://www.f150forum.com/f118/fab-fours-black-steel-elite-bumper-adaptive-cruise-relocation-bracket-387234/)

【Comments】:

  • I think your html is malformed: "Hello world" cannot be reached because it is surrounded by closing tags.
  • I have edited my html.
  • What exactly do you want to get from that site?
  • @anonymous13 See my edit.

Tags: python web-scraping beautifulsoup


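【Solution 1】:

One workable approach is to call extract() on the div with the panel alt2 class and on the label tag, which removes them from the parent div before its text is read. The following code seems to work for the site as well as for your sample html.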

import requests
from bs4 import BeautifulSoup
URL = 'https://www.f150forum.com/f118/fab-fours-black-steel-elite-bumper-adaptive-cruise-relocation-bracket-387234/'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
text = []
for div in soup.find_all('div', class_="ism-true"):
    try:
        div.find('div', class_="panel alt2").extract()
    except AttributeError:
        pass  # sometimes there is no 'panel alt2'
    try:
        div.find('label').extract()
    except AttributeError:
        pass  # sometimes there is no 'Quote'
    text.append(div.text.strip())

print(text)

Output for your sample:

["OK, although it's been several weeks since I installed the \n\n    \n\nWhen I got my grille back from the paint shop, I went to work on the", 'Hello World.']

If you don't need them, you can remove the newline characters.
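A minimal sketch of that cleanup, assuming you simply want every run of whitespace (newlines included) collapsed into a single space. The helper name collapse_whitespace is made up for illustration and is not part of the original answer.

```python
def collapse_whitespace(s):
    """Replace every run of whitespace (spaces, tabs, newlines) with one space."""
    # str.split() with no arguments splits on any whitespace run and
    # drops leading/trailing whitespace, so joining with " " normalizes it.
    return " ".join(s.split())


raw = ("OK, although it's been several weeks since I installed the \n\n    \n\n"
       "When I got my grille back from the paint shop, I went to work on the")
print(collapse_whitespace(raw))
```

Applied to the list from the answer, this would be something like text = [collapse_whitespace(t) for t in text].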

【Comments】:

  • This returns: ["OK, although it's been several weeks since I installed the \n \n Quote:\n\n\n Originally Posted by DeltaNu1142\n\nThe very first thing I did", "Quote:\n\n\n Originally Posted by HCFX2013\n\nI must be the minority that absolutely can't .\n\nHello World."] which is more than the OP asked for.
  • @JuanC Thanks for bringing this to my attention. I did not try it with the html he posted here; I went directly to the link. I was specifically answering this part of the question: I want text which is only in post message class but not in "panel alt2" class. The position of class within "div id="post_message_" keeps changing. How can I ignore the text with in the panel alt2 class.
  • @BittoBennichan It works fine when the quote is in the middle of the text, but when the quote is at the beginning of the text it just ignores the whole comment - like the comments in the link (f150forum.com/f118/…)
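The cause of the behavior reported in the last comment is not established in this thread. One variation that may be worth trying, sketched below, removes every quote panel and label in a post (not just the first one found) with decompose(), then reads the remaining text with get_text(). The SAMPLE_HTML string is a hypothetical, well-formed stand-in for the forum markup, and the function name posts_without_quotes is invented for illustration; this has not been run against the live page.

```python
from bs4 import BeautifulSoup

# Hypothetical, well-formed stand-in for the forum markup (not the live page).
SAMPLE_HTML = """
<div class="ism-true">
  <div id="post_message_1">
    Text before the quote.
    <div style="margin:1rem">
      <div><label>Quote:</label></div>
      <div class="panel alt2"><div>Originally Posted by <strong>UserA</strong></div><div>Quoted text.</div></div>
    </div>
    Text after the quote.
  </div>
</div>
<div class="ism-true">
  <div id="post_message_2">
    <div style="margin:1rem">
      <div><label>Quote:</label></div>
      <div class="panel alt2"><div>Quoted text.</div></div>
    </div>
    Hello World.
  </div>
</div>
"""


def posts_without_quotes(html):
    """Return the text of each post with quote panels and labels removed."""
    soup = BeautifulSoup(html, "html.parser")
    posts = []
    for post in soup.find_all("div", class_="ism-true"):
        # The CSS selector matches a div carrying both classes regardless of
        # where it sits inside the post, plus any <label> tag.
        for unwanted in post.select("div.panel.alt2, label"):
            unwanted.decompose()
        # strip=True trims each text fragment and skips empty ones;
        # separator=" " joins the remaining fragments with single spaces.
        posts.append(post.get_text(separator=" ", strip=True))
    return posts


print(posts_without_quotes(SAMPLE_HTML))
```

Because the selector does not depend on the position of the quote, the same loop handles a quote at the start, middle, or end of a post in this sample.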