Soup not locating proper div tag when searched by text

【Question title】: Soup not locating proper div tag when searched by text
【Posted】: 2020-09-30 21:35:06
【Question description】:

This is a trimmed-down version of the actual html, which contains many more tags.

html = '''

<div style="line-height:120%;padding-top:12px;text-align:left;text-indent:24px;font-size:10pt;">

    <font style="font-family:inherit;font-size:10pt;">
        Indicate by checkmark whether the registrant is a shell company (as defined in Rule 12b-2 of the Exchange Act). Yes
    </font>

    <font style="font-family:Wingdings;font-size:10pt;">
        ¨
    </font>

    <font style="font-family:inherit;font-size:10pt;">
        No
    </font>

    <font style="font-family:Wingdings;font-size:10pt;">
        x
    </font>

</div>

<div style="line-height:120%;padding-top:12px;text-align:left;font-size:10pt;">

    <font style="font-family:inherit;font-size:10pt;">
        There were
    </font>

    <font style="font-family:inherit;font-size:10pt;">
        33,012,179
    </font>

    <font style="font-family:inherit;font-size:10pt;">
        shares of common stock, $.01 par value per share, outstanding at
    </font>

    <font style="font-family:inherit;font-size:10pt;">
        July&nbsp;26, 2017
    </font>

    <font style="font-family:inherit;font-size:10pt;">
        .
    </font>

</div>

'''

I am trying to locate a tag based on its text. The text is matched by a regex, and all of it sits inside the div tag.

month_pattern = r'((Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|(Nov|Dec)(?:ember)?)\s?(\d{1,2}\D?)\s?(19[7-9]\d|20\d{2}|\d{2}))'

word_pattern = r'(?=.*common)(?=.*outstanding[.,]?)(?=.*shares[.,]?)(?=.*stock[.,]?)'


pattern = word_pattern + '.*' + month_pattern

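The regex above is a bit complex, but it works when I run it against the text inside the div.

Using the soup code below, I expect to get back a soup object whose parent is the first div tag, but I get an empty list.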

soup = bs(html, 'html.parser')

elem =  soup(text=re.compile(pattern, flags = re.IGNORECASE|re.DOTALL))
print(elem)

Result:

[]

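I suspect the problem is that the div's text is nested further inside <font> tags? However, if I do div.text, all of the text is printed, so I don't know why I am not getting any hits.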

'''There were
    

        33,012,179
    

        shares of common stock, $.01 par value per share, outstanding at
    

        July 26, 2017
    

        .

        '''

Again, the regex is not the problem, because with the re module I get:

print(re.search(pattern,text, flags = re.IGNORECASE|re.DOTALL))

Result:

<_sre.SRE_Match object; span=(0, 142), match='There were\n    \n\n        33,012,179\n    \n\n >

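Expected result:

I want elem to be a non-empty list, so that if I run elem.parent, as in the accepted answer to

Using BeautifulSoup to find a HTML tag that contains certain text

I would be able to extract the first div tag and its inner html, as shown below: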

  <div style="line-height:120%;padding-top:12px;text-align:left;text-indent:24px;font-size:10pt;">
    
        <font style="font-family:inherit;font-size:10pt;">
            Indicate by checkmark whether the registrant is a shell company (as defined in Rule 12b-2 of the Exchange Act). Yes
        </font>
    
        <font style="font-family:Wingdings;font-size:10pt;">
            ¨
        </font>
    
        <font style="font-family:inherit;font-size:10pt;">
            No
        </font>
    
        <font style="font-family:Wingdings;font-size:10pt;">
            x
        </font>
    
    </div>

However, I get an empty list, so iterating over it and calling elem.parent returns nothing.

Thanks.

Here is the complete code for a simple copy and paste:

#testing_html


from bs4 import BeautifulSoup as bs
import re
import os


html = '''

<div style="line-height:120%;padding-top:12px;text-align:left;text-indent:24px;font-size:10pt;">

    <font style="font-family:inherit;font-size:10pt;">
        Indicate by checkmark whether the registrant is a shell company (as defined in Rule 12b-2 of the Exchange Act). Yes
    </font>

    <font style="font-family:Wingdings;font-size:10pt;">
        ¨
    </font>

    <font style="font-family:inherit;font-size:10pt;">
        No
    </font>

    <font style="font-family:Wingdings;font-size:10pt;">
        x
    </font>

</div>

<div style="line-height:120%;padding-top:12px;text-align:left;font-size:10pt;">

    <font style="font-family:inherit;font-size:10pt;">
        There were
    </font>

    <font style="font-family:inherit;font-size:10pt;">
        33,012,179
    </font>

    <font style="font-family:inherit;font-size:10pt;">
        shares of common stock, $.01 par value per share, outstanding at
    </font>

    <font style="font-family:inherit;font-size:10pt;">
        July&nbsp;26, 2017
    </font>

    <font style="font-family:inherit;font-size:10pt;">
        .
    </font>

</div>

'''

text = '''There were
    

        33,012,179
    

        shares of common stock, $.01 par value per share, outstanding at
    

        July 26, 2017
    

        .

        '''


month_pattern = r'((Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|(Nov|Dec)(?:ember)?)\s?(\d{1,2}\D?)\s?(19[7-9]\d|20\d{2}|\d{2}))'

word_pattern = r'(?=.*common)(?=.*outstanding[.,]?)(?=.*shares[.,]?)(?=.*stock[.,]?)'


pattern = word_pattern + '.*' + month_pattern

soup = bs(html, 'html.parser')

elem =  soup(text=re.compile(pattern, flags = re.IGNORECASE|re.DOTALL))
print(elem)

print(re.search(pattern,text, flags = re.IGNORECASE|re.DOTALL))

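【Comments】:

It looks like you are trying to parse EDGAR filings, but beyond that your question is unclear. Given the sample html in the question, what exactly is your expected output?
Hi Jack. elem should not be an empty list. Instead, soup should be able to capture the tag whose text matches the regex. So it is this line: elem = soup(text=re.compile(pattern, flags = re.IGNORECASE|re.DOTALL)) print(elem)
I'm afraid that doesn't answer the question. Given your sample html, if you run print(elem), what do you want the output to be?
@JackFleeting I have updated my OP (see the expected result section). Basically it should return a non-empty value. I'm not sure what text it would return, but it should be non-empty so that I can use elem.parent to get back whichever tag meets the regex criteria.
Let's try this differently - there are two <div> elements in your sample html. Are you trying to get the second one based on its text, and then find its parent? If so, part of the problem is that this element has no parent in your sample html.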

Tags: python-3.x regex beautifulsoup

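【Solution 1】:

I think I understand the problem now...

One problem you have is that your final regex, pattern = word_pattern + '.*' + month_pattern, cannot find the target text because that text is spread across several <font> nodes, so no single node contains the full pattern. In this case, the text is split between two nodes. Those two nodes do share the same grandparent, the <div> in question, so you can call parent twice.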
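To make the point about separate nodes concrete, here is a minimal diagnostic sketch. It assumes a shortened copy of the second <div> and a deliberately simplified stand-in pattern (neither is the asker's exact html or regex). It lists the text nodes that a string= search tests one at a time, and compares that with a search over the div's concatenated text.

```python
import re
from bs4 import BeautifulSoup

# Shortened stand-in for the second <div> in the question (assumption).
html = '''
<div>
    <font>There were</font>
    <font>33,012,179</font>
    <font>shares of common stock, $.01 par value per share, outstanding at</font>
    <font>July&nbsp;26, 2017</font>
    <font>.</font>
</div>
'''

# Simplified stand-in pattern: the key words followed, later on, by a year.
pattern = re.compile(r'common stock.*outstanding.*\b20\d{2}\b',
                     re.IGNORECASE | re.DOTALL)

soup = BeautifulSoup(html, 'html.parser')

# Every <font> holds its own text node; whitespace-only nodes are dropped here.
nodes = [s.strip() for s in soup.find_all(string=True) if s.strip()]

# string= applies the pattern to each text node separately.
node_hits = soup.find_all(string=pattern)

# get_text() joins all descendant text nodes into a single string.
text_hit = pattern.search(soup.div.get_text())

print(len(nodes))            # number of non-empty text nodes
print(len(node_hits))        # nodes that match the pattern on their own
print(text_hit is not None)  # whether the joined text matches
```

The key words live in one node and the date in another, so no individual node can satisfy a pattern that needs both, while the joined text can.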

This can be solved like this:

elem_m =  soup(text=re.compile(month_pattern))
elem_w =  soup(text=re.compile(word_pattern))

if elem_m[0].parent.parent==elem_w[0].parent.parent:
    print((elem_m[0].parent.parent).text.strip())
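A possible variation, sketched under assumptions rather than taken from the answer: instead of matching two text nodes and comparing their grandparents, pass find_all a function that tests the combined pattern against each <div>'s get_text(). The id attributes, the helper name div_matches, and the shortened month pattern below are all hypothetical and exist only to keep the example small.

```python
import re
from bs4 import BeautifulSoup

# Shortened sample; the id attributes are hypothetical, added for readability.
html = '''
<div id="shell">
    <font>Indicate by checkmark whether the registrant is a shell company. Yes</font>
    <font>No</font>
</div>
<div id="shares">
    <font>There were</font>
    <font>33,012,179</font>
    <font>shares of common stock, $.01 par value per share, outstanding at</font>
    <font>July&nbsp;26, 2017</font>
</div>
'''

# Simplified stand-in for word_pattern + '.*' + month_pattern (assumption).
pattern = re.compile(
    r'(?=.*common)(?=.*outstanding)(?=.*shares)(?=.*stock)'
    r'.*\b(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*'
    r'\s\d{1,2},\s(?:19|20)\d{2}',
    re.IGNORECASE | re.DOTALL,
)

def div_matches(tag):
    # Hypothetical helper: accept only <div> tags whose joined text matches.
    return tag.name == 'div' and pattern.search(tag.get_text()) is not None

soup = BeautifulSoup(html, 'html.parser')
matches = soup.find_all(div_matches)
match_ids = [m.get('id') for m in matches]
print(match_ids)
```

This returns the tag itself, so no .parent calls are needed. It does assume the target text always sits inside a <div>, which may not hold for every filing.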

More fundamentally, if you search around you will find that using regex in the context of html/xml is strongly discouraged. To avoid it, I would do this:

key_words = ['common','shares','stock,',"outstanding"]
months = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December'] 

for s in soup.select('*'):   
    words = all(word in s.text  for word in key_words)
    month = any(month in s.text  for month in months)
    if words == True and month == True:
        print(s.text.strip())

The output in both cases is:

There were
   

   33,012,179
   

   shares of common stock, $.01 par value per share, outstanding at
   

   July 26, 2017
   
        .

Good luck parsing EDGAR filings; not the most fun activity I can think of...

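【Comments】:

Yes, the text being spread across multiple nodes seems to be the problem. I was under the impression that Soup would traverse the parent and child tags, since soup.find(div.text) was returning the entire text. I need to test your logic to see whether there are any edge cases. Thank you very much for your patience and help.
Couldn't if words == True and month == True just be if words and month:
@QHarr - Great! It works that way too! For some reason, I never thought of it...
OK. I just had time to check it. The problem now seems to be the opposite: I am getting more than 100 hits. They include html, body, and tags such as type, filename and description, which can be removed. But I also get many p tags, with the one real tag sitting in the middle of all those p tags, which makes it hard to pick the right one. Let me see if I can link you to a filing.
Here is one of the filings: sec.gov/ix?doc=/Archives/edgar/data/6201/000000620120000089/… I downloaded it using requests. If I run your code on it, I get far too many hits.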
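One likely source of the extra hits is that tag.text includes the text of every descendant, so each ancestor of the real match (html, body, any wrapper) also passes the keyword test. A minimal sketch of one way to trim those, assuming a made-up sample document and a hypothetical helper is_hit: keep a tag only if none of its descendant tags is itself a hit.

```python
from bs4 import BeautifulSoup

# Made-up sample with wrapper tags; the ids are hypothetical.
html = '''
<html><body>
<p>Unrelated paragraph.</p>
<div id="wrapper">
  <div id="shares">
    <font>There were</font>
    <font>33,012,179</font>
    <font>shares of common stock, $.01 par value per share, outstanding at</font>
    <font>July&nbsp;26, 2017</font>
  </div>
</div>
</body></html>
'''

key_words = ['common', 'shares', 'stock', 'outstanding']
months = ['January', 'February', 'March', 'April', 'May', 'June', 'July',
          'August', 'September', 'October', 'November', 'December']

def is_hit(tag):
    # Hypothetical helper: same keyword-plus-month test as in the answer.
    text = tag.get_text()
    return all(w in text for w in key_words) and any(m in text for m in months)

soup = BeautifulSoup(html, 'html.parser')

# find_all(True) returns every tag; ancestors of the real match also pass.
all_hits = [t for t in soup.find_all(True) if is_hit(t)]

# Keep only hits that contain no other hit, i.e. the innermost ones.
innermost = [t for t in all_hits
             if not any(is_hit(c) for c in t.find_all(True))]

print([t.name for t in all_hits])
print([t.get('id') for t in innermost])
```

This only removes ancestors. If a real filing contains several unrelated p tags that each satisfy the test on their own, they would all remain, so a stricter test would still be needed.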