【Question title】:BeautifulSoup find_all() Doesn't Find All Requested Elements
【Posted】:2018-03-17 15:11:35
【Question】:

I'm seeing some strange behavior from BeautifulSoup, as shown in the example below.

import re
from bs4 import BeautifulSoup
html = """<p style='color: red;'>This has a <b>color</b> of red. Because it likes the color red</p>
<p class='blue'>This paragraph has a color of blue.</p>
<p>This paragraph does not have a color.</p>"""
soup = BeautifulSoup(html, 'html.parser')
pattern = re.compile('color', flags=re.UNICODE+re.IGNORECASE)
paras = soup.find_all('p', string=pattern)
print(len(paras))  # expected to find 3 paragraphs with the word "color" in them
# 2
print(paras[0].prettify())
# <p class="blue">
#   This paragraph has a color of blue.
# </p>

print(paras[1].prettify())
# <p>
#   This paragraph does not have a color.
# </p>

As you can see, for some reason the first paragraph, <p style='color: red;'>This has a <b>color</b> of red. Because it likes the color red</p>, is not picked up by find_all(...), and I can't figure out why.

【Discussion】:

    Tags: python python-2.7 beautifulsoup


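    【Solution 1】:

    The string argument is matched against each tag's .string attribute, and .string is None whenever the tag has more than one child (for example, text mixed with a child tag). If you print .string for the first p tag, it returns None, because the tag contains a <b> tag alongside its text.

    Or, to explain it better, the documentation says:

    If a tag has only one child, and that child is a NavigableString, the child is made available as .string

    If a tag contains more than one thing, then it's not clear what .string should refer to, so .string is defined to be None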
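
To make the quoted rules concrete, here is a minimal sketch. The HTML snippet and the variable names (`single_text`, `single_tag`, `mixed`, `matches`) are made up for illustration and are not from the question. It shows what .string returns for a tag holding only text, a tag holding a single child tag, and a tag mixing text with a child tag.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup with three shapes of <p>: text only,
# a single child tag, and text mixed with a child tag.
html = "<p>only text</p><p><b>one child tag</b></p><p>text and <b>a tag</b></p>"
soup = BeautifulSoup(html, "html.parser")
single_text, single_tag, mixed = soup.find_all("p")

print(single_text.string)  # only text
print(single_tag.string)   # one child tag (.string follows a sole child tag)
print(mixed.string)        # None
print(mixed.get_text())    # text and a tag

# Because string= is compared against .string, the mixed paragraph
# is skipped even though its visible text contains "tag".
matches = soup.find_all("p", string=re.compile("tag"))
print(mixed in matches)    # False
```

This is the same situation as the first paragraph in the question: its text is split around the <b> tag, so .string is None and the pattern never gets a chance to match.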

    One way around this is to pass a lambda function to find_all() and test the tag's full text instead of .string.

    from bs4 import BeautifulSoup

    html = """<p style='color: red;'>This has a <b>color</b> of red. Because it likes the color red</p>
    <p class='blue'>This paragraph has a color of blue.</p>
    <p>This paragraph does not have a color.</p>"""
    soup = BeautifulSoup(html, 'html.parser')
    
    first_p = soup.find('p')
    print(first_p)
    # <p style="color: red;">This has a <b>color</b> of red. Because it likes the color red</p>
    print(first_p.string)
    # None
    print(first_p.text)
    # This has a color of red. Because it likes the color red
    
    paras = soup.find_all(lambda tag: tag.name == 'p' and 'color' in tag.text.lower())
    print(paras)
    # [<p style="color: red;">This has a <b>color</b> of red. Because it likes the color red</p>, <p class="blue">This paragraph has a color of blue.</p>, <p>This paragraph does not have a color.</p>]
    

    【Comments】:

      【Solution 2】:

      If you just want to grab every 'p' tag and read its text, you can do this:

      import re
      from bs4 import BeautifulSoup
      html = """<p style='color: red;'>This has a <b>color</b> of red. Because it likes the color red</p>
      <p class='blue'>This paragraph has a color of blue.</p>
      <p>This paragraph does not have a color.</p>"""
      soup = BeautifulSoup(html, 'html.parser')
      
      paras = soup.find_all('p')
      for p in paras:
          print(p.get_text())  # full text of the paragraph, including text inside child tags
      

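      【Comments】:

        【Solution 3】:

        I haven't actually figured out why specifying the string argument of find_all(...) (or text in older versions of BeautifulSoup) doesn't give me what I want, but the following does give me a general solution.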

        pattern = re.compile('color', flags=re.UNICODE+re.IGNORECASE)
        desired_tags = [tag for tag in soup.find_all('p') if pattern.search(tag.text) is not None]
        

        【Comments】:

        • I already explained why this happens in my answer above. What you did with a list comprehension can also be done with a lambda.
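
As a rough sketch of how the two answers combine, the regex from Solution 3 can be used inside a function passed to find_all(), which is the same mechanism as the lambda in Solution 1. The function name `has_color_text` is hypothetical and chosen only for this example.

```python
import re
from bs4 import BeautifulSoup

html = """<p style='color: red;'>This has a <b>color</b> of red. Because it likes the color red</p>
<p class='blue'>This paragraph has a color of blue.</p>
<p>This paragraph does not have a color.</p>"""
soup = BeautifulSoup(html, "html.parser")

pattern = re.compile("color", flags=re.IGNORECASE)

def has_color_text(tag):
    """Return True for <p> tags whose full text matches the pattern."""
    # get_text() joins the text of all descendants, so nested tags
    # such as <b> do not hide the paragraph from the search.
    return tag.name == "p" and pattern.search(tag.get_text()) is not None

paras = soup.find_all(has_color_text)
print(len(paras))  # 3
```

A named function does the same job as the lambda; it is only easier to reuse and to document.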