【Question title】:A more efficient way to delete elements whose children contain "English"
【Posted】:2021-04-21 23:36:31
【Question description】:

I have HTML like the following:

<div id="bodyContent" class="content mw-parser-output">
    <div id="mw-content-text" style="direction: ltr;">
        <h1 class="section-heading" tabindex="0" aria-haspopup="true" data-section-id="0">
            <span class="mw-headline" id="title_0">pomme</span>
        </h1>
        
        <details data-level="2" open="">
            <summary class="section-heading"><h2 id="English">English</h2></summary>
            <details data-level="3" open="">abc</details>
        </details>

        <details data-level="2" open="">
            <summary class="section-heading"><h2 id="French">French</h2></summary>
            <details data-level="3" open="">abc</details>
        </details>

        <details data-level="2" open="">
            <summary class="section-heading"><h2 id="Norman">Norman</h2></summary>
            <details data-level="3" open="">abc</details>
        </details>
    </div>
</div>

Each <details data-level="2" open=""> element contains an <h2> element, such as <h2 id="English">English</h2>, that names the language. My goal is to delete every <details data-level="2" open=""> whose language is not English. My expected result is:

<div id="bodyContent" class="content mw-parser-output">
    <div id="mw-content-text" style="direction: ltr;">
        <h1 class="section-heading" tabindex="0" aria-haspopup="true" data-section-id="0">
            <span class="mw-headline" id="title_0">pomme</span>
        </h1>

        <details data-level="2" open="">
            <summary class="section-heading"><h2 id="English">English</h2></summary>
            <details data-level="3" open="">abc</details>
        </details>
    </div>
</div>

I get this result with the following code:

from bs4 import BeautifulSoup

texte = """
<div id="bodyContent" class="content mw-parser-output">
    <div id="mw-content-text" style="direction: ltr;">
        <h1 class="section-heading" tabindex="0" aria-haspopup="true" data-section-id="0">
            <span class="mw-headline" id="title_0">pomme</span>
        </h1>

        <details data-level="2" open="">
            <summary class="section-heading"><h2 id="English">English</h2></summary>
            <details data-level="3" open="">abc</details>
        </details>
    </div>
</div>
"""

soup = BeautifulSoup(texte, 'html.parser')
tmp = soup.select('details > summary > h2')
tmp2 = [s.contents[0] for s in tmp]

for i in range(len(tmp2)):
    if tmp2[i] != 'English':
        tmp[i].find_parent('details').decompose()
        
soup

I need to repeat this operation nearly 4 million times. Is there a more efficient way to do it? Thank you very much for your help!
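
Since the operation has to run on many documents, the loop above could be wrapped in a function that takes one HTML string and returns the cleaned markup. The following is only a sketch: the function name `strip_non_english` and the constant `SAMPLE` are made up for illustration, and the sample input is the three-language snippet from the top of the question.

```python
from bs4 import BeautifulSoup

# Three-language input taken from the top of the question.
SAMPLE = """
<div id="bodyContent" class="content mw-parser-output">
    <div id="mw-content-text" style="direction: ltr;">
        <h1 class="section-heading"><span class="mw-headline" id="title_0">pomme</span></h1>
        <details data-level="2" open="">
            <summary class="section-heading"><h2 id="English">English</h2></summary>
            <details data-level="3" open="">abc</details>
        </details>
        <details data-level="2" open="">
            <summary class="section-heading"><h2 id="French">French</h2></summary>
            <details data-level="3" open="">abc</details>
        </details>
        <details data-level="2" open="">
            <summary class="section-heading"><h2 id="Norman">Norman</h2></summary>
            <details data-level="3" open="">abc</details>
        </details>
    </div>
</div>
"""


def strip_non_english(html: str) -> str:
    """Hypothetical helper: drop every language section except English."""
    soup = BeautifulSoup(html, "html.parser")
    # Iterate over the headings directly instead of indexing two parallel lists.
    for h2 in soup.select("details > summary > h2"):
        if h2.get_text(strip=True) != "English":
            # The nearest <details> ancestor is the whole language section.
            h2.find_parent("details").decompose()
    return str(soup)


cleaned = strip_non_english(SAMPLE)
print(cleaned)
```

This keeps the original logic (select the headings, compare the text, remove the enclosing section) and only changes how it is packaged.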

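【Comments】:

  • Maybe try something with Rabin-Karp string search, where you hash the strings and only look at a string when the hashes match... I'm just thinking about strategies for making the string matching in the for loop faster. @Andrej Kesely's approach using BeautifulSoup is probably faster...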

Tags: python beautifulsoup


【Solution 1】:

You can use a CSS selector with :not() and then .extract() the selected elements:

for d in soup.select('details[data-level="2"]:not(:has(h2#English))'):
    d.extract()

print(soup.prettify())

Prints:

<div class="content mw-parser-output" id="bodyContent">
 <div id="mw-content-text" style="direction: ltr;">
  <h1 aria-haspopup="true" class="section-heading" data-section-id="0" tabindex="0">
   <span class="mw-headline" id="title_0">
    pomme
   </span>
  </h1>
  <details data-level="2" open="">
   <summary class="section-heading">
    <h2 id="English">
     English
    </h2>
   </summary>
   <details data-level="3" open="">
    abc
   </details>
  </details>
 </div>
</div>
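
For reference, here is a self-contained sketch of the selector approach that can be run as-is. It assumes the three-language input from the question; the helper name `keep_english_only` is hypothetical, and `decompose()` is used in place of `extract()` on the assumption that the removed sections are not needed afterwards.

```python
from bs4 import BeautifulSoup

# Three-language input, abbreviated from the question.
HTML = """
<div id="mw-content-text">
    <h1 class="section-heading"><span class="mw-headline" id="title_0">pomme</span></h1>
    <details data-level="2" open="">
        <summary class="section-heading"><h2 id="English">English</h2></summary>
        <details data-level="3" open="">abc</details>
    </details>
    <details data-level="2" open="">
        <summary class="section-heading"><h2 id="French">French</h2></summary>
        <details data-level="3" open="">abc</details>
    </details>
    <details data-level="2" open="">
        <summary class="section-heading"><h2 id="Norman">Norman</h2></summary>
        <details data-level="3" open="">abc</details>
    </details>
</div>
"""

# Level-2 sections that do NOT contain <h2 id="English"> anywhere inside them.
NON_ENGLISH = 'details[data-level="2"]:not(:has(h2#English))'


def keep_english_only(html: str) -> BeautifulSoup:
    """Hypothetical helper: parse the HTML and remove non-English sections."""
    soup = BeautifulSoup(html, "html.parser")
    for section in soup.select(NON_ENGLISH):
        # decompose() removes and destroys the element; extract() would
        # remove it and return it instead.
        section.decompose()
    return soup


result = keep_english_only(HTML)
remaining = [h2.get_text() for h2 in result.select("details > summary > h2")]
print(remaining)
```

The selector does the filtering in one step, so no separate text comparison is needed in Python.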

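【Comments】:

  • Hi Andrej, could you take a look at this question and elaborate on why soup.find('details[data-level="2"]:has(h2#English)') does not work?
  • @LEAnhDung The .find() method does not accept CSS selectors. Use .select() or .select_one() instead.
  • Thank you very much, Andrej. Your answer is as elegant as always :))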
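
The difference between .find() and .select_one() mentioned in the comments can be illustrated with a small sketch. The two-section HTML string and the variable names here are made up for the example.

```python
from bs4 import BeautifulSoup

html = (
    '<details data-level="2"><summary><h2 id="English">English</h2></summary></details>'
    '<details data-level="2"><summary><h2 id="French">French</h2></summary></details>'
)
soup = BeautifulSoup(html, "html.parser")

selector = 'details[data-level="2"]:has(h2#English)'

# find() treats its first argument as a tag name, not a CSS selector,
# so no element matches and the result is None.
via_find = soup.find(selector)

# select_one() parses the string as a CSS selector and returns the first match.
via_select_one = soup.select_one(selector)

# A find()-style equivalent: locate the heading by tag name and id,
# then walk up to the enclosing level-2 <details>.
heading = soup.find("h2", id="English")
via_find_parent = heading.find_parent("details", attrs={"data-level": "2"})

print(via_find)
print(via_select_one is via_find_parent)
```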