【Posted】: 2021-04-21 23:36:31
【Question】:
I have HTML like the following:
<div id="bodyContent" class="content mw-parser-output">
<div id="mw-content-text" style="direction: ltr;">
<h1 class="section-heading" tabindex="0" aria-haspopup="true" data-section-id="0">
<span class="mw-headline" id="title_0">pomme</span>
</h1>
<details data-level="2" open="">
<summary class="section-heading"><h2 id="English">English</h2></summary>
<details data-level="3" open="">abc</details>
</details>
<details data-level="2" open="">
<summary class="section-heading"><h2 id="French">French</h2></summary>
<details data-level="3" open="">abc</details>
</details>
<details data-level="2" open="">
<summary class="section-heading"><h2 id="Norman">Norman</h2></summary>
<details data-level="3" open="">abc</details>
</details>
</div>
</div>
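Each <details data-level="2" open=""> element contains an element such as <h2 id="English">English</h2> that indicates the language. My goal is to remove every <details data-level="2" open=""> whose language is not English. My expected result is: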
<div id="bodyContent" class="content mw-parser-output">
<div id="mw-content-text" style="direction: ltr;">
<h1 class="section-heading" tabindex="0" aria-haspopup="true" data-section-id="0">
<span class="mw-headline" id="title_0">pomme</span>
</h1>
<details data-level="2" open="">
<summary class="section-heading"><h2 id="English">English</h2></summary>
<details data-level="3" open="">abc</details>
</details>
</div>
</div>
I get this result with the following code:
from bs4 import BeautifulSoup
texte = """
<div id="bodyContent" class="content mw-parser-output">
<div id="mw-content-text" style="direction: ltr;">
<h1 class="section-heading" tabindex="0" aria-haspopup="true" data-section-id="0">
<span class="mw-headline" id="title_0">pomme</span>
</h1>
<details data-level="2" open="">
<summary class="section-heading"><h2 id="English">English</h2></summary>
<details data-level="3" open="">abc</details>
</details>
</div>
</div>
"""
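soup = BeautifulSoup(texte, 'html.parser')
# Collect every language heading and its text.
tmp = soup.select('details > summary > h2')
tmp2 = [s.contents[0] for s in tmp]
for i in range(len(tmp2)):
    # Remove the enclosing <details> of every non-English heading.
    if tmp2[i] != 'English':
        tmp[i].find_parent('details').decompose()
print(soup)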
I need to repeat this operation nearly 4 million times. Is there a more efficient way to do this? Thank you very much for your help!
【Comments】:
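One possible direction, offered only as a minimal sketch: walk the level-2 <details> sections once and drop each one whose heading is not the wanted language, instead of building two parallel lists and indexing into them. The function name `keep_only_language` and the `sample` string are hypothetical names introduced for illustration, and whether this is actually faster on real pages would need to be measured.

```python
from bs4 import BeautifulSoup

# Hypothetical sample input, shortened from the HTML in the question.
sample = """
<div id="mw-content-text">
<h1 class="section-heading"><span class="mw-headline" id="title_0">pomme</span></h1>
<details data-level="2" open="">
<summary class="section-heading"><h2 id="English">English</h2></summary>
<details data-level="3" open="">abc</details>
</details>
<details data-level="2" open="">
<summary class="section-heading"><h2 id="French">French</h2></summary>
<details data-level="3" open="">abc</details>
</details>
<details data-level="2" open="">
<summary class="section-heading"><h2 id="Norman">Norman</h2></summary>
<details data-level="3" open="">abc</details>
</details>
</div>
"""


def keep_only_language(html: str, language: str = "English") -> str:
    """Return html with every level-2 <details> of another language removed.

    Hypothetical helper, not part of BeautifulSoup.
    """
    soup = BeautifulSoup(html, "html.parser")
    # select() returns a list, so removing nodes inside the loop is safe.
    for section in soup.select('details[data-level="2"]'):
        heading = section.find("h2")
        # Sections without an <h2> are left untouched, as in the original code.
        if heading is not None and heading.get_text(strip=True) != language:
            section.decompose()
    return str(soup)


result = keep_only_language(sample)
print(result)
```

If the lxml package is installed, passing "lxml" instead of "html.parser" to BeautifulSoup may reduce parsing time, but lxml wraps fragments in <html> and <body> tags, so the output would differ slightly. With about 4 million documents, parsing is likely to dominate the cost, so timing both variants on a small batch first seems worthwhile.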
-
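Maybe try something with Rabin-Karp string search, where you hash the strings and only look at the string itself when the hashes match... I'm just trying to think of strategies that would make the string matching in the for loop faster. @Andrej Kesely Using BeautifulSoup might be faster...
Tags: python beautifulsoup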
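For readers unfamiliar with the idea mentioned in this comment, here is a minimal sketch of Rabin-Karp substring search: a rolling hash is kept for the current window, and the characters are compared only when the hash matches the pattern's hash. The function name `rabin_karp_find` and the `base` and `mod` values are assumptions chosen for illustration.

```python
def rabin_karp_find(text: str, pattern: str,
                    base: int = 256, mod: int = 1_000_000_007) -> int:
    """Return the index of the first occurrence of pattern in text, or -1."""
    n, m = len(text), len(pattern)
    if m == 0:
        return 0
    if m > n:
        return -1
    # Weight of the leading character in a window of length m.
    high = pow(base, m - 1, mod)
    p_hash = w_hash = 0
    for i in range(m):
        p_hash = (p_hash * base + ord(pattern[i])) % mod
        w_hash = (w_hash * base + ord(text[i])) % mod
    for start in range(n - m + 1):
        # Compare the actual characters only when the hashes agree.
        if w_hash == p_hash and text[start:start + m] == pattern:
            return start
        if start < n - m:
            # Slide the window: drop text[start], append text[start + m].
            w_hash = ((w_hash - ord(text[start]) * high) * base
                      + ord(text[start + m])) % mod
    return -1


print(rabin_karp_find('<h2 id="English">English</h2>', 'id="English"'))
```

In CPython the built-in `in` operator and `str.find` are already implemented in optimized C, so a pure-Python version like this is illustrative and is unlikely to be faster in practice.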