Split an HTML string into sections based on a specific tag in Python

【Question title】: Split HTML string into sections based on specific tag on python
【Posted】: 2020-01-29 18:55:56
【Question description】:

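I'm fairly new to Python. I've spent a few days on the forums, and answers to my question exist, but only for JavaScript.

I have an HTML page containing news, and I'd like the content to be parsed into a new section wherever there is an H4 tag. I want to name each section after the content of its heading string, and then pull those sections into separate emails (but that is for later). I can't seem to figure out how to create the sections. Below is what the code looks like. Apologies if my question is elementary; any advice is much appreciated. Thanks!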

'<td><h3>Andean</h3><hr/></td>
</tr><tr>
    <td><h4>Bolivia bla bla</h4></td>
</tr>             
<tr>
    <td><p>* Bolivia&bla bla text text </p></td>
</tr><tr>
    <td><h3>Brazil</h3><hr/></td>
</tr><tr>
    <td><h4>BRAZIL: bla bla</h4></td>
</tr>             
<tr>'

【Question discussion】:

Tags: python html parsing sections

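【Solution 1】:

Many thanks to @Ajax1234 and @orangeInk for the help.

I took a closer look at the code, and the page's code also changed in the meantime. I ended up using find_all to get the h2 titles and the divs of a specific class for the content, then looping over the levels to build a data frame in which each level corresponds to a section/country. I'm not sure what I did is ideal, but this is what I have: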

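# main_table is the BeautifulSoup element wrapping the news items (its creation is not shown in this answer)
comment_h2_tags = main_table.find_all('div', attrs={'class': 'cr_title_in'})
comment_div_tags = main_table.find_all('div', attrs={'class': 'itemBody'})

# Collect the stripped link text of every title block
h2s = []
for h2_tag in comment_h2_tags:
    h2s.append(h2_tag.a.text.strip())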

I'm currently entering the country names manually, but I thought I'd provide an update. Thanks!
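The pairing step described above (one row per section/country) is not shown in the answer. A minimal sketch of how it might look, assuming the titles and bodies have already been extracted as plain strings; the `titles` and `bodies` values here are hypothetical stand-ins, not data from the real page:

```python
# Hypothetical stand-ins for the text pulled out of the title and body tags.
titles = ["Bolivia bla bla", "BRAZIL: bla bla"]
bodies = ["* Bolivia bla bla text text", "* Brazil bla bla text text"]

# zip() silently truncates to the shorter list, so check alignment first.
if len(titles) != len(bodies):
    raise ValueError("titles and bodies are not aligned")

# One dict per section; a list of dicts like this is a common input
# for building a data frame, one row per section.
rows = [{"section": title, "content": body} for title, body in zip(titles, bodies)]

for row in rows:
    print(row["section"], "->", row["content"])
```

If pandas is installed, `pandas.DataFrame(rows)` would turn this list into a data frame with `section` and `content` columns.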

【Discussion】:

【Solution 2】:

You can use itertools.groupby:

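import itertools, re
from bs4 import BeautifulSoup as soup
# s is the HTML string from the question
# First <h3> or <h4> inside each <td>; cells without one are dropped
r = list(filter(None, [i.find(re.compile('h3|h4')) for i in soup(s, 'html.parser').find_all('td')]))
# Group consecutive tags into alternating runs of h3 and h4
result = [(a, list(b)) for a, b in itertools.groupby(r, key=lambda x:x.name=='h4')]
# Join each h3 run with the h4 run that follows it
final_result = [[b.text for b in result[i][-1]]+[b.text for b in result[i+1][-1]] for i in range(0, len(result), 2)]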

Output:

[['Andean', 'Bolivia bla bla'], ['Brazil', 'BRAZIL: bla bla']]
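To see why this pairing works, it may help to look at the grouping step without the HTML parsing. The sketch below uses only the standard library and replaces the parsed tags with hypothetical `(tag name, text)` tuples. It assumes, like the answer, that the sequence starts with an h3; unlike the answer, it pairs the runs with `zip`, so a trailing h3 with no h4 after it is dropped rather than raising an `IndexError`.

```python
import itertools

# Hypothetical (tag name, text) pairs standing in for the parsed <h3>/<h4> tags.
headings = [
    ("h3", "Andean"),
    ("h4", "Bolivia bla bla"),
    ("h3", "Brazil"),
    ("h4", "BRAZIL: bla bla"),
]

# groupby only merges *consecutive* items that share a key, so the list
# becomes alternating runs: an h3 run, an h4 run, an h3 run, an h4 run.
runs = [
    (is_h4, [text for _, text in group])
    for is_h4, group in itertools.groupby(headings, key=lambda item: item[0] == "h4")
]

# Pair every h3 run (even positions) with the h4 run after it (odd positions).
pairs = [h3s + h4s for (_, h3s), (_, h4s) in zip(runs[0::2], runs[1::2])]
print(pairs)
```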

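【Discussion】:

【Solution 3】:

You can do this "by hand" with regular expressions (https://en.wikipedia.org/wiki/Regular_expression), or you can use a library built specifically for parsing HTML (https://pypi.org/project/beautifulsoup4/). If you plan to do more HTML parsing, I recommend the purpose-built library. If you are unfamiliar with them, both take a little getting used to, but both are worth learning.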

import re
from bs4 import BeautifulSoup

html_code = """<td><h3>Andean</h3><hr/></td>
</tr><tr>
    <td><h4>Bolivia bla bla</h4></td>
</tr>             
<tr>
    <td><p>* Bolivia&bla bla text text </p></td>
</tr><tr>
    <td><h3>Brazil</h3><hr/></td>
</tr><tr>
    <td><h4>BRAZIL: bla bla</h4></td>
</tr>             
<tr>"""

print('* with regex:')
print(re.findall('<h4>(.*?)</h4>', html_code))

print('* with beautiful soup:')
soup = BeautifulSoup(html_code, 'html.parser')  # name the parser explicitly to avoid the "no parser specified" warning
tmp = soup.find_all('h4')
for val in tmp:
    print(val.contents)

This outputs:

* with regex:
['Bolivia bla bla', 'BRAZIL: bla bla']
* with beautiful soup:
['Bolivia bla bla']
['BRAZIL: bla bla']
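Both snippets above return only the h4 headings, while the question also asks for the content under each heading. One possible way to get there is sketched below with the standard library's `html.parser`, so it runs without extra installs. The names `SectionSplitter` and `split_sections` are made up for this illustration. It assumes that a section runs from one h4 until the next h3 or h4, and the sample HTML is a simplified copy of the question's snippet (the stray `&` is left out).

```python
from html.parser import HTMLParser


class SectionSplitter(HTMLParser):
    """Start a new section at every <h4>; an <h3> ends the current section."""

    def __init__(self, heading_tag="h4", stop_tags=("h3",)):
        super().__init__(convert_charrefs=True)
        self.heading_tag = heading_tag
        self.stop_tags = stop_tags
        self.sections = {}          # heading text -> list of text chunks
        self._in_heading = False
        self._heading_parts = []
        self._current = None        # key of the section being filled

    def handle_starttag(self, tag, attrs):
        if tag == self.heading_tag:
            self._in_heading = True
            self._heading_parts = []
        elif tag in self.stop_tags:
            self._current = None    # text after this belongs to no section

    def handle_endtag(self, tag):
        if tag == self.heading_tag and self._in_heading:
            self._in_heading = False
            self._current = "".join(self._heading_parts).strip()
            # Sections with identical headings would be merged here.
            self.sections.setdefault(self._current, [])

    def handle_data(self, data):
        if self._in_heading:
            self._heading_parts.append(data)
        elif data.strip() and self._current is not None:
            self.sections[self._current].append(data.strip())


def split_sections(html):
    """Return {heading text: joined section text} for the given HTML string."""
    parser = SectionSplitter()
    parser.feed(html)
    parser.close()
    return {name: " ".join(chunks) for name, chunks in parser.sections.items()}


sample_html = """<td><h3>Andean</h3><hr/></td>
</tr><tr>
    <td><h4>Bolivia bla bla</h4></td>
</tr>
<tr>
    <td><p>* Bolivia bla bla text text </p></td>
</tr><tr>
    <td><h3>Brazil</h3><hr/></td>
</tr><tr>
    <td><h4>BRAZIL: bla bla</h4></td>
</tr>
<tr>"""

sections = split_sections(sample_html)
for name, text in sections.items():
    print(repr(name), "->", repr(text))
```

Each key of `sections` can then serve as the section name, for example as an email subject. The Brazil section comes back empty because the sample stops right after its heading.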

【Discussion】: