Select two HTML elements at the same time in BeautifulSoup? Answers

【Question title】: Select two HTML elements at the same time in BeautifulSoup?
【Posted】: 2021-12-10 10:43:00
【Question description】:

I have an HTML file that looks like this:

<div class="mon_title">[CURRENT DATE]</div>
<table class="mon_list" >[contents of the table]</table>
[OTHER CODE]
<div class="mon_title">[ANOTHER DATE]</div>
<table class="mon_list" >[contents of another table]</table>
[repeats a few times over]

My end goal is to extract the tables and somehow attach the corresponding date to each one.

With this code I managed to extract only the tables:

tables = soup.find_all("table", {"class": "mon_list"})

My question is how to extract both the dates and the tables, and somehow attach the corresponding date to each table.

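【Question comments】:

"Attach the corresponding date to each table": can you elaborate? Do you want the date inserted into the table as a row?
That is why I put it that way. It really doesn't matter to me whether the result just prints the date above each table, still in HTML format, or adds it to the table itself.
Check my answer.

Tags: python html web-scraping beautifulsoup html-table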

【Solution 1】:

find_all accepts a custom function as a filter (see the docs).

Here is a usage example:

html = """<div class="mon_title">[CURRENT DATE]</div>
<table class="mon_list" >[contents of the table]</table>
<div class="mon_title">[ANOTHER DATE]</div>
<table class="mon_list" >[contents of another table]</table><span>hhhh</span>"""

import bs4

soup = bs4.BeautifulSoup(html, 'lxml')

def finder(tag1, tag2):
    # Build a filter that matches any tag named tag1 or tag2.
    def _wrapper(tag):
        return tag.name in (tag1, tag2)
    return _wrapper

# find_all returns matches in document order, so each date precedes its table.
tags = soup.find_all(finder('table', 'div'))

print([tag.text if tag.name == 'div' else tag for tag in tags])

Output:

['[CURRENT DATE]', <table class="mon_list">[contents of the table]</table>, '[ANOTHER DATE]', <table class="mon_list">[contents of another table]</table>]
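
The flat list above still leaves the pairing to the reader. As a minimal sketch of one way to group each date with the table that follows it, the block below passes a list of tag names to find_all (which matches any of them, in document order) and remembers the most recent date. The helper name pair_dates_with_tables is hypothetical, and html.parser is used only to avoid the lxml dependency.

```python
from bs4 import BeautifulSoup

html = """<div class="mon_title">[CURRENT DATE]</div>
<table class="mon_list">[contents of the table]</table>
<div class="mon_title">[ANOTHER DATE]</div>
<table class="mon_list">[contents of another table]</table><span>hhhh</span>"""

# html.parser is an assumption here; the answer above uses lxml.
soup = BeautifulSoup(html, "html.parser")


def pair_dates_with_tables(soup):
    """Hypothetical helper: return (date_text, table_tag) pairs in document order."""
    pairs = []
    current_date = None
    # A list of names matches any of them; results come back in document order.
    for tag in soup.find_all(["div", "table"]):
        classes = tag.get("class", [])
        if tag.name == "div" and "mon_title" in classes:
            # Remember the latest date until the next table shows up.
            current_date = tag.get_text(strip=True)
        elif tag.name == "table" and "mon_list" in classes:
            pairs.append((current_date, tag))
    return pairs


pairs = pair_dates_with_tables(soup)
for date, table in pairs:
    print(date, "->", table.get_text(strip=True))
```

A table that appears before any date is paired with None, so the caller can decide how to handle it.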

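【Comments】:

【Solution 2】:

You can do it like this.

Use find_all() to select every <table> with the class name mon_list.
For each table selected above, since the date <div> appears before the <table> element, you can select it with the .findPreviousSibling() method.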
```
.findPreviousSibling('div', class_='mon_title')
```

Here is the full code, which prints the date first and then the table data.

from bs4 import BeautifulSoup
s = """
<div class="mon_title">[CURRENT DATE]</div>
<table class="mon_list" >[contents of the table]</table>
[OTHER CODE]
<div class="mon_title">[ANOTHER DATE]</div>
<table class="mon_list" >[contents of another table]</table>"""

soup = BeautifulSoup(s, 'lxml')
tabs = soup.find_all('table', class_='mon_list')
for tab in tabs:
    date_div = tab.findPreviousSibling('div', class_='mon_title')
    print(f"Date: {date_div.text.strip()}\nTable Data: {tab.text.strip()}\n")

Output:

Date: [CURRENT DATE]
Table Data: [contents of the table]

Date: [ANOTHER DATE]
Table Data: [contents of another table]
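
The asker said that adding the date to the table itself would also be fine. A minimal sketch of that variant follows; it assumes a <caption> element is an acceptable place for the date, and it uses find_previous_sibling, the snake_case spelling of the method used above. The sample rows and the html.parser choice are illustrative assumptions, not part of the original answer.

```python
from bs4 import BeautifulSoup

# Sample markup with illustrative table rows (not from the original post).
s = """
<div class="mon_title">[CURRENT DATE]</div>
<table class="mon_list"><tr><td>row</td></tr></table>
<p>other code</p>
<div class="mon_title">[ANOTHER DATE]</div>
<table class="mon_list"><tr><td>another row</td></tr></table>"""

soup = BeautifulSoup(s, "html.parser")

for tab in soup.find_all("table", class_="mon_list"):
    # Nearest preceding sibling <div class="mon_title">, or None if there is none.
    date_div = tab.find_previous_sibling("div", class_="mon_title")
    if date_div is None:
        continue  # no date before this table; leave it unchanged
    # Assumption: store the date as the table's first child, a <caption>.
    caption = soup.new_tag("caption")
    caption.string = date_div.get_text(strip=True)
    tab.insert(0, caption)

captions = [tab.caption.get_text() for tab in soup.find_all("table", class_="mon_list")]
print(captions)
```

Checking for None guards against a table that has no date <div> before it, a case in which the original loop would raise an AttributeError.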

【Comments】: