How to select all table elements inside a div parent node with BeautifulSoup? Answers

[Question title]: How to select all table elements inside a div parent node with BeautifulSoup?
[Posted]: 2021-03-08 01:10:09
[Question]:

I am trying to write a custom function that selects all table elements from a div parent node.

This is what I have so far:

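from bs4 import BeautifulSoup
import requests

url = 'https://www.salario.com.br/profissao/abacaxicultor-cbo-612510'

def getTables(url):
    # Download the page and parse it (the 'lxml' parser needs the lxml package installed)
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'lxml')

    # find() returns only the first div with this class
    div_component = soup.find('div', attrs={'class':'td-post-content'})
    # Search for tables with class "listas" inside that single div
    tables = div_component.find_all('table', attrs={'class':'listas'})

    return tables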

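However, calling getTables(url) returns an empty list, [].

I would like this function to return all the HTML table elements inside the div node that match the given attributes.

How can I adjust this function?

Is there any other library I could use for this task?

[Question comments]:

Tags: python html function web-scraping beautifulsoup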

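[Solution 1]:

Building on the other commenters' remarks and expanding on them.

Your div_component holds a single element, and that element contains no tables, whereas find_all() returns 8 elements: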

len(soup.find_all('div', attrs={'class':'td-post-content'}))

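So find() alone is not enough here. find_all() returns a list of divs, and you cannot call find() on that list directly: you need to iterate over it to find the div that contains the tables.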
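
The iteration described above can be sketched as follows. This is a minimal sketch, not the answerer's own code: SAMPLE_HTML is a made-up stand-in for the real page, get_tables is a hypothetical helper name, and the built-in html.parser is used instead of lxml only so the example runs without extra dependencies.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the real page: two divs share the class,
# and only the second one contains tables.
SAMPLE_HTML = """
<div class="td-post-content"><p>intro text, no tables</p></div>
<div class="td-post-content">
  <table class="listas"><tr><td>A</td></tr></table>
  <table class="listas"><tr><td>B</td></tr></table>
  <table class="other"><tr><td>C</td></tr></table>
</div>
"""

def get_tables(html):
    """Collect every table.listas found inside any div.td-post-content."""
    soup = BeautifulSoup(html, "html.parser")
    tables = []
    # Visit every matching div, not just the first one that find() would return
    for div in soup.find_all("div", attrs={"class": "td-post-content"}):
        tables.extend(div.find_all("table", attrs={"class": "listas"}))
    return tables

tables = get_tables(SAMPLE_HTML)
print(len(tables))
```

If divs with this class can be nested inside one another, the same table may be collected more than once, so deduplicate the result if that matters for your page.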

Another way to get the tables you want is to search for them directly:

tables = soup.find_all('table', attrs={'class':'listas'})

tables is a list of 6 elements. If you know which table you want, you can iterate over the tables until you find it.
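
One possible way to pick a specific table out of that list is to match on its header cells. The sketch below assumes the tables have th header cells, which the original answer does not state; SAMPLE_HTML, the header texts, and the helper name find_table_by_header are all hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two tables with the same class but different headers.
SAMPLE_HTML = """
<table class="listas"><tr><th>Cargo</th></tr><tr><td>x</td></tr></table>
<table class="listas"><tr><th>Cidade</th></tr><tr><td>y</td></tr></table>
"""

def find_table_by_header(html, header_text):
    """Return the first table.listas whose header cells include header_text, else None."""
    soup = BeautifulSoup(html, "html.parser")
    for table in soup.find_all("table", attrs={"class": "listas"}):
        headers = [th.get_text(strip=True) for th in table.find_all("th")]
        if header_text in headers:
            return table
    return None

wanted = find_table_by_header(SAMPLE_HTML, "Cidade")
```

Returning None when nothing matches lets the caller decide how to handle a missing table instead of failing inside the loop.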

[Comments]:

[Solution 2]:

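The first problem is that find() only returns the first match, and the first td-post-content div does not contain any tables. I think you want find_all(). Second, you can use CSS selectors with BeautifulSoup through the select() method, so instead of passing an attrs argument you can search with soup.select('div.td-post-content').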

[Comments]:

With a CSS selector: soup.select('.td-post-content .listas')
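
The selector from the comment above can be tried on a small document. This is a sketch under stated assumptions: SAMPLE_HTML is invented for illustration and html.parser is used only to avoid extra dependencies. select() relies on the soupsieve package, which is normally installed together with recent versions of BeautifulSoup.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one table.listas sits outside the div and should be ignored.
SAMPLE_HTML = """
<div class="td-post-content"><p>no tables here</p></div>
<div class="td-post-content">
  <table class="listas"><tr><td>A</td></tr></table>
  <table class="listas"><tr><td>B</td></tr></table>
</div>
<table class="listas"><tr><td>outside</td></tr></table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Descendant selector: any element with class "listas" inside any
# element with class "td-post-content"
tables = soup.select(".td-post-content .listas")

# The same idea restricted to specific tag names
strict_tables = soup.select("div.td-post-content table.listas")

print(len(tables), len(strict_tables))
```

Because the selector matches across every td-post-content div at once, no explicit loop over the divs is needed.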