How to find_all(id) from a div with Beautiful Soup in Python

【Question title】: How to find_all(id) from a div with Beautiful Soup in Python
【Posted】: 2019-11-13 13:22:21
【Question】:

I want to print all the IDs of the elements on a page that share a particular class.

The page I want to scrape with Beautiful Soup looks like this:

<div itemscope itemprop="item" itemtype="http://schema.org/Product" id="12345" class="realestate"> 
<div class="contentArea"> 
<meta itemprop="name" content="Name - 12345 " /> 
<meta itemprop="url" content="https://url12345.hu" />   
<meta itemprop="category" content="category1" />   
</div>
</div>
<div itemscope itemprop="item" itemtype="http://schema.org/Product" id="12346" class="realestate"> 
<div class="contentArea"> 
<meta itemprop="name" content="Name - 12346 " /> 
<meta itemprop="url" content="https://url12346.hu" />   
<meta itemprop="category" content="category1" />   
</div>
</div>

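The "ID" is a unique identifier on the itemscope div, so I would like to extract these unique IDs and print them all (the reason is to attach all the other ad information, such as name, URL, etc., to each ID later).

I tried this Python code, but it does not work.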

import requests
from bs4 import BeautifulSoup

page = requests.get('searchResultPage.url')
soup = BeautifulSoup(page.text, 'html.parser')
id = soup.find_all('id')
print(id)

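It returns an empty list.

What I expect, and what I want, is to get back a list of the IDs from the divs, like this: 12345 12346

Thanks in advance for your help!

【Comments】:

Tags: python beautifulsoup

【Solution 1】:

Here are a few solutions. If you only care about tags that have an id: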

tags = page_soup.find_all(id=True)
for tag in tags:
    print(tag.name,tag['id'],sep='->')

If you need to loop over all tags:

tags = page_soup.find_all()
for tag in tags:
    if 'id' in tag.attrs:
        print(tag.name, tag['id'], sep='->')

To get just the IDs:

ids = [tag['id'] for tag in page_soup.find_all(id=True)]
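
As a self-contained check, the sketch below runs the find_all(id=True) approach on a trimmed copy of the markup from the question. The html string is an assumed sample, not a live page; in real use the markup would come from the HTTP response.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the question's markup (assumed sample, not a live page).
html = """
<div itemscope itemprop="item" itemtype="http://schema.org/Product" id="12345" class="realestate">
<div class="contentArea">
<meta itemprop="name" content="Name - 12345 " />
</div>
</div>
<div itemscope itemprop="item" itemtype="http://schema.org/Product" id="12346" class="realestate">
<div class="contentArea">
<meta itemprop="name" content="Name - 12346 " />
</div>
</div>
"""

page_soup = BeautifulSoup(html, "html.parser")

# id=True matches any tag that carries an id attribute, whatever its value.
# The inner contentArea divs have no id, so they are skipped.
ids = [tag["id"] for tag in page_soup.find_all(id=True)]
print(ids)  # ['12345', '12346']
```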

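【Comments】:

【Solution 2】:

There is a difference between tags and attributes: in your case, div is the tag and id is an attribute of that tag. So you first have to find all the tags with find_all(name='tag'), and only then can you read the attribute with get('attribute'). If you are scraping long pages, you can use a list comprehension to tighten up your code:

# page is the requests response from the question
soup = BeautifulSoup(page.text, 'html.parser')
# Collect the id of every div, skipping divs that have no id attribute
test = [r['id'] for r in soup.find_all(name="div") if r.get('id') is not None]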

Output:

['12345', '12346']

Alternatively, you can have find_all() return only the tags that have an id attribute (thanks to Jon Clements), for example:

test = [r['id'] for r in soup.find_all(name="div", attrs={"id":True})]
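
The value passed for id in attrs decides what is matched: True means "has an id attribute", while a string matches only that exact id. A minimal sketch of the difference, assuming a small made-up html sample:

```python
from bs4 import BeautifulSoup

# Assumed sample: two divs with ids and one div without.
html = '<div id="12345"></div><div id="12346"></div><div class="contentArea"></div>'
soup = BeautifulSoup(html, "html.parser")

# True: every div that has an id attribute, whatever its value.
any_id = [d["id"] for d in soup.find_all(name="div", attrs={"id": True})]

# A string: only the div whose id equals that exact value.
one_id = [d["id"] for d in soup.find_all(name="div", attrs={"id": "12346"})]

# No attribute filter: .get() returns None for the div without an id,
# which is why the unfiltered version needs an explicit None check.
unfiltered = [d.get("id") for d in soup.find_all(name="div")]

print(any_id)      # ['12345', '12346']
print(one_id)      # ['12346']
print(unfiltered)  # ['12345', '12346', None]
```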

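【Comments】:

You can streamline the list comprehension with soup.find_all('div', attrs={'id': True}), which returns only elements that have an id attribute, so the if check in your list comprehension becomes redundant because the attribute is guaranteed to exist.
I didn't know about this, thanks. I've updated the answer :)

【Solution 3】:

If you want to see every ID on the whole page, this will work, but the output will also include the full HTML of each matching tag, inner and outer markup included.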

# Named tags rather than id, to avoid shadowing Python's built-in id()
tags = soup.find_all(id=True)
print(tags)

If you want to see just the ID values, without all the HTML, printed one ID per line, you can use:

for ID in soup.find_all('div', id=True):  
    print(ID.get('id'))

In the for loop above, you give the tag name in quotes, i.e. 'div', and then filter on the attribute you want, i.e. id=True.
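
The question mentions attaching the name and URL to each ID later. The same loop can be extended for that; the sketch below is one possible way, assuming the markup from the question. The records dictionary and its keys are illustrative names, not part of the original answer.

```python
from bs4 import BeautifulSoup

# Markup from the question, trimmed to the fields used here (assumed sample).
html = """
<div itemscope itemprop="item" id="12345" class="realestate">
<div class="contentArea">
<meta itemprop="name" content="Name - 12345 " />
<meta itemprop="url" content="https://url12345.hu" />
</div>
</div>
<div itemscope itemprop="item" id="12346" class="realestate">
<div class="contentArea">
<meta itemprop="name" content="Name - 12346 " />
<meta itemprop="url" content="https://url12346.hu" />
</div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

records = {}  # hypothetical structure: id -> details of that ad
for div in soup.find_all("div", id=True):
    # Search only inside this div for its meta tags.
    name = div.find("meta", itemprop="name")
    url = div.find("meta", itemprop="url")
    records[div["id"]] = {
        "name": name["content"].strip() if name else None,
        "url": url["content"] if url else None,
    }

for ad_id, details in records.items():
    print(ad_id, details["name"], details["url"])
```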

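【Comments】:

【Solution 4】:

HS-nebula is right: find_all looks for tags of a certain type, and in your soup id is an attribute, not a tag type. To get a list of all the ids in the soup, you can use the following one-liner: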

ids = [tag['id'] for tag in soup.select('div[id]')]

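This uses a CSS selector rather than bs4's find_all, because I find the bs4 documentation lacking when it comes to its built-in functions.

So what soup.select does is return a list of all div elements that have an attribute named "id"; we then iterate over that list of div tags and add the value of each "id" attribute to the ids list.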
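
To show how the selector can be widened or narrowed, here is a small sketch with three variants. The html sample, including the banner div and the span, is made up for illustration, and combining the realestate class from the question with [id] is an assumption about what the asker wants.

```python
from bs4 import BeautifulSoup

# Assumed sample: one listing div plus unrelated elements that also have ids.
html = (
    '<div id="12345" class="realestate"></div>'
    '<div id="banner" class="ad"></div>'
    '<span id="note"></span>'
    '<div class="contentArea"></div>'
)
soup = BeautifulSoup(html, "html.parser")

# div[id]: div elements that have an id attribute.
div_ids = [tag["id"] for tag in soup.select("div[id]")]

# [id]: any element that has an id attribute.
all_ids = [tag["id"] for tag in soup.select("[id]")]

# div.realestate[id]: only divs with class "realestate" and an id.
listing_ids = [tag["id"] for tag in soup.select("div.realestate[id]")]

print(div_ids)      # ['12345', 'banner']
print(all_ids)      # ['12345', 'banner', 'note']
print(listing_ids)  # ['12345']
```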

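【Comments】:

It's generally a shame that BeautifulSoup doesn't expose the underlying parser it uses (unless I've missed something)... in this case, since it's html.parser, it doesn't matter... but if it were lxml.html, this would become an .xpath('//*/@id') query...
Note: the bs4-method counterpart to the .select is soup.find_all(id=True)
I thought they mentioned somewhere that bs4 uses html.parser by default, but I didn't see it stated explicitly when I checked just now. I didn't know you could use .xpath queries when using the lxml parser! Thanks for adding the .find_all version. I looked for the equivalent .find_all in the bs4 docs and couldn't work out the right arguments.
Err no... .xpath isn't available. If you use lxml.html as the parser inside BS4, you don't get access to the underlying parsed object... If you did something like lxml.html.fromstring(page.text).xpath('//*/@id'), that would do it, but if you're doing that, you probably wouldn't be using bs4 in the first place...
Ah, OK, I misunderstood what you meant. Thanks for clarifying.

【Solution 5】:

BeautifulSoup's find_all() function finds all HTML tags of a given type. id is not a tag; it is an attribute of a tag. You have to search for the tags that carry the IDs you want, which in this case are div tags.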

div_tags = soup.find_all('div')
ids = []
for div in div_tags:
    ID = div.get('id')
    if ID is not None:
        ids.append(ID)

BeautifulSoup also provides ways to find tags that have a particular attribute.

【Comments】:
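
One way that attribute-based lookup might be used here is to filter on the realestate class from the question together with id=True, so the manual None check is no longer needed. This is a sketch under that assumption; the html sample and the extra sidebar div are made up for illustration.

```python
from bs4 import BeautifulSoup

# Assumed sample: two listing divs, plus one unrelated div that also has an id.
html = (
    '<div id="12345" class="realestate"><div class="contentArea"></div></div>'
    '<div id="12346" class="realestate"><div class="contentArea"></div></div>'
    '<div id="sidebar" class="layout"></div>'
)
soup = BeautifulSoup(html, "html.parser")

# class_ (with a trailing underscore, since class is a Python keyword)
# filters by CSS class; id=True keeps only tags that have an id attribute.
listing_divs = soup.find_all("div", class_="realestate", id=True)
ids = [div["id"] for div in listing_divs]

print(ids)  # ['12345', '12346']
```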