findAll() in BeautifulSoup skips over multiple ids

【Question Title】: findAll() in BeautifulSoup skips over multiple ids
【Posted】: 2018-05-17 19:10:50
【Question Description】:

I have a string containing an image tag with multiple ids:

<img id="webfast-uhyubv" alt="" data-type="image" id="comp-jefxldtzbalatamediacontentimage" src="http://webfast.co/images/webfast-logo.png" /> 

import bs4

# webpage holds the HTML string that contains the <img> tag above
soup = bs4.BeautifulSoup(webpage, "html.parser")
images = soup.findAll('img')
for image in images:
    print(image)

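The code above returns only id=comp-jefxldtzbalatamediacontentimage.

Replacing

soup = bs4.BeautifulSoup(webpage, "html.parser")

with

soup = bs4.BeautifulSoup(webpage, "lxml")

returns the first id, webfast-uhyubv.

However, I want to get both ids, in the order in which they appear in the input line.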

【Question Discussion】:

This code only gets the first id and not the second.
@Rachit It depends on the parser.

Tags: python beautifulsoup html-parsing

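【Solution 1】:

BeautifulSoup stores the attributes of a tag in a dictionary. Since a dictionary cannot have duplicate keys, one id attribute overwrites the other. You can inspect the attribute dictionary with tag.attrs.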

>>> from bs4 import BeautifulSoup
>>> # tag holds the <img> string from the question
>>> soup = BeautifulSoup(tag, 'html.parser')
>>> soup.img.attrs
{'id': 'comp-jefxldtzbalatamediacontentimage', 'alt': '', 'data-type': 'image', 'src': 'http://webfast.co/images/webfast-logo.png'}

>>> soup = BeautifulSoup(tag, 'lxml')
>>> soup.img.attrs
{'id': 'webfast-uhyubv', 'alt': '', 'data-type': 'image', 'src': 'http://webfast.co/images/webfast-logo.png'}

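As you can see, we get a different value for id depending on the parser: in the output above, html.parser keeps the last id and lxml keeps the first. This happens because different parsers work differently.

There is no way to get both id values using BeautifulSoup. You can use a regular expression to get them. However, use it carefully and as a last resort!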

>>> import re
>>> tag = '<img id="webfast-uhyubv" alt="" data-type="image" id="comp-jefxldtzbalatamediacontentimage" src="http://webfast.co/images/webfast-logo.png" />'
>>> ids = re.findall('id="(.*?)"', tag)
>>> ids
['webfast-uhyubv', 'comp-jefxldtzbalatamediacontentimage']
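
If you would rather not rely on a regular expression, one possible alternative is Python's standard-library html.parser.HTMLParser, which hands the attributes to handle_starttag as a list of (name, value) pairs instead of a dictionary, so repeated names are not collapsed. The following is only a minimal sketch, assuming Python 3; the IdCollector class name is made up for illustration and is not part of BeautifulSoup or the original answer.

```python
from html.parser import HTMLParser


class IdCollector(HTMLParser):
    """Hypothetical helper: collect every id value on <img> tags, in source order."""

    def __init__(self):
        super().__init__()
        self.ids = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs, so duplicate names survive.
        # Self-closing tags such as <img ... /> also reach this method via
        # the default handle_startendtag implementation.
        if tag == "img":
            self.ids.extend(value for name, value in attrs if name == "id")


tag = '<img id="webfast-uhyubv" alt="" data-type="image" id="comp-jefxldtzbalatamediacontentimage" src="http://webfast.co/images/webfast-logo.png" />'

collector = IdCollector()
collector.feed(tag)
collector.close()
print(collector.ids)
# ['webfast-uhyubv', 'comp-jefxldtzbalatamediacontentimage']
```

Unlike the regex, this only looks at attributes literally named id on img tags, so something like data-id="..." would not be picked up by accident.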

【Discussion】:

Thanks for the detailed reply. Given the HTML variations that exist, I will go with the regex approach for now.