Grab XML tags by a second attribute with BeautifulSoup: answers

【Question title】: Grab XML tags by second attributes with BeautifulSoup
【Posted】: 2014-03-29 22:53:45
【Question】:

I'm trying to pull NextBus data, specifically the real-time bus GPS feed seen here: http://webservices.nextbus.com/service/publicXMLFeed?command=vehicleLocations&a=sf-muni&r=N&t=0

It contains tags like this:

<vehicle id="1534" routeTag="N" dirTag="N__OB1" lat="37.76931" lon="-122.43249" 
         secsSinceReport="99" predictable="true" heading="265" speedKmHr="37"/>

I'm learning Python and have managed to pull out a tag based on an attribute, but I'm struggling with every attribute other than id.

So this works:

soup.findAll("vehicle", {"id":"1521"})[1]

But this returns an empty set:

soup.findAll("vehicle", {"routeTag":"N"})

Is there a reason for this?

Also, as I mentioned, I'm new to Python, so if you have a favorite scraping tutorial, feel free to leave a comment!

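【Question comments】:

Unless you explicitly tell BeautifulSoup to parse the document as XML (which only works when lxml is installed), all tag and attribute names are lowercased, because HTML matches them case-insensitively.
Possible duplicate of "BeautifulSoup raise AttributeError when xml tag name contains capital letters"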
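
To see the lowercasing described in the comment above, here is a minimal sketch that uses an inline sample instead of the live feed. It assumes bs4 is installed and relies on Python's built-in html.parser. The SAMPLE string is modelled on the tag in the question, and the second vehicle (id 9999 on route J) is made up for contrast.

```python
from bs4 import BeautifulSoup

# Inline sample modelled on the feed; the second vehicle is hypothetical.
SAMPLE = (
    '<body>'
    '<vehicle id="1534" routeTag="N" lat="37.76931" lon="-122.43249"/>'
    '<vehicle id="9999" routeTag="J" lat="37.70000" lon="-122.40000"/>'
    '</body>'
)

# html.parser treats the input as HTML, so attribute names
# are lowercased while the document is parsed.
soup = BeautifulSoup(SAMPLE, "html.parser")

# Attribute names as BeautifulSoup stored them.
attr_names = sorted(soup.find("vehicle").attrs)

# The mixed-case key no longer exists, so this search finds nothing.
mixed_case_hits = soup.find_all("vehicle", {"routeTag": "N"})

# The lowercased key matches the first vehicle.
lower_case_hits = soup.find_all("vehicle", {"routetag": "N"})

print(attr_names)                                   # ['id', 'lat', 'lon', 'routetag']
print(len(mixed_case_hits), len(lower_case_hits))   # 0 1
```

The id attribute is already lowercase in the feed, which is why a search on id works under either parser while routeTag does not.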

Tags: python xml beautifulsoup

【Solution 1】:

To make it work, pass "xml" as the parser argument to the BeautifulSoup constructor:

from urllib.request import urlopen  # Python 3; on Python 2 this was urllib2
from bs4 import BeautifulSoup

url = 'http://webservices.nextbus.com/service/publicXMLFeed?command=vehicleLocations&a=sf-muni&r=N&t=0'
# The "xml" parser (requires lxml) keeps attribute names case-sensitive.
soup = BeautifulSoup(urlopen(url), "xml")

print(soup.find_all("vehicle", {"routeTag": "N"}))

Prints:

[
 <vehicle heading="-4" id="1431" lat="37.72223" lon="-122.44694" predictable="false" routeTag="N" secsSinceReport="65" speedKmHr="0"/>,
 ...
]

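Alternatively, as @Martijn's comment suggests, leave out the parser argument (so the document is parsed as HTML) and search with the attribute name in lowercase:

print(soup.find_all("vehicle", {"routetag": "N"}))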

Also, note that you should use BeautifulSoup4 and its find_all() method; BeautifulSoup version 3 is no longer maintained.
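
Putting the two options together, the sketch below filters vehicles by route and pulls out their coordinates. It is only an illustration: the helper name vehicles_on_route and the inline SAMPLE (including the made-up vehicle 9999 on route J) are assumptions, not part of the feed or of bs4. It tries the "xml" parser first and, if lxml is not installed, falls back to html.parser with the lowercased attribute name.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Inline sample modelled on the feed; the second vehicle is hypothetical.
SAMPLE = (
    '<body>'
    '<vehicle id="1534" routeTag="N" lat="37.76931" lon="-122.43249"/>'
    '<vehicle id="9999" routeTag="J" lat="37.70000" lon="-122.40000"/>'
    '</body>'
)


def vehicles_on_route(xml_text, route):
    """Return (id, lat, lon) tuples for vehicles whose routeTag equals route."""
    try:
        # "xml" needs lxml; attribute names keep their original case.
        soup = BeautifulSoup(xml_text, "xml")
        key = "routeTag"
    except FeatureNotFound:
        # Without lxml, parse as HTML, where names are lowercased.
        soup = BeautifulSoup(xml_text, "html.parser")
        key = "routetag"
    return [
        (v["id"], float(v["lat"]), float(v["lon"]))
        for v in soup.find_all("vehicle", {key: route})
    ]


print(vehicles_on_route(SAMPLE, "N"))  # [('1534', 37.76931, -122.43249)]
```

Against the live feed, the same function could be given the bytes returned by urlopen(url).read() in place of SAMPLE.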

【Comments】: