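How to implement python to find value between xml tags?

【Question title】: How to implement python to find value between xml tags?
【Posted】: 2010-06-17 15:52:15
【Question description】:

I am using the Google site to retrieve weather information, and I want to find the values between XML tags. The following code gives me the weather condition for a city, but I can't get the other parameters, such as the temperature. If possible, please also explain how the split function used in the code works: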

import urllib

def getWeather(city):

    #create google weather api url
    url = "http://www.google.com/ig/api?weather=" + urllib.quote(city)

    try:
        # open google weather api url
        f = urllib.urlopen(url)
    except:
        # if there was an error opening the url, return
        return "Error opening url"

    # read contents to a string
    s = f.read()

    # extract weather condition data from xml string
    weather = s.split("<current_conditions><condition data=\"")[-1].split("\"")[0]

    # if there was an error getting the condition, the city is invalid


    if weather == "<?xml version=":
        return "Invalid city"

    #return the weather condition
    return weather

def main():
    while True:
        city = raw_input("Give me a city: ")
        weather = getWeather(city)
        print(weather)

if __name__ == "__main__":
    main()

Thanks

【Question comments】:

See the related stackoverflow.com/questions/3106480 for a solution based on using an XML parser

Tags: python

【Solution 1】:

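USE A PARSER

You can't parse XML using regular expressions, so don't try. Here is a start to finding an XML parser in Python. Here is a good site for learning about parsing XML in Python.

Update: Given the new information about PyS60, here is the documentation for using XML on the Nokia site.

Update 2: @Nas Banov has requested sample code, so here it is: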

import urllib

from xml.parsers import expat

def start_element_handler(name, attrs):
    """
    My handler for the event that fires when the parser sees an
    opening tag in the XML.
    """
    # If we care about more than just the temp data, we can extend this
    # logic with ``elif``. If the XML gets really hairy, we can create a
    # ``dict`` of handler functions and index it by tag name, e.g.,
    # { 'humidity': humidity_handler }
    if 'temp_c' == name:
        print "The current temperature is %(data)s degrees Celsius." % attrs

def process_weather_conditions():
    """
    Main logic of the POC; set up the parser and handle resource
    cleanup.
    """
    my_parser = expat.ParserCreate()
    my_parser.StartElementHandler = start_element_handler

    # I don't know if the S60 supports try/finally, but that's not
    # the point of the POC.
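    # Open the URL before entering try/finally: if urlopen itself fails,
    # there is no file object to close.
    f = urllib.urlopen("http://www.google.com/ig/api?weather=30096")
    try:
        my_parser.ParseFile(f)
    finally:
        f.close()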

if __name__ == '__main__':
    process_weather_conditions()
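
The comment in start_element_handler mentions indexing a dict of handler functions by tag name. A minimal sketch of that idea follows, written for Python 3 and fed from an inline string instead of the network. SAMPLE_XML is a hypothetical response whose shape is only assumed from the tags discussed in this thread, and make_parser, temp_handler and humidity_handler are illustrative names, not part of the answer above.

```python
from xml.parsers import expat

# Hypothetical sample; the real service's response may look different.
SAMPLE_XML = (
    '<xml_api_reply><weather><current_conditions>'
    '<condition data="Sunny"/>'
    '<temp_c data="19"/>'
    '<humidity data="Humidity: 61%"/>'
    '</current_conditions></weather></xml_api_reply>'
)


def make_parser(results):
    """Build an expat parser that copies selected 'data' attributes into results."""

    def temp_handler(attrs):
        results["temp_c"] = attrs["data"]

    def humidity_handler(attrs):
        results["humidity"] = attrs["data"]

    # One handler per tag we care about; all other tags are ignored.
    handlers = {"temp_c": temp_handler, "humidity": humidity_handler}

    def start_element(name, attrs):
        handler = handlers.get(name)
        if handler is not None:
            handler(attrs)

    parser = expat.ParserCreate()
    parser.StartElementHandler = start_element
    return parser


results = {}
make_parser(results).Parse(SAMPLE_XML, True)  # True marks the final chunk
print(results)
```

Supporting another tag then only means adding one more entry to handlers, rather than growing an if/elif chain.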

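【Discussion】:

Thanks for the links, but I would really like to know how the split function above works, and why the same approach can't be used to find the temp_c tag value. I am a tyro in Python, and my module usage is limited.
Regular expressions are clearly insufficient for general XML parsing, for one reason (among many possible): XML can have arbitrarily nested tags. For a single specific document (not a schema, an actual XML document), you can sometimes get useful values out with a regular expression. When applied to similar documents (which an XML parser would handle fine), this kind of hack will fail (usually in production), because the formatting differs, or the new document has some new data in a new tag, and so on.
Really?! Regular expressions can't be used to parse any (as opposed to all) kind of XML? Even considering that the PyS60 site you linked to points to "a set of regular expressions that can parse XML content very easily and efficiently"? ...And even considering that DTDs are based on regular expressions, and that regexes nowadays cover far more than the class of "regular languages"?
@EnTerr I am well aware that you can put together a regular expression to extract data from a specific XML document, as my comment says "Regular expressions are clearly insufficient for general XML parsing...". It is equally clear that you understand that trying to use regular expressions on XML you don't control is brittle at best. I don't understand why you feel the need to argue with a straw man of your own creation, or why you would encourage Harshit to continue with an approach you know to be brittle.
@Hank Gay: In your main response you say "You can't parse XML using regular expressions, so don't try". That is incorrect, and that is what I am pointing out. You need to clarify it there, not just in some comment on your own answer. You can't expect me or anyone else to read your footnotes when you could have stated it plainly up front. Also, saying that one should always use a sledgehammer regardless of the size of the nail is dogmatic. Don't you see that the OP has trouble with split(), yet you want to break his back with DOM?

【Solution 2】:

I would suggest using an XML parser, just as Hank Gay suggested. My personal recommendation is lxml, since I am currently using it on a project; it extends the very useful ElementTree interface already present in the standard library (xml.etree).

lxml adds support for XPath, XSLT and various other features that are missing from the standard ElementTree module.

Whichever you choose, an XML parser is by far the best option, because you will be able to handle the XML document as Python objects. That means your code would look something like: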

# existing code up to...
s = f.read()
import lxml.etree as ET
tree = ET.parse(s)
current = tree.find("current_condition/condition")
condition_data = current.get("data")
weather = condition_data
return weather

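【Discussion】:

Thanks for the reply, but I am programming in PyS60 and need to do the task with restricted module usage.
Well, you can easily get the same functionality with the xml.etree module from the standard library. You don't need to install anything. From a bit of googling, this module appears to be included in the PyS60 subset: pys60.garage.maemo.org/doc/lib/…
OK, I tried importing cElementTree; I guess that will help. Will confirm after implementing. Thanks again.
Traceback (most recent call last): File "weatxml.py", line 36, in <module> main() File "weatxml.py", line 32, in main weather = getWeather(city) File "weatxml.py", line 18, in getWeather tree=ET.parse(s) File "/usr/lib/python2.6/xml/etree/ElementTree.py", line 862, in parse tree.parse(source, parser) File "/usr/lib/python2.6/xml/etree/ElementTree.py", line 579, in parse source = open(source, "rb") IOError: [Errno 2] No such file or directory: '
Sorry, I didn't mean to imply that my code would work exactly as written. Your error occurs because s is a string, while parse expects a file name or a file-like object. So "tree = ET.parse(f)" would probably work better. I suggest reading the ElementTree API so you understand what the functions I used above do in practice. Hope that helps; let me know if it works.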
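
The point in the last comment (parse wants a file, not a string) can be sketched with the standard library alone, here for Python 3. When the XML is already in a string, ET.fromstring is the counterpart of ET.parse. SAMPLE_XML is a hypothetical response assumed from the tags mentioned in this thread, and current_conditions is an illustrative helper name; the search path uses ".//" so it does not depend on how deeply the real response nests the element.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the real service's response may look different.
SAMPLE_XML = (
    '<xml_api_reply><weather><current_conditions>'
    '<condition data="Sunny"/>'
    '<temp_f data="67"/>'
    '<temp_c data="19"/>'
    '<humidity data="Humidity: 61%"/>'
    '</current_conditions></weather></xml_api_reply>'
)


def current_conditions(xml_text):
    """Return {tag: data attribute} for every child of <current_conditions>."""
    root = ET.fromstring(xml_text)  # parses a string; ET.parse() takes a file
    current = root.find(".//current_conditions")
    if current is None:
        return {}
    return {child.tag: child.get("data") for child in current}


weather = current_conditions(SAMPLE_XML)
print(weather["condition"], weather["temp_c"])
```

With a file-like object such as the one returned by urlopen, ET.parse(f).getroot() would take the place of ET.fromstring(xml_text).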

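【Solution 3】:

XML is structured data. You can do much better than using string manipulation to get data out of it. The standard library has the sax, dom and ElementTree modules, and there is also the high-quality lxml library; any of them will do the job for you far more reliably.

【Discussion】:

Actually, I am restricted in module usage because I am programming in PyS60.
sax, dom and ElementTree are part of the standard distribution. In any case, string-based XML parsing will break, and your code won't survive in the wild.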
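
To make the "will break" claim concrete, here is a small Python 3 comparison. Both sample strings are made up for illustration and carry the same data; the second only adds whitespace. condition_by_split mirrors the split expression from the question, and condition_by_parser is an illustrative ElementTree equivalent.

```python
import xml.etree.ElementTree as ET

# Two hypothetical documents with identical data; only the whitespace differs.
COMPACT = ('<current_conditions><condition data="Sunny"/>'
           '<temp_c data="19"/></current_conditions>')
SPACED = ('<current_conditions>\n'
          '  <condition  data="Sunny" />\n'
          '  <temp_c data="19" />\n'
          '</current_conditions>')


def condition_by_split(s):
    """The string-splitting approach from the question."""
    return s.split('<current_conditions><condition data="')[-1].split('"')[0]


def condition_by_parser(s):
    """The same lookup done with a real XML parser."""
    return ET.fromstring(s).find("condition").get("data")


for doc in (COMPACT, SPACED):
    print(repr(condition_by_split(doc)), repr(condition_by_parser(doc)))
```

On the compact document both functions return the condition. On the spaced one the separator string is never found, so the split version silently returns a fragment of markup, while the parser version is unaffected.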

【Solution 4】:

Well, here is a solution for your particular case that does not use a full parser:

import urllib

def getWeather(city):
    ''' given city name or postal code,
        return dictionary with current weather conditions
    '''
    url = 'http://www.google.com/ig/api?weather='
    try:
        f = urllib.urlopen(url + urllib.quote(city))
    except:
        return "Error opening url"
    s = f.read().replace('\r','').replace('\n','')
    if '<problem' in s:
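        return "Problem retrieving weather (invalid city?)"

    # Keep only what sits between the <current_conditions> tags, then trim
    # the outer '<' and '/>' plus the closing quote of the last attribute.
    weather = s.split('</current_conditions>')[0]  \
               .split('<current_conditions>')[-1]  \
               .strip('</>"')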
    wdict = dict(i.split(' data="') for i in weather.split('"/><'))
    return wdict

And a usage example:

>>> weather = getWeather('94043')
>>> weather
{'temp_f': '67', 'temp_c': '19', 'humidity': 'Humidity: 61%', 'wind_condition': 'Wind: N at 21 mph', 'condition': 'Sunny', 'icon': '/ig/images/weather/sunny.gif'}
>>> weather['humidity']
'Humidity: 61%'
>>> print '%(condition)s\nTemperature %(temp_c)s C (%(temp_f)s F)\n%(humidity)s\n%(wind_condition)s' % weather
Sunny
Temperature 19 C (67 F)
Humidity: 61%
Wind: N at 21 mph

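PS. Note that a fairly trivial change in Google's output format would break this, for example if they were to add extra spaces or tabs between tags or attributes. They avoid doing that to reduce the size of the HTTP response. But if they did, we would have to get familiar with regular expressions and re.split().

PPS. How str.split(sep) works is explained in the documentation; here is an excerpt: Return a list of the words in the string, using sep as the delimiter string. ... The sep argument may consist of multiple characters (for example, '1<>2<>3'.split('<>') returns ['1', '2', '3']). So 'text1<tag>text2</tag>text3'.split('</tag>') gives us ['text1<tag>text2', 'text3'], then [0] picks the first element 'text1<tag>text2', then we split on '<tag>' and pick 'text2', which contains the data we are interested in. Pretty basic, really.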
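
The chain described in the PPS can be followed one step at a time. This is only an illustrative walk-through in Python 3: the variable names are made up, and line is a hypothetical, shortened stand-in for the real response.

```python
s = 'text1<tag>text2</tag>text3'

# Step 1: cut at the closing tag and keep what comes before it.
before_close = s.split('</tag>')[0]
# Step 2: cut that piece at the opening tag and keep what comes after it.
inner = before_close.split('<tag>')[-1]
print(before_close)
print(inner)

# The same idea applied to a weather-like line (hypothetical sample).
line = ('<current_conditions><condition data="Sunny"/>'
        '<temp_c data="19"/></current_conditions>')
body = (line.split('</current_conditions>')[0]
            .split('<current_conditions>')[-1]
            .strip('</>"'))  # trim the outer '<', '/>' and the last quote
# Each piece now looks like: name data="value
pairs = [piece.split(' data="') for piece in body.split('"/><')]
wdict = dict(pairs)
print(wdict)
```

Indexing with [0] keeps the part before a separator and [-1] the part after the last one, which is why the two splits together isolate the text between a pair of tags.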

【Discussion】:

Could you explain this .split("..........")[0]\... I mean, an explanation of the logic behind it would be helpful. Thanks.