【Question title】: Fetching data using Python & lxml
【Posted】: 2012-03-28 13:26:38
【Question】:

I have HTML like the one shown below. I want to fetch the text inside <span class="zzAggregateRatingStat">. For the example given below, I should get 3 and 5.

For this job I am using Python 2.7 and lxml.

<div class="pp-meta-review">
<span class="zrvwidget" style="">
    <span g:inline="true" g:type="NumUsersFoundThisHelpful" g:hideonnoratings="true" g:entity.annotation.groups="maps"    g:entity.annotation.id="http://maps.google.com/?q=Central+Kia+of+Irving++(972)+659-2204+loc:+1600+East+Airport+Freeway,+Irving,+TX+75062&gl=US&sll=32.83624,-96.92526" g:entity.annotation.author="AIe9_BH8MR-1JD_4BhwsKrGCazUyU5siqCtjchckDcg5BAl5rOLd9nvhJJDTrtjL-xFI8D42bD_7">
        <span class="zzNumUsersFoundThisHelpfulActive" zzlabel="helpful">
            <span>
                <span class="zzAggregateRatingStat">3</span>
            </span>
            <span>
                <span>&nbsp;</span>
                      out of
                <span>&nbsp;</span>
            </span>
            <span>
                <span class="zzAggregateRatingStat">5</span>
            </span>
            <span>
                <span>&nbsp;</span>
                    people found this review helpful.
            </span>
       </span>
   </span>
</span>
</div>

【Question discussion】:

  • Fetch.
  • ...and complete the question by showing what you have tried.
  • I am really sorry for the typo. Stack Overflow treated it as HTML markup.

Tags: python web-scraping lxml python-2.7


【Solution 1】:

The following code works for your input:

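import lxml.html

# Parse the saved page and select every span whose class attribute
# is exactly "zzAggregateRatingStat".
root = lxml.html.parse('text.html').getroot()
for span in root.xpath('//span[@class="zzAggregateRatingStat"]'):
    print(span.text)  # call form works on Python 2.7 and 3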

This prints:

3
5

I prefer lxml's xpath over CSSSelector, although both can do the job.
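As a small variation on the loop above, an XPath expression can end in text() so that the strings come back directly instead of the elements. This is only a sketch: it parses an inline string rather than text.html, and the html_snippet variable with its trimmed-down markup is an assumption made for illustration.

```python
import lxml.html

# Trimmed-down stand-in for the question's markup (an assumption for illustration).
html_snippet = """
<div class="pp-meta-review">
  <span><span class="zzAggregateRatingStat">3</span></span>
  <span>out of</span>
  <span><span class="zzAggregateRatingStat">5</span></span>
</div>
"""

root = lxml.html.fromstring(html_snippet)

# Ending the XPath in text() yields the text nodes themselves, not the elements.
ratings = root.xpath('//span[@class="zzAggregateRatingStat"]/text()')

# Convert to integers in case the numbers are needed for arithmetic later.
helpful, total = [int(value) for value in ratings]
print("%d out of %d" % (helpful, total))
```

Note that @class="zzAggregateRatingStat" only matches when the class attribute is exactly that string, which is the case for the markup in the question.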

ChrisP's example prints 3, but if you run it on your actual input, you get an error:

$ python chrisp.py
Traceback (most recent call last):
  File "chrisp.py", line 6, in <module>
    doc = fromstring(text)
  File "lxml.etree.pyx", line 2532, in lxml.etree.fromstring (src/lxml/lxml.etree.c:48270)
  File "parser.pxi", line 1545, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:71812)
  File "parser.pxi", line 1424, in lxml.etree._parseDoc (src/lxml/lxml.etree.c:70673)
  File "parser.pxi", line 938, in lxml.etree._BaseParser._parseDoc (src/lxml/lxml.etree.c:67442)
  File "parser.pxi", line 539, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:63824)
  File "parser.pxi", line 625, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:64745)
  File "parser.pxi", line 565, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:64088)
lxml.etree.XMLSyntaxError: EntityRef: expecting ';', line 3, column 210

ChrisP's code can be changed to use lxml.html.fromstring, which is a more lenient parser, instead of lxml.etree.fromstring.

With that change, it prints 3.
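The difference between the two parsers can be reproduced on a small input. The sketch below is an illustration under assumptions: the text string and its example.com URL are made up, and the only property that matters is the bare & inside an attribute value, which is what the "EntityRef: expecting ';'" message in the traceback above complains about.

```python
import lxml.etree
import lxml.html

# Made-up stand-in for the real page: the attribute value contains a bare "&",
# like the g:entity.annotation.id attribute in the question.
text = ('<div><span title="http://example.com/?q=a&gl=US">'
        '<span class="zzAggregateRatingStat">3</span></span></div>')

# The strict XML parser rejects the unescaped ampersand.
try:
    lxml.etree.fromstring(text)
    strict_error = None
except lxml.etree.XMLSyntaxError as exc:
    strict_error = str(exc)
print(strict_error)

# The HTML parser tolerates it and still finds the span.
doc = lxml.html.fromstring(text)
lenient_values = doc.xpath('//span[@class="zzAggregateRatingStat"]/text()')
print(lenient_values)
```

Escaping the ampersand as &amp;amp; in the source would also satisfy the strict parser, but that is rarely an option for pages you do not control.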

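【Discussion】:

  • Hi, thanks for the reply. I could not get your code to work on the site maps.google.com/maps/…. It keeps giving different errors.
  • Changing lxml.etree.fromstring to lxml.html.fromstring worked! Thanks! The only problem is that you don't have a pretty_print option in lxml.html :(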
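On the pretty_print remark: as far as I know, lxml.html.tostring does accept a pretty_print keyword, so a tree parsed with lxml.html can still be serialized with indentation. A minimal sketch, with a made-up input string; how much whitespace the HTML serializer actually adds depends on the elements involved, so no particular layout is promised here.

```python
import lxml.html

# Made-up input for illustration.
doc = lxml.html.fromstring(
    '<div><p>Rating:</p><span class="zzAggregateRatingStat">3</span></div>'
)

# pretty_print asks the serializer to add line breaks where HTML allows it;
# encoding="unicode" returns text instead of bytes.
pretty = lxml.html.tostring(doc, pretty_print=True, encoding="unicode")
print(pretty)
```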
【Solution 2】:

This is clearly documented at the lxml website:

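from lxml.etree import fromstring
from lxml.cssselect import CSSSelector

# Compile a CSS selector that matches elements with the given class.
sel = CSSSelector('.zzAggregateRatingStat')
text = '<span><span class="zzAggregateRatingStat">3</span></span>'
doc = fromstring(text)
# Calling the selector on a tree returns a list of matches; take the first.
el = sel(doc)[0]
print(el.text)  # call form works on Python 2.7 and 3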

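【Discussion】:

  • Thanks for your answer. I have been trying this code on the site maps.google.com/maps/…, but all in vain. Could you please take a look?
  • @Zulaikha, if you want to get business ratings, you may want to look at the APIs provided by Google and Yelp instead of scraping the page.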