Scraper throws an error even if right xpath is used

【Question title】: Scraper throws an error even if right xpath is used
【Posted】: 2018-06-01 18:27:57
【Question】:

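I've written a script in python, using the lxml library, to parse some prices (80 and 100 in this case) out of a chunk of html elements. I used xpaths to do the job. All the xpaths in the scraper below work smoothly when I parse with .fromstring(). However, when I use HTML imported from lxml.etree instead, the xpath expression that uses contains() fails. It turns out that matching the full compound class name works, but selecting on a single class name out of a compound class name throws an error.

How can I handle this situation without using compound class names, i.e. by matching a single class name with a contains() pattern or something similar?

Here is my attempt: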

from lxml.etree import HTML

elements =\
"""
    <li class="ProductPrice">
      <span class="Regular Price">80.00</span>
    </li>
    <li class="ProductPrice">
      <span class="Regular Price">100.00</span>
    </li>
"""
root = HTML(elements)
for item in root.findall(".//*[@class='ProductPrice']"):
    # regular = item.find('.//span[@class="Regular Price"]').text
    regular = item.find('.//span[contains(@class,"Regular")]').text
    print(regular)

By the way, the commented-out xpath in the script above works fine. The contains() expression, however, does not; it throws the following error:

Traceback (most recent call last):
  File "C:\Users\WCS\AppData\Local\Programs\Python\Python36-32\SO.py", line 15, in <module>
    regular = item.find('.//span[contains(@class,"Regular")]').text
  File "src\lxml\etree.pyx", line 1526, in lxml.etree._Element.find
  File "src\lxml\_elementpath.py", line 311, in lxml._elementpath.find
  File "src\lxml\_elementpath.py", line 300, in lxml._elementpath.iterfind
  File "src\lxml\_elementpath.py", line 283, in lxml._elementpath._build_path_iterator
  File "src\lxml\_elementpath.py", line 229, in lxml._elementpath.prepare_predicate
SyntaxError: invalid predicate

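One last thing: I do not wish to use compound class names, because some sites generate them dynamically. Thanks.

【Comments】:

.find() only supports basic xpath. Use .xpath() instead, e.g. regular = item.xpath('.//span[contains(@class,"Regular")]')[0].text (untested). lxml.de/xpathxslt.html
Thanks @Daniel Haley for the quick reply. Both .xpath() and .cssselect() seem to work the same whether used with .fromstring() or .HTML(). You should post this as an answer so that I can accept it.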

Tags: python python-3.x xpath web-scraping lxml

【Solution 1】:

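.find() only supports basic xpath (the limited ElementPath subset, which has no functions such as contains()).

Use .xpath() instead.

Example (untested)...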

regular = item.xpath('.//span[contains(@class,"Regular")]')[0].text

See http://lxml.de/xpathxslt.html for more details.
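For completeness, here is a minimal runnable sketch of the fix applied to the markup from the question. It assumes lxml is installed. The names prices, strict_prices and token_xpath are introduced here for illustration only and are not part of the original post. Note that .xpath() always returns a list, so the sketch checks for an empty result instead of indexing blindly. The second expression is an optional, stricter variant using the common XPath 1.0 idiom for matching one whole class token, since a plain contains(@class, "Regular") would also match a class such as "Irregular".

```python
from lxml.etree import HTML

elements = """
    <li class="ProductPrice">
      <span class="Regular Price">80.00</span>
    </li>
    <li class="ProductPrice">
      <span class="Regular Price">100.00</span>
    </li>
"""

root = HTML(elements)

# Substring match on the class attribute, evaluated by the full XPath engine.
prices = []
for item in root.xpath('.//*[@class="ProductPrice"]'):
    matches = item.xpath('.//span[contains(@class, "Regular")]')
    if matches:  # .xpath() returns a list; it may be empty
        prices.append(matches[0].text)

# Stricter variant (illustrative): match "Regular" only as a whole class token.
token_xpath = (
    './/span[contains(concat(" ", normalize-space(@class), " "), " Regular ")]'
)
strict_prices = [span.text for span in root.xpath(token_xpath)]

print(prices)
print(strict_prices)
```

With the sample markup above, both lists should contain the two price strings, 80.00 and 100.00.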
