python - How to get an array of items using XPath in Python Scrapy?

【Question title】: How to get array of items using xpath in python scrapy?
【Posted】: 2014-10-04 17:04:44
【Question】:

I need to fetch and parse an array of divs from an HTML page. This is what I wrote:

def parse_public(self, response):
    hxs = Selector(response)
    posts = hxs.xpath("//*div[matches(@id, 'wall-28701979_\d{5}')")
    # or something like this
    # posts = hxs.findall("//div[starts-with(@id,'wall-28701979_')")
    print posts

The full XPath is: //*[@id="wall-28701979_XXXXX"]/div[2]/div[1]/text(), where XXXXX is a random 5-digit number. So I need to get all such elements from the page. But I get an exceptions.ValueError: Invalid XPath: . How can I fix this? Thanks.

【Comments】:

You might want to provide a sample of the markup you are trying to parse.

Tags: python regex xpath web-scraping scrapy

【Answer 1】:

matches() is only available in XPath 2.0. Scrapy (well, lxml) supports only XPath 1.0.

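Your expression is also missing its closing ] (and //*div should be //div), which on its own is enough to make the XPath invalid.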

Instead, you can use starts-with():

hxs.xpath("//div[starts-with(@id, 'wall-28701979_')]")

Alternatively, you can use re:test, an EXSLT regular-expression extension that Scrapy selectors support. Demo from the scrapy shell:

$ cat index.html
<div>
    <div id="wall-28701979_12345">test1</div>
    <div id="wall-28701979_21231">test2</div>
    <div id="wall-28701979_31233">test3</div>
</div>
$ scrapy shell index.html
>>> response.xpath('//div[re:test(@id, "wall-28701979_\d{5}")]/text()').extract()
[u'test1', u'test2', u'test3']

【Comments】: