Extracting text from a CSS node in Scrapy: answers

【Question title】: Extracting text from a CSS node in Scrapy
【Posted】: 2018-08-12 04:20:51
【Question description】:

I am trying to get the catalog ID number from this page:

from scrapy.selector import Selector
from scrapy.http import HtmlResponse

url = 'http://www.enciclovida.mx/busquedas/resultados?utf8=%E2%9C%93&busqueda=basica&id=&nombre=astomiopsis+exserta&button='

response = HtmlResponse(url=url)

using the CSS selector (which works in R with rvest::html_nodes)

".result-nombre-container > h5:nth-child(2) > a:nth-child(1)"

I want to retrieve the catalog ID, which in this case should be:

If this is easier to do with XPath, that is fine with me.

【Question comments】:

Could you post the complete code you are using? Maybe I can help.

Tags: python css scrapy

【Solution 1】:

I don't have Scrapy here, but I tested this XPath, and it gives you the href:

//div[contains(@class, 'result-nombre-container')]/h5[2]/a/@href

If you run into too much trouble with Scrapy and the CSS selector syntax, I would also suggest trying the BeautifulSoup Python package. With BeautifulSoup, you can do something like this:

link.get('href')
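
A slightly fuller sketch of the BeautifulSoup route, for illustration only. It assumes the `bs4` package is installed. The HTML fragment is a made-up stand-in for the results page (the real markup and href format are not shown in this thread), and `sample_html` and `extract_href` are hypothetical names.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment that mimics the structure implied by the selectors
# in this thread; the real page's markup and href format may differ.
sample_html = """
<div class="result-nombre-container">
  <h5>Common name</h5>
  <h5><a href="/example/path/12345">Astomiopsis exserta</a></h5>
</div>
"""

def extract_href(html):
    """Return the href of the link picked by the question's CSS selector, or None."""
    soup = BeautifulSoup(html, "html.parser")
    link = soup.select_one(
        ".result-nombre-container > h5:nth-child(2) > a:nth-child(1)"
    )
    # select_one returns None when nothing matches, so guard before .get()
    return link.get("href") if link is not None else None

print(extract_href(sample_html))
```

The guard matters because `select_one` returns `None` when the selector matches nothing, and calling `.get` on `None` would raise an `AttributeError`.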


【Solution 2】:

If you need to parse the ID out of the href:

catalog_id = response.xpath("//div[contains(@class, 'result-nombre-container')]/h5[2]/a/@href").re_first( r'(\d+)$' )


【Solution 3】:

There appears to be only one link inside an h5 element. In short:

response.css('h5 > a::attr(href)').re(r'(\d+)$')
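
To see what the `(\d+)$` pattern captures on its own, independent of Scrapy, a small sketch with the standard `re` module follows. The sample hrefs and the helper name `trailing_id` are invented for illustration. On the Scrapy side, `.re()` returns a list of matches while `.re_first()` returns a single string or `None`, so the list form needs indexing before use.

```python
import re

# Matches a run of digits anchored at the very end of the string.
TRAILING_ID = re.compile(r"(\d+)$")

def trailing_id(href):
    """Return the digits that end `href`, or None if it does not end in digits."""
    match = TRAILING_ID.search(href)
    return match.group(1) if match else None

# Invented hrefs, only to show the matching behaviour.
for href in ["/example/path/12345", "/example/12345/details", "/example/path/"]:
    print(href, "->", trailing_id(href))
```

Only the first href ends in digits, so only it yields an ID; digits in the middle of a path are ignored because of the `$` anchor.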
