[Posted]: 2015-09-10 14:04:25
[Question]:
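When trying to scrape certain elements of a website, I can't work out which part of the XPath to select. In this case, I am trying to scrape all of the websites linked in this article, for example the link in this part of the markup:
data-track="Body Text Link: External" href="http://www.uspreventiveservicestaskforce.org/Page/Document/RecommendationStatementFinal/brca-related-cancer-risk-assessment-genetic-counseling-and-genetic-testing">
My spider runs, but it doesn't scrape anything!
My code is below: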
import scrapy
from scrapy.selector import Selector

from nymag.items import nymagItem


class nymagSpider(scrapy.Spider):
    name = 'nymag'
    # allowed_domains takes bare domain names, not URLs with a scheme
    allowed_domains = ['nymag.com']
    start_urls = ["http://nymag.com/thecut/2015/09/should-we-all-get-the-breast-cancer-gene-test.html"]

    def parse(self, response):
        # I'm pretty sure the below line is the issue
        # (the XPath expression has to be passed as a string)
        links = Selector(response).xpath('//*[@id="primary"]/main/article/div/span')
        for link in links:
            item = nymagItem()
            # This might also be wrong - am trying to extract the href section
            # (query the loop variable `link`; `question` was never defined)
            item['link'] = link.xpath('a/@href').extract()
            yield item
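One way to think about the selector, as a minimal sketch rather than a confirmed fix: target the `a` elements by their `data-track` attribute instead of walking a long positional path. The sample markup below is a hypothetical, simplified stand-in for the article page (the `Internal` link and the `example.com` URL are invented for illustration), and it uses the standard library's `xml.etree.ElementTree` so that it runs without Scrapy. In a spider, the analogous expression would presumably be `response.xpath('//a[@data-track="Body Text Link: External"]/@href').extract()`.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the article markup (not the real page).
SAMPLE_HTML = """
<div id="primary">
  <main>
    <article>
      <div>
        <p>See the <a data-track="Body Text Link: External" href="http://www.uspreventiveservicestaskforce.org/Page/Document/RecommendationStatementFinal/brca-related-cancer-risk-assessment-genetic-counseling-and-genetic-testing">recommendation</a>.</p>
        <p><a data-track="Body Text Link: Internal" href="http://nymag.com/thecut/">internal link</a></p>
        <p><a data-track="Body Text Link: External" href="http://example.com/study">another external link</a></p>
      </div>
    </article>
  </main>
</div>
"""


def external_links(markup):
    """Return the href of every <a> whose data-track marks an external body link."""
    root = ET.fromstring(markup)
    # Match on the attribute value, at any depth, instead of a positional path.
    anchors = root.findall(".//a[@data-track='Body Text Link: External']")
    return [a.get("href") for a in anchors]


links = external_links(SAMPLE_HTML)
for href in links:
    print(href)
```

Selecting by attribute tends to be less brittle than a path such as `/main/article/div/span`, because it does not depend on how deeply the links happen to be nested.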
[Discussion]:
Tags: python xpath web-scraping scrapy