Scrapy returning empty array from XPath

【Question title】: Scrapy returning empty array from XPath
【Posted】: 2018-06-26 19:36:52
【Question】:

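I'm trying to collect data about athletes from the following web page: https://www.athletic.net/TrackAndField/Athlete.aspx?AID=7844096#!/L4. I've been able to collect the athlete's name, but I'm having trouble collecting their school name using the same method. I know the school name is contained as text in a link inside a block, but the selector just returns an empty array.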

Here is my code:

import scrapy

class AthletesSpider(scrapy.Spider):
    name = 'athletes'
    allowed_domains = ['athletic.net']
    start_urls = ['https://www.athletic.net/TrackAndField/Athlete.aspx?AID=7844096#!/L0']

    def parse(self, response):
        yield {
            'athlete_name' : response.xpath("//h2/text()").extract_first(),
            'school_name' : response.xpath("//h1/a/text()").extract_first()
        }

Am I missing something?

【Question discussion】:

Tags: python python-3.x xpath scrapy web-crawler

【Solution 1】:

Add the missing comma between the two entries in the dictionary:

import scrapy

class AthletesSpider(scrapy.Spider):
    name = 'athletes'
    allowed_domains = ['athletic.net']
    start_urls = ['https://www.athletic.net/TrackAndField/Athlete.aspx?AID=7844096#!/L0']

    def parse(self, response):
        yield {
            'athlete_name' : response.xpath("//h2/text()").extract_first(),  # <-- comma added here
            'school_name' : response.xpath("//h1/a/text()").extract_first()
        }
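
As a small illustration of why the comma matters (plain Python, independent of Scrapy): without a comma between two entries, a dictionary literal is not valid Python and is rejected before any code runs. The helper name `is_valid_expression` and the placeholder values are only assumptions for this sketch.

```python
def is_valid_expression(source: str) -> bool:
    """Return True if `source` compiles as a single Python expression."""
    try:
        compile(source, "<example>", "eval")
    except SyntaxError:
        return False
    return True


# Same shape as the dictionary in the spider, with placeholder values.
missing_comma = "{'athlete_name': 1 'school_name': 2}"
with_comma = "{'athlete_name': 1, 'school_name': 2}"

print(is_valid_expression(missing_comma))
print(is_valid_expression(with_comma))
```

The check only compiles the text and never evaluates it, so it shows the syntax problem in isolation.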

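【Discussion】:

Oh my gosh, thank you, that was so silly of me. But the second line still returns an empty array instead of the school name. Is there anything else missing?
One thing you can try (if you have Chrome) is to inspect the page, find the element, right-click it, and click Copy XPath. That's what I usually use to identify elements easily.
I got: //*[@id="anetMain"]/div[3]/team-nav/div/div/team-nav-logo/div/div/h1/a for the school element.
Oh, that's a useful tip! But when I try running 'school_name' : response.xpath("//*[@id="anetMain"]/div[3]/team-nav/div/div/team-nav-logo/div/div/h1/a").extract_first(), I get another "invalid syntax" error.
That's because the XPath itself contains double quotes, so you need to wrap it in single quotes rather than double quotes :)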
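
A minimal sketch of the quoting point in the last comment (plain Python, no Scrapy needed): a double-quoted string literal ends at the first unescaped double quote, so the copied XPath has to be wrapped in single quotes or have its inner quotes escaped. The variable names are only illustrative.

```python
# The XPath copied from Chrome contains double quotes around the id value.
# Wrapping it in single quotes keeps those inner quotes as ordinary characters.
xpath_single_quoted = '//*[@id="anetMain"]/div[3]/team-nav/div/div/team-nav-logo/div/div/h1/a'

# Equivalent alternative: keep the outer double quotes and escape the inner ones.
xpath_escaped = "//*[@id=\"anetMain\"]/div[3]/team-nav/div/div/team-nav-logo/div/div/h1/a"

# Both literals produce exactly the same string.
print(xpath_single_quoted == xpath_escaped)
```

Either form can then be passed to response.xpath(...). This only addresses the syntax error; it does not by itself guarantee that the expression matches anything in the response Scrapy receives.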