Using Python Scrapy to extract XPath on a live soccer site

【Question title】: Using Python Scrapy to extract XPATH in a soccer live site
【Posted】: 2022-06-11 17:55:29
【Question description】:

I am trying to use Scrapy to return the results and statistics of live games on SofaScore.

Site: https://www.sofascore.com/

The code is below:

import scrapy


class SofascoreSpider(scrapy.Spider):
    name = 'SofaScore'
    allowed_domains = ['sofascore.com']
    start_urls = ['http://sofascore.com/']

    def parse(self, response):
        time1 = response.xpath("/html/body/div[1]/main/div/div[2]/div/div[3]/div[2]/div/div/div/div/div[2]/a/div/div").extract()
        print(time1)
        pass

I also tried response.xpath("//html/body/div[1]/main/div/div[2]/div/div[3]/div[2]/div/div/div/div/div[2]/a/div/div").getall(), but it returns nothing either. I have tried many different XPaths and none of them return anything. What am I doing wrong?

For example, today (10/06) the first match on the page is France vs Austria, with XPath: /html/body/div[1]/main/div/div[2]/div/div[3]/div[2]/div/div/div/div/div[2]/a/div/div

【Comments】:

Tags: javascript python html css scrapy

【Solution 1】:

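The data is generated by JavaScript, so it is not in the HTML that Scrapy downloads and the XPath has nothing to match. You can get the same data from the API instead.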
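One quick way to confirm this kind of problem is to check whether text you can see in the browser appears anywhere in the raw HTML that was downloaded. The sketch below is only an illustration: `text_in_static_html` is a hypothetical helper, and `static_html` is a made-up stand-in for `response.text`, not the real SofaScore page.

```python
def text_in_static_html(html: str, needle: str) -> bool:
    """Return True if `needle` occurs in the raw HTML, ignoring case."""
    return needle.casefold() in html.casefold()


# Hypothetical stand-in for response.text: an empty app shell plus a script tag.
static_html = (
    '<html><body><div id="app"></div>'
    '<script src="/app.js"></script></body></html>'
)

# A team name that is visible in the browser but absent from the source.
print(text_in_static_html(static_html, "France"))
# Markup that really is part of the downloaded source.
print(text_in_static_html(static_html, 'id="app"'))
```

Inside a spider, the same check can be run on `response.text` in `parse`. If the text is missing there, no XPath will find it, however the expression is written.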

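Open devtools in your browser and click the network tab. Then click the live button and see where the page loads its data from. Then look at the JSON file to understand its structure.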
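To get a feel for the structure of a saved JSON response, it can help to print an outline of its keys and value types. This is a minimal sketch using only the standard library. The `outline` helper is hypothetical, and the sample payload is invented to mirror just the keys used in this answer (`events`, `tournament`, `name`, `time`); the real response contains many more fields.

```python
import json


def outline(value, indent=0, max_items=1):
    """Return lines describing the keys and value types of decoded JSON."""
    pad = "  " * indent
    lines = []
    if isinstance(value, dict):
        for key, child in value.items():
            lines.append(f"{pad}{key}: {type(child).__name__}")
            lines.extend(outline(child, indent + 1, max_items))
    elif isinstance(value, list):
        # Only inspect the first few items; list entries usually share a shape.
        for child in value[:max_items]:
            lines.extend(outline(child, indent, max_items))
    return lines


# Invented sample payload, not real API output.
sample = json.loads(
    '{"events": [{"tournament": {"name": "Example Cup"}, "time": {}}]}'
)

for line in outline(sample):
    print(line)
```

The outline shows which keys are nested dicts or lists, which tells you the chain of lookups needed to reach a given value.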

import scrapy


class SofascoreSpider(scrapy.Spider):
    name = 'SofaScore'
    allowed_domains = ['sofascore.com']
    start_urls = ['https://api.sofascore.com/api/v1/sport/football/events/live']
    custom_settings = {'DOWNLOAD_DELAY': 0.4}

    def start_requests(self):
        headers = {
            "Accept": "*/*",
            "Accept-Encoding": "gzip, deflate, br",
            "Accept-Language": "en-US,en;q=0.5",
            "Cache-Control": "no-cache",
            "Connection": "keep-alive",
            "DNT": "1",
            "Host": "api.sofascore.com",
            "Origin": "https://www.sofascore.com",
            "Pragma": "no-cache",
            "Referer": "https://www.sofascore.com/",
            "Sec-Fetch-Dest": "empty",
            "Sec-Fetch-Mode": "cors",
            "Sec-Fetch-Site": "same-site",
            "Sec-GPC": "1",
            "TE": "trailers",
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36"
        }
        yield scrapy.Request(url=self.start_urls[0], headers=headers)

    def parse(self, response):
        events = response.json()
        events = events['events']
        # now iterate through the list and get what you want from it
        # example:
        for event in events:
            yield {
                'event name': event['tournament']['name'],
                'time': event['time']
            }
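A chained lookup such as `event['tournament']['name']` raises `KeyError` if a key is missing from an event. Whether the API ever omits these keys is not something this answer confirms, so the following is only a defensive sketch: `dig` and `to_item` are hypothetical helpers, and the sample events are made up.

```python
def dig(data, *keys, default=None):
    """Follow `keys` through nested dicts, returning `default` on any miss."""
    current = data
    for key in keys:
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


def to_item(event):
    """Build the same item as the spider above, tolerating missing fields."""
    return {
        'event name': dig(event, 'tournament', 'name'),
        'time': dig(event, 'time'),
    }


# Made-up events: one complete, one with the tournament missing.
complete = {'tournament': {'name': 'Example Cup'}, 'time': {}}
partial = {'time': {}}

print(to_item(complete))
print(to_item(partial))
```

In the spider, `parse` could then `yield to_item(event)` for each event instead of building the dict inline.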

【Comments】: