【Question title】: Using Python Scrapy to extract XPATH in a soccer live site
【Posted】: 2022-06-11 17:55:29
【Question】:

I am trying to use Scrapy to return the results and statistics of live games on SofaScore.

Website: https://www.sofascore.com/

The code is below:

import scrapy


class SofascoreSpider(scrapy.Spider):
    name = 'SofaScore'
    allowed_domains = ['sofascore.com']
    start_urls = ['http://sofascore.com/']

    def parse(self, response):
        time1 = response.xpath("/html/body/div[1]/main/div/div[2]/div/div[3]/div[2]/div/div/div/div/div[2]/a/div/div").extract()
        print(time1)

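I also tried response.xpath("//html/body/div[1]/main/div/div[2]/div/div[3]/div[2]/div/div/div/div/div[2]/a/div/div").getall(), but it returned nothing. I have tried many different XPaths, and none of them returned anything. What am I doing wrong?

For example, the first match on the page today (10/06) is France vs Austria, and its XPath is: /html/body/div[1]/main/div/div[2]/div/div[3]/div[2]/div/div/div/div/div[2]/a/div/div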

【Comments on the question】:

    Tags: javascript python html css scrapy


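    【Solution 1】:

    The data is generated with JavaScript, so it is not in the HTML that Scrapy downloads, but you can get it from the API.

    Open devtools in your browser and click the Network tab. Then click the Live button and see where the page loads its data from. Then look at the JSON response to understand its structure.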
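
    When the JSON response is large, a small helper can make its structure easier to read than scrolling through it in devtools. The following is only a sketch using the standard library: the `outline` function and the sample payload are hypothetical and hand-written for illustration, and the real API response may be shaped differently.

```python
import json


def outline(value, depth=0, max_depth=4):
    """Return lines describing the keys and value types of parsed JSON."""
    indent = "  " * depth
    if depth >= max_depth:
        # Stop descending so deeply nested responses stay readable.
        return [f"{indent}..."]
    if isinstance(value, dict):
        lines = []
        for key, child in value.items():
            lines.append(f"{indent}{key}: {type(child).__name__}")
            if isinstance(child, (dict, list)):
                lines.extend(outline(child, depth + 1, max_depth))
        return lines
    if isinstance(value, list):
        # Only the first item is described, assuming the items share one shape.
        if not value:
            return []
        first = value[0]
        lines = [f"{indent}[0]: {type(first).__name__}"]
        if isinstance(first, (dict, list)):
            lines.extend(outline(first, depth + 1, max_depth))
        return lines
    return []


# Hypothetical payload written by hand; it is not a captured API response.
sample = json.loads(
    '{"events": [{"tournament": {"name": "Example Cup"}, "time": {}}]}'
)
print("\n".join(outline(sample)))
```

    To use it on real data, save the response body from the Network tab to a file and pass the result of `json.load` to `outline`.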

    import scrapy
    
    
    class SofascoreSpider(scrapy.Spider):
        name = 'SofaScore'
        allowed_domains = ['sofascore.com']
        start_urls = ['https://api.sofascore.com/api/v1/sport/football/events/live']
        custom_settings = {'DOWNLOAD_DELAY': 0.4}
    
        def start_requests(self):
            headers = {
                "Accept": "*/*",
                "Accept-Encoding": "gzip, deflate, br",
                "Accept-Language": "en-US,en;q=0.5",
                "Cache-Control": "no-cache",
                "Connection": "keep-alive",
                "DNT": "1",
                "Host": "api.sofascore.com",
                "Origin": "https://www.sofascore.com",
                "Pragma": "no-cache",
                "Referer": "https://www.sofascore.com/",
                "Sec-Fetch-Dest": "empty",
                "Sec-Fetch-Mode": "cors",
                "Sec-Fetch-Site": "same-site",
                "Sec-GPC": "1",
                "TE": "trailers",
                "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36"
            }
            yield scrapy.Request(url=self.start_urls[0], headers=headers)
    
        def parse(self, response):
            events = response.json()
            events = events['events']
            # now iterate through the list and get what you want from it
            # example:
            for event in events:
                yield {
                    'event name': event['tournament']['name'],
                    'time': event['time']
                }
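
    The `parse` method above indexes the JSON directly, so a missing key would raise a `KeyError`. A more defensive variant might look like the sketch below. The key names (`events`, `tournament`, `name`, `time`) are taken from the code above; the `summarize_events` helper and the sample payload are hypothetical and only illustrate the idea.

```python
def summarize_events(payload):
    """Yield one flat dict per event, skipping events without a tournament name."""
    for event in payload.get("events", []):
        # "tournament" may be absent or null, so fall back to an empty dict.
        tournament = event.get("tournament") or {}
        name = tournament.get("name")
        if name is None:
            continue
        yield {"event name": name, "time": event.get("time")}


# Hypothetical payload written by hand; it is not a captured API response.
sample = {
    "events": [
        {"tournament": {"name": "Example Cup"}, "time": {"initial": 0}},
        {"tournament": None},
        {"tournament": {"name": "Example League"}},
    ]
}

for item in summarize_events(sample):
    print(item)
```

    Inside the spider, `parse` could then be reduced to `yield from summarize_events(response.json())`.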
    

    【Comments】:
