【Question title】: Scrapy and invalid cookie found in request
【Posted】: 2020-11-22 01:11:05
【Question】:

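Web scraping requirement

Scrape the titles of the events on the first page of the Eventbrite link here.

Approach

The page doesn't use much JavaScript and its pagination is simple, so scraping the title of every event on the page is quite easy and causes no problems.

However, I can see there is an API, and I want to re-engineer the HTTP request to it for efficiency and more structured data.

Problem

I am able to mimic the HTTP request with the requests Python package, using the correct headers, cookies and parameters. Unfortunately, when I use the same cookies with Scrapy, it complains about three keys in the cookie dictionary that have blank values: 'mgrefby': '', 'ebEventToTrack': '' and 'AN': ''. This happens even though they are also blank in the HTTP request sent with the requests package.

Requests package code example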

import requests

cookies = {
    'mgrefby': '',
    'G': 'v%3D2%26i%3Dbff2ee97-9901-4a2c-b5b4-5189c912e418%26a%3Dd24%26s%3D7a302cadca91b63816f5fd4a0a3939f9c9f02a09',
    'ebEventToTrack': '',
    'eblang': 'lo%3Den_US%26la%3Den-us',
    'AN': '',
    'AS': '50c57c08-1f5b-4e62-8626-ea32b680fe5b',
    'mgref': 'typeins',
    'client_timezone': '%22Europe/London%22',
    'csrftoken': '85d167cac78111ea983bcbb527f01d2f',
    'SERVERID': 'djc9',
    'SS': 'AE3DLHRwcfsggc-Hgm7ssn3PGaQQPuCJ_g',
    'SP': 'AGQgbbkgEVyrPOfb8QOLk2Q893Bkx6aqepKtFsfXUC9SW6rLrY3HzVmFa6m91qZ6rtJdG0PEVaIXdCuyQOL27zgxTHS-Pn0nHcYFr9nb_gcU1ayxSx4Y0QXLDvhxGB9EMsou1MZmIfEBN7PKFp_enhYD6HUP80-pNUGLI9R9_CrpFzXc48lp8jXiHog_rTjy_CHSluFrXr2blZAJfdC8g2lFpc4KN8wtSyOwn8qTs7di3FUZAJ9BfoA',
}

headers = {
    'Connection': 'keep-alive',
    'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Mobile Safari/537.36',
    'X-Requested-With': 'XMLHttpRequest',
    'X-CSRFToken': '85d167cac78111ea983bcbb527f01d2f',
    'Content-Type': 'application/json',
    'Accept': '*/*',
    'Origin': 'https://www.eventbrite.com',
    'Sec-Fetch-Site': 'same-origin',
    'Sec-Fetch-Mode': 'cors',
    'Sec-Fetch-Dest': 'empty',
    'Referer': 'https://www.eventbrite.com/d/ny--new-york/human-resources/?page=2',
    'Accept-Language': 'en-US,en;q=0.9',
}

data = '{"event_search":{"q":"human resources","dates":"current_future","places":\n["85977539"],"page":1,"page_size":20,"online_events_only":false,"client_timezone":"Europe/London"},"expand.destination_event":["primary_venue","image","ticket_availability","saves","my_collections","event_sales_status"]}'

response = requests.post('https://www.eventbrite.com/api/v3/destination/search/', headers=headers, cookies=cookies, data=data)

Scrapy code example

import scrapy


class TestSpider(scrapy.Spider):
    name = 'test'
    allowed_domains = ['eventbrite.com']
    start_urls = []

    cookies = {
        'mgrefby': '',
        'G': 'v%3D2%26i%3Dbff2ee97-9901-4a2c-b5b4-5189c912e418%26a%3Dd24%26s%3D7a302cadca91b63816f5fd4a0a3939f9c9f02a09',
        'ebEventToTrack': '',
        'eblang': 'lo%3Den_US%26la%3Den-us',
        'AN': '',
        'AS': '50c57c08-1f5b-4e62-8626-ea32b680fe5b',
        'mgref': 'typeins',
        'client_timezone': '%22Europe/London%22',
        'csrftoken': '85d167cac78111ea983bcbb527f01d2f',
        'SERVERID': 'djc9',
        'SS': 'AE3DLHRwcfsggc-Hgm7ssn3PGaQQPuCJ_g',
        'SP': 'AGQgbbkgEVyrPOfb8QOLk2Q893Bkx6aqepKtFsfXUC9SW6rLrY3HzVmFa6m91qZ6rtJdG0PEVaIXdCuyQOL27zgxTHS-Pn0nHcYFr9nb_gcU1ayxSx4Y0QXLDvhxGB9EMsou1MZmIfEBN7PKFp_enhYD6HUP80-pNUGLI9R9_CrpFzXc48lp8jXiHog_rTjy_CHSluFrXr2blZAJfdC8g2lFpc4KN8wtSyOwn8qTs7di3FUZAJ9BfoA',
    }

    headers = {
        'Connection': 'keep-alive',
        'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Mobile Safari/537.36',
        'X-Requested-With': 'XMLHttpRequest',
        'X-CSRFToken': '85d167cac78111ea983bcbb527f01d2f',
        'Content-Type': 'application/json',
        'Accept': '*/*',
        'Origin': 'https://www.eventbrite.com',
        'Sec-Fetch-Site': 'same-origin',
        'Sec-Fetch-Mode': 'cors',
        'Sec-Fetch-Dest': 'empty',
        'Referer': 'https://www.eventbrite.com/d/ny--new-york/human-resources/?page=1',
        'Accept-Language': 'en-US,en;q=0.9',
    }

    data = '{"event_search":{"q":"human resources","dates":"current_future","places":\n["85977539"],"page":1,"page_size":20,"online_events_only":false,"client_timezone":"Europe/London"},"expand.destination_event":["primary_venue","image","ticket_availability","saves","my_collections","event_sales_status"]}'


    def start_requests(self):
        url = 'https://www.eventbrite.com/api/v3/destination/search/'
        yield scrapy.Request(url=url, method='POST', headers=self.headers, cookies=self.cookies, callback=self.parse)

    def parse(self, response):
        print('request')

Output

2020-08-01 11:55:33 [scrapy.core.engine] INFO: Spider opened
2020-08-01 11:55:33 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-08-01 11:55:33 [test] INFO: Spider opened: test
2020-08-01 11:55:33 [scrapy.extensions.httpcache] DEBUG: Using filesystem cache storage in C:\Users\Aaron\projects\scrapy\eventbrite\.scrapy\httpcache
2020-08-01 11:55:33 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-08-01 11:55:33 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.eventbrite.com/robots.txt> (referer: None) ['cached']
2020-08-01 11:55:33 [scrapy.downloadermiddlewares.cookies] WARNING: Invalid cookie found in request <POST https://www.eventbrite.com/api/v3/destination/search/>: {'name': 'mgrefby', 'value': ''} ('value' is missing)
2020-08-01 11:55:33 [scrapy.downloadermiddlewares.cookies] WARNING: Invalid cookie found in request <POST https://www.eventbrite.com/api/v3/destination/search/>: {'name': 'ebEventToTrack', 'value': ''} ('value' is missing)
2020-08-01 11:55:33 [scrapy.downloadermiddlewares.cookies] WARNING: Invalid cookie found in request <POST https://www.eventbrite.com/api/v3/destination/search/>: {'name': 'AN', 'value': ''} ('value' is missing)   
2020-08-01 11:55:33 [scrapy.core.engine] DEBUG: Crawled (401) <POST https://www.eventbrite.com/api/v3/destination/search/> (referer: https://www.eventbrite.com/d/ny--new-york/human-resources/?page=1) ['cached']   
2020-08-01 11:55:33 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <401 https://www.eventbrite.com/api/v3/destination/search/>: HTTP status code is not handled or not allowed
2020-08-01 11:55:33 [scrapy.core.engine] INFO: Closing spider (finished)
2020-08-01 11:55:33 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1540,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 1,
 'downloader/request_method_count/POST': 1,
 'downloader/response_bytes': 32163,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 1,
 'downloader/response_status_count/401': 1,
 'elapsed_time_seconds': 0.187986,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2020, 8, 1, 10, 55, 33, 202931),
 'httpcache/hit': 2,
 'httperror/response_ignored_count': 1,
 'httperror/response_ignored_status_count/401': 1,
 'log_count/DEBUG': 3,
 'log_count/INFO': 12,
 'log_count/WARNING': 3,
 'response_received_count': 2,
 'robotstxt/request_count': 1,
 'robotstxt/response_count': 1,
 'robotstxt/response_status_count/200': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2020, 8, 1, 10, 55, 33, 14945)}
2020-08-01 11:55:33 [scrapy.core.engine] INFO: Spider closed (finished)

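Attempts to solve the problem

The 401 status seems to relate to authorization, so I can only assume the server doesn't like the cookies I am sending.

  1. I set COOKIES_ENABLED = True and got the same output as before
  2. I set COOKIES_DEBUG = True; see the output below

Output with COOKIES_DEBUG = True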

2020-08-01 12:05:15 [scrapy.core.engine] INFO: Spider opened
2020-08-01 12:05:15 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-08-01 12:05:15 [test] INFO: Spider opened: test
2020-08-01 12:05:15 [scrapy.extensions.httpcache] DEBUG: Using filesystem cache storage in C:\Users\Aaron\projects\scrapy\eventbrite\.scrapy\httpcache
2020-08-01 12:05:15 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-08-01 12:05:15 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.eventbrite.com/robots.txt> (referer: None) ['cached']
2020-08-01 12:05:15 [scrapy.downloadermiddlewares.cookies] WARNING: Invalid cookie found in request <POST https://www.eventbrite.com/api/v3/destination/search/>: {'name': 'mgrefby', 'value': ''} ('value' is missing)
2020-08-01 12:05:15 [scrapy.downloadermiddlewares.cookies] WARNING: Invalid cookie found in request <POST https://www.eventbrite.com/api/v3/destination/search/>: {'name': 'ebEventToTrack', 'value': ''} ('value' is missing)
2020-08-01 12:05:15 [scrapy.downloadermiddlewares.cookies] WARNING: Invalid cookie found in request <POST https://www.eventbrite.com/api/v3/destination/search/>: {'name': 'AN', 'value': ''} ('value' is missing)   
2020-08-01 12:05:15 [scrapy.downloadermiddlewares.cookies] DEBUG: Sending cookies to: <POST https://www.eventbrite.com/api/v3/destination/search/>
Cookie: G=v%3D2%26i%3Dbff2ee97-9901-4a2c-b5b4-5189c912e418%26a%3Dd24%26s%3D7a302cadca91b63816f5fd4a0a3939f9c9f02a09; eblang=lo%3Den_US%26la%3Den-us; AS=50c57c08-1f5b-4e62-8626-ea32b680fe5b; mgref=typeins; client_timezone=%22Europe/London%22; csrftoken=85d167cac78111ea983bcbb527f01d2f; SERVERID=djc9; SS=AE3DLHRwcfsggc-Hgm7ssn3PGaQQPuCJ_g; SP=AGQgbbkgEVyrPOfb8QOLk2Q893Bkx6aqepKtFsfXUC9SW6rLrY3HzVmFa6m91qZ6rtJdG0PEVaIXdCuyQOL27zgxTHS-Pn0nHcYFr9nb_gcU1ayxSx4Y0QXLDvhxGB9EMsou1MZmIfEBN7PKFp_enhYD6HUP80-pNUGLI9R9_CrpFzXc48lp8jXiHog_rTjy_CHSluFrXr2blZAJfdC8g2lFpc4KN8wtSyOwn8qTs7di3FUZAJ9BfoA

2020-08-01 12:05:15 [scrapy.downloadermiddlewares.cookies] DEBUG: Received cookies from: <401 https://www.eventbrite.com/api/v3/destination/search/>
Set-Cookie: SP=AGQgbbno_KHLNiLzDpLHcdI4kotUbRiTxMMY5N0t7VudPU_QGCm2Q0nH7-J99aoRZvGLxXfREH5YfPAtK52iiiLcEpnjh1G43ZBxKuo9qvJHykLV23ZIjaFK0iIr6ptOaczMoQhkaqE-7nJ8t2Ykt18CN196pKZ5QhFuXy6SnspZ0toEGChZcQgmrAAAVPfuoiiUmbTG_wJC8_KikL2sYl2s6-KWUOOpjRFJCko5RGgiyC2Osu9vxZ8; Domain=.eventbrite.com; httponly; Path=/; secure

Set-Cookie: G=v%3D2%26i%3D5cebebd2-2a7f-4638-9912-0abf19111a0c%26a%3Dd33%26s%3Df967e32d15dda2f06b392f22451af935d93f88d1; Domain=.eventbrite.com; expires=Sat, 31-Jul-2021 22:46:28 GMT; httponly; Path=/; secure     

Set-Cookie: ebEventToTrack=; Domain=.eventbrite.com; expires=Sun, 30-Aug-2020 22:46:28 GMT; httponly; Path=/; secure

Set-Cookie: SS=AE3DLHRgTIL46n9XiOZiJRSkccGnNXSMkA; Domain=.eventbrite.com; httponly; Path=/; secure

Set-Cookie: eblang=lo%3Den_US%26la%3Den-us; Domain=.eventbrite.com; expires=Sat, 31-Jul-2021 22:46:28 GMT; httponly; Path=/; secure

Set-Cookie: AN=; Domain=.eventbrite.com; expires=Sun, 30-Aug-2020 22:46:28 GMT; httponly; Path=/; secure

Set-Cookie: AS=350def0c-ed27-45ab-b12c-02e9fb68a8ae; Domain=.eventbrite.com; httponly; Path=/; secure

Set-Cookie: SERVERID=djc44; path=/; HttpOnly; Secure

2020-08-01 12:05:15 [scrapy.core.engine] DEBUG: Crawled (401) <POST https://www.eventbrite.com/api/v3/destination/search/> (referer: https://www.eventbrite.com/d/ny--new-york/human-resources/?page=1) ['cached']
2020-08-01 12:05:15 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <401 https://www.eventbrite.com/api/v3/destination/search/>: HTTP status code is not handled or not allowed
2020-08-01 12:05:15 [scrapy.core.engine] INFO: Closing spider (finished)
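  1. I tried a custom Scrapy cookie downloader middleware for cookie persistence, but got the same error as before
  2. I considered using browser automation to grab the cookies, but I am having second thoughts because I don't want to keep fetching cookies.

What I don't understand is that the requests Python package returns the JSON response when given the same cookies, headers and parameters. Scrapy, on the other hand, complains about the blank dictionary values.

I would be grateful if someone could point out an obvious mistake I have made, or explain why cookies that the API endpoint accepts through requests don't seem to work in Scrapy.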

【Comments】:

    Tags: python web-scraping scrapy


    【Solution 1】:

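    It looks like they are checking "not value" rather than the more accurate "value is not None". Opening an issue is your only long-term fix, but subclassing the cookie middleware is a short-term, non-hacky workaround.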
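To make the difference concrete, here is a minimal sketch of the two checks applied to a cookie dict. The function names keep_if_truthy and keep_if_not_none are hypothetical, and this is not Scrapy's actual code; it only models the behaviour seen in the log, where the empty-valued cookies are dropped before the Cookie header is built.

```python
# Hypothetical model of the two validation styles; not Scrapy's real implementation.

def keep_if_truthy(cookies):
    """Mimics a `not value` check: empty strings are rejected along with None."""
    return {name: value for name, value in cookies.items() if value}


def keep_if_not_none(cookies):
    """Mimics a `value is not None` check: only a missing value is rejected."""
    return {name: value for name, value in cookies.items() if value is not None}


# "broken" is an illustrative entry with no value at all.
cookies = {"mgrefby": "", "AN": "", "mgref": "typeins", "broken": None}

truthy = keep_if_truthy(cookies)
not_none = keep_if_not_none(cookies)

print(sorted(truthy))    # ['mgref']
print(sorted(not_none))  # ['AN', 'mgref', 'mgrefby']
```

With the truthiness check, blank cookies such as AN disappear; with the None check they survive, which matches what the requests package sends.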

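    A hacky fix is to exploit the fact that they don't properly escape cookie values when doing the '; '.join(), so you can set the cookie's value to a legal cookie directive (I chose HttpOnly, since you don't care about JS). The cookiejar appears to discard it, yielding the actual value you care about: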

    >>> from scrapy.downloadermiddlewares.cookies import CookiesMiddleware
    >>> from scrapy.http import Request
    >>> cm = CookiesMiddleware(debug=True)
    >>> req = Request(url='https://www.example.com', cookies={'AN': '; HttpOnly', 'alpha': 'beta'})
    >>> cm.process_request(req, spider=None)
    2020-08-01 15:08:58 [scrapy.downloadermiddlewares.cookies] DEBUG: Sending cookies to: <GET https://www.example.com>
    Cookie: AN=; alpha=beta
    >>> req.headers
    {b'Cookie': [b'AN=; alpha=beta']}
    

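    【Comments】:

    • Thanks for the thoughtful reply! Is there a difference between "if not" and "if is not None" when evaluating an empty string? Sorry for my lack of Python knowledge! I know that empty strings are falsy. I should have looked at the logging and where it comes from in the codebase, which would have got me closer! I will look at subclassing the cookie middleware. I wonder why the cookiejar seems to discard HttpOnly?
    • "Is there a difference between if not and if is not None when evaluating an empty string?" In the time it took to type that comment you could have tried it in your Python interpreter, but yes, of course there is; otherwise there would be no need for syntax to express that check. "if not foo" relies on what Python calls "truthiness", and it matches None, the empty string, the empty list and, confusingly, any object that implements def __len__ (when it returns 0) or one of a bunch of other magic methods
    • "I wonder why the cookiejar seems to discard HttpOnly?" Because it is a directive for browsers, and the cookiejar is not one of them, so there is no value in retaining that information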
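The truthiness point in the comments above can be checked with a short sketch. EmptyBox is a hypothetical class used only to show the __len__ fallback.

```python
class EmptyBox:
    """Defines no __bool__, so Python falls back to __len__ for truthiness."""

    def __len__(self):
        return 0


samples = {
    "None": None,
    "empty str": "",
    "empty list": [],
    "zero": 0,
    "EmptyBox": EmptyBox(),
    "text": "x",
}

# Values that an `if not value` check would reject.
rejected_by_not = [label for label, v in samples.items() if not v]
# Values that an `if value is None` check would reject.
rejected_by_is_none = [label for label, v in samples.items() if v is None]

print(rejected_by_not)      # ['None', 'empty str', 'empty list', 'zero', 'EmptyBox']
print(rejected_by_is_none)  # ['None']
```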
    【Solution 2】:

    Adding to Mdaniel's reply: I have opened an issue, since we are running into the same problem, and I referenced your Stack Overflow thread in it.

    Our current solution is to use an older version of Scrapy (2.2.0 or lower), because the latest release, 2.3.0, is where this cookie check was added. https://github.com/scrapy/scrapy/commit/f6ed5edc31e7cc66225c0860e1534a6230511954 scrapy/downloadermiddlewares/cookies.py line 78
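As a rough sketch of how to tell whether an environment predates that check, the helper below compares a version string against 2.3.0, the release named above. The name predates_cookie_check is hypothetical, and the parsing assumes a plain numeric version such as "2.2.0" (no pre-release suffix).

```python
def predates_cookie_check(version):
    """Return True if `version` is older than 2.3.0, the release named in the answer."""
    parts = tuple(int(p) for p in version.split(".")[:3])
    return parts < (2, 3, 0)


print(predates_cookie_check("2.2.0"))  # True
print(predates_cookie_check("2.3.0"))  # False

# Assumed usage with an installed Scrapy (not executed here):
#   import scrapy
#   print(predates_cookie_check(scrapy.__version__))
```

Pinning the dependency, for example with pip install "scrapy<2.3", is one way to stay on an older release in the meantime.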

    Here is the issue, in case you want to add anything I missed: https://github.com/scrapy/scrapy/issues/4766

    【Comments】:

    • Ah, that is something I was going to do!