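【Posted】:2021-06-26 20:55:30
【Question】:
Sending a POST request with Scrapy's FormRequest results in a 400 error, while the same request made with Python Requests succeeds.
The request headers and params should not be the problem, since they work with Requests. What in Scrapy could be breaking this?
The following code was run in the scrapy shell: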
url = 'https://www.tripadvisor.co.uk/ShowUserReviews-g2151208-d19219570-r792748373-Tumanyan_Khinkali_at_Tsaghkadzor-Tsakhkadzor_Kotayk_Province.html'
headers = {
    'authority': 'www.tripadvisor.co.uk',
    'method': 'POST',
    'scheme': 'https',
    'accept': 'text/html, */*',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en-US,en;q=0.9',
    'cache-control': 'no-cache',
    'content-length': '102',
    'content-type': 'application/x-www-form-urlencoded; charset=UTF-8',
    'dnt': '1',
    'origin': 'https://www.tripadvisor.co.uk',
    'pragma': 'no-cache',
    'sec-ch-ua-mobile': '?0',
    'sec-fetch-dest': 'empty',
    'sec-fetch-mode': 'cors',
    'sec-fetch-site': 'same-origin',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36',
    'x-requested-with': 'XMLHttpRequest',
}
params = {
    'returnTo': '#REVIEWS',
    'filterLang': 'ALL',
    'changeSet': 'REVIEW_LIST'
}
Scrapy FormRequest returns a 400 error:
In [10]: req = scrapy.http.FormRequest(
...: url,
...: method='POST',
...: formdata=params,
...: headers=headers)
In [11]: fetch(req)
2021-06-26 21:28:18 [scrapy.core.engine] DEBUG: Crawled (400) <POST https://www.tripadvisor.co.uk/ShowUserReviews-g2151208-d19219570-r792748373-Tumanyan_Khinkali_at_Tsaghkadzor-Tsakhkadzor_Kotayk_Province.html> (referer: None)
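The `headers` dict above looks like it was copied from browser dev tools, so it carries entries that describe one particular browser request rather than the request being built here: `authority`, `method` and `scheme` mirror HTTP/2 pseudo-headers, and `content-length` is fixed at 102. A minimal sketch of dropping such entries before constructing the request, assuming the HTTP client fills in the length itself. `strip_request_specific` and `REQUEST_SPECIFIC` are hypothetical names, not part of Scrapy or Requests, and whether this alone removes the 400 has not been verified.

```python
# Hypothetical helper for illustration; not part of Scrapy or Requests.
# Header names that describe one captured request, not the one being built.
REQUEST_SPECIFIC = {"authority", "method", "scheme", "content-length"}


def strip_request_specific(headers):
    """Return a copy of `headers` without per-request entries."""
    return {
        name: value
        for name, value in headers.items()
        # Dev tools may show pseudo-headers with a leading colon.
        if name.lower().lstrip(":") not in REQUEST_SPECIFIC
    }


# A shortened version of the headers dict from the question.
captured = {
    "authority": "www.tripadvisor.co.uk",
    "method": "POST",
    "scheme": "https",
    "content-length": "102",
    "content-type": "application/x-www-form-urlencoded; charset=UTF-8",
    "x-requested-with": "XMLHttpRequest",
}

cleaned = strip_request_specific(captured)
print(sorted(cleaned))
```

The cleaned dict could then be passed as `headers=` in place of the original one.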
Python Requests returns 200, and I can access the content:
In [17]: r = requests.post(url=url, headers=headers, json=params)
2021-06-26 21:30:02 [urllib3.connectionpool] DEBUG: Starting new HTTPS connection (1): www.tripadvisor.co.uk:443
2021-06-26 21:30:04 [urllib3.connectionpool] DEBUG: https://www.tripadvisor.co.uk:443 "POST /ShowUserReviews-g2151208-d19219570-r792748373-Tumanyan_Khinkali_at_Tsaghkadzor-Tsakhkadzor_Kotayk_Province.html HTTP/1.1" 200 16360
In [18]: r.status_code
Out[18]: 200
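The two calls are not byte-for-byte the same request: `formdata=` produces a form-encoded body, while `json=` makes Requests serialize the dict as JSON. A stdlib-only sketch that approximates both bodies and compares their sizes with the hard-coded `content-length` follows. It assumes simple ASCII values, so `urlencode` and `json.dumps` stand in for what the two libraries do internally, and `declared_length` is just the value copied from the headers.

```python
import json
from urllib.parse import urlencode

params = {
    "returnTo": "#REVIEWS",
    "filterLang": "ALL",
    "changeSet": "REVIEW_LIST",
}
# Value of the hard-coded 'content-length' header in the question.
declared_length = 102

# Approximation of the body a form-encoded POST would carry.
form_body = urlencode(params)
# Approximation of the body sent when the dict is passed as JSON.
json_body = json.dumps(params)

for label, body in (("form", form_body), ("json", json_body)):
    size = len(body.encode("utf-8"))
    print(f"{label}: {size} bytes, matches declared length: {size == declared_length}")
```

If neither size matches the declared one, the fixed `content-length` is a candidate worth ruling out before looking elsewhere.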
【Comments】:
Tags: python web-scraping python-requests scrapy