【Posted】: 2019-11-10 18:14:02
【Question】:
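This is the starting point of my scraping process.
https://www.storiaimoveis.com.br/alugar/brasil
This is the AJAX call, which returns the data for each page as JSON.
My POST request fails with a 404 error. Requests that need a payload have given me trouble in the past. I always managed to work around the problem somehow, but now I am trying to understand what I am doing wrong with them.
My questions are: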
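- Does the request payload sent with a Scrapy request need to be of a specific type or format?
- Do I need to call json.dumps(payload) before sending, or can I send the payload as a dict?
- Do I need to convert every key-value pair to a string before sending the payload?
- Could there be any other reason why my request fails?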
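As a point of reference for the first three questions: Scrapy's `Request` takes its `body` as `str` or `bytes`, so a dict has to be serialized first. The following is a minimal sketch using only the standard library; the shortened `payload` is a made-up stand-in for the real one. It illustrates that `json.dumps` produces a single string and that numbers, lists and nested objects survive the round trip without being converted to strings by hand.

```python
import json

# Hypothetical, shortened stand-in for the real search payload.
payload = {"operation": ["RENT"], "bedrooms": [], "geo": {"lat": 5.2717863}}

# Serialize once: the result is one JSON string, suitable as a request body.
body = json.dumps(payload)

# On the wire the body is bytes; UTF-8 is the usual encoding for JSON.
body_bytes = body.encode("utf-8")

# Decoding gives back the same structure, with value types preserved.
restored = json.loads(body_bytes)
```

The individual values are left as they are: the `Content-Type: application/json` header tells the server to parse the body as JSON, so numeric and list values keep their types.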
Here is the relevant part of my code.
import json

import scrapy
from scrapy.spiders import CrawlSpider


class MySpider(CrawlSpider):
    name = 'myspider'
    start_urls = [
        'https://www.storiaimoveis.com.br/api/search?fields=%24%24meta.geo.postalCodeAddress.city%2C%24%24meta.geo.postalCodeAddress.neighborhood%2C%24%24meta.geo.postalCodeAddress.street%2C%24%24meta.location%2C%24%24meta.created%2Caddress.number%2Caddress.postalCode%2Caddress.neighborhood%2Caddress.state%2Cmedia%2ClivingArea%2CtotalArea%2Ctypes%2Coperation%2CsalePrice%2CrentPrice%2CnewDevelopment%2CadministrationFee%2CyearlyTax%2Caccount.logoUrl%2Caccount.name%2Caccount.id%2Caccount.creci%2Cgarage%2Cbedrooms%2Csuites%2Cbathrooms%2Cref&optimizeMedia=true&size=20&from=0&sessionId=5ff29d7e-88d0-54d5-2641-e203cafd6f4e'
    ]
    page = 1
    # Search filter sent as the JSON body of the POST request.
    payload = {
        "locations": [{
            "geo": {
                "top_left": {"lat": 5.2717863, "lon": -73.982817},
                "bottom_right": {"lat": -34.0891, "lon": -28.650543},
            },
            "placeId": "ChIJzyjM68dZnAARYz4p8gYVWik",
            "keywords": "Brasil",
            "address": {"label": "Brasil", "country": "BR"},
        }],
        "operation": ["RENT"],
        "bathrooms": [],
        "bedrooms": [],
        "garage": [],
        "features": [],
    }
    headers = {
        'Accept': 'application/json',
        'Content-Type': 'application/json',
        'Referer': 'https://www.storiaimoveis.com.br/alugar/brasil',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'
    }

    def parse(self, response):
        # Re-issue each start URL as a POST carrying the JSON payload.
        for url in self.start_urls:
            yield scrapy.Request(url=url,
                                 method='POST',
                                 headers=self.headers,
                                 body=json.dumps(self.payload),
                                 callback=self.parse_items)

    def parse_items(self, response):
        # Open an interactive shell on the response for debugging.
        from scrapy.shell import inspect_response
        inspect_response(response, self)
        print(response.text)
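One thing that may be worth checking in the spider above: by default Scrapy fetches every entry in `start_urls` with a plain GET, and `parse` only runs if that GET succeeds, so the POST may never be sent at all. A possible alternative is to build the POST inputs up front and yield them from `start_requests`. The sketch below covers only the building step with the standard library. `build_page_request` is a hypothetical helper, the `fields` list is shortened, `sessionId` is left out, and treating `from` as a zero-based item offset is an assumption based on the `size=20&from=0` query string.

```python
import json
from urllib.parse import urlencode, urlsplit, parse_qs

# Endpoint path taken from the start URL in the question.
API_URL = "https://www.storiaimoveis.com.br/api/search"


def build_page_request(page, payload, size=20, fields="media,rentPrice"):
    """Return (url, body) for a 1-based page number.

    Assumes `from` is a zero-based item offset (page 1 -> 0, page 2 -> 20).
    """
    query = urlencode({
        "fields": fields,          # shortened; commas are encoded as %2C
        "optimizeMedia": "true",
        "size": size,
        "from": (page - 1) * size,
    })
    body = json.dumps(payload)     # JSON string for the POST body
    return f"{API_URL}?{query}", body


# Example: inputs for the third page of rental listings.
url, body = build_page_request(3, {"operation": ["RENT"]})
```

Inside a spider, each pair could then be yielded from `start_requests` as `scrapy.Request(url, method='POST', headers=self.headers, body=body, callback=self.parse_items)`, which avoids the initial GET to the API endpoint.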
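【Comments】:
- Try to explain the steps for performing the search manually, starting from the initial URL, and how you tried to build the URL that the script uses.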
Tags: python ajax web-scraping request scrapy