【Question Title】:Scrapy - Using the Content-Length header in the Request
【Posted】:2019-11-21 11:25:16
【Question】:

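This is the page I want to scrape, and this is the AJAX request that retrieves the data.

I built the same AJAX request with identical headers and the same request payload. The request does not fail, but it returns an almost empty JSON with no data in it.

The response to the AJAX request is a JSON document in which one key holds another JSON document encoded as a string. Because the output is large, I suspected the problem was related to the Content-Length header. When I send the Content-Length header, the request fails with 400 Bad Request; when I leave it out, the request succeeds but returns no data.

How can I build a request to this URL that returns valid data?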

import scrapy
from scrapy.spiders import CrawlSpider


class MySpider(CrawlSpider):

    name = 'myspider'

    start_urls = [
        'https://www.propertyqueen.com.my/Search/SearchPropertyMarker'
    ]

    headers = {
        'Accept': 'application/json, text/javascript, */*; q=0.01',
        'Accept-Encoding': 'gzip, deflate, br',
        'Host': 'www.propertyqueen.com.my',
        'Origin': 'https://www.propertyqueen.com.my',
        #'Content-Length': 689,
        'X-Requested-With': 'XMLHttpRequest',
        'Content-Type': 'application/json; charset=UTF-8',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36',
        'Referer': 'https://www.propertyqueen.com.my/for-sale?searchtext=',
        'Cookie': '_ga=GA1.3.513681266.1562266208; ASP.NET_SessionId=utadmp0lcxiobehzff5xpzyl; _gid=GA1.3.1978049576.1562853910; _gat=1',
    }

    payload = '{"SearchTextDisplay":"","SearchText":"","PropertyName":null,"State":"","City":"","PriceMin":50,"PriceMax":1000000,"BuildUpAreaMin":50,"BuildUpAreaMax":200000,"LandAreaMin":0,"LandAreaMax":1000000000000,"CosfMin":200,"CosfMax":1200,"PropertyFor":"ForSale","ListType":"","PropertyType":"-1","Bedroom":-1,"Bathroom":-1,"Carparking":-1,"Finishing":"-1","Furnishing":null,"Tenure":"-1","PropertyAge":"-1","FloorLebel":"-1","PageNo":1,"PageSize":10,"OpenTab":"","MinLat":0,"MaxLat":0,"MinLng":0,"MaxLng":0,"SortBy":"-1","zoom":0,"like":false,"suggestionrequired":false,"latitude":0,"longitude":0,"LandTitle":null,"CompletionYear":null,"TotalLotsUnit":null,"RentType":null,"PreferredTenant":null}'

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(
                url=url,
                method='POST',
                headers=self.headers,
                body=self.payload,
                callback=self.parse_items
            )

    def parse_items(self, response):
        # response.text is already a decoded str, so print it directly (Python 3)
        print(response.text)

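【Comments】:

  • Maybe you need to set the Content-Length to the length of the payload? Forgive my limited knowledge of HTTP. (The payload in the sample is 689 long, not 775.)
  • @CalderWhite I tried the correct value and later changed it to other values. I will edit the question to avoid confusion. However, the Content-Length header does not work with any value.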
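
The Content-Length discussed in these comments has to equal the number of bytes in the request body, so a value copied from the browser only fits the exact body the browser sent. The sketch below is a minimal illustration under assumptions: `sample_payload` is a shortened, hypothetical payload and `content_length` is an illustrative helper, neither is part of the original spider. As far as I know, Scrapy's downloader derives this header from `body` on its own, so leaving it out of `headers` is usually the safer choice.

```python
import json


def content_length(body: str, encoding: str = "utf-8") -> int:
    """Return the number of bytes the body occupies once encoded."""
    return len(body.encode(encoding))


# Hypothetical, shortened payload used only for illustration.
sample_payload = {"PropertyFor": "ForSale", "PageNo": 1, "PageSize": 10}

# Compact separators produce JSON without spaces, like the payload string
# in the spider.
body = json.dumps(sample_payload, separators=(",", ":"))

print(body)
print(content_length(body))

# Non-ASCII characters take more than one byte in UTF-8, so the character
# count of a string can be smaller than its Content-Length.
print(len("é"), content_length("é"))
```

Any edit to the payload (for example changing `PriceMin`) changes the byte count, which is one plausible reason a hard-coded value leads to 400 Bad Request.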

Tags: python ajax http request scrapy


【Solution 1】:

I modified the spider slightly, and this version returned results for me.

from scrapy.spiders import Spider
from scrapy import Request


class MySpider(Spider):

    name = 'myspider'

    start_urls = [
        'https://www.propertyqueen.com.my/Search/SearchPropertyMarker'
    ]

    headers = {
        'Origin': 'https://www.propertyqueen.com.my',
        'Accept-Encoding': 'gzip, deflate, br',
        'Accept-Language': 'en-GB,en;q=0.9,nl-BE;q=0.8,nl;q=0.7,ro-RO;q=0.6,ro;q=0.5,en-US;q=0.4',
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36',
        'Content-Type': 'application/json; charset=UTF-8',
        'Accept': 'application/json, text/javascript, */*; q=0.01',
        'X-Requested-With': 'XMLHttpRequest',
        'Referer': 'https://www.propertyqueen.com.my/for-sale',
    }

    payload = '{"SearchTextDisplay":"","SearchText":"","PropertyName":null,"State":"","City":"","PriceMin":50000,"PriceMax":100000000,"BuildUpAreaMin":50,"BuildUpAreaMax":200000,"LandAreaMin":0,"LandAreaMax":1000000000000,"CosfMin":200,"CosfMax":1200,"PropertyFor":"ForSale","ListType":"","PropertyType":"-1","Bedroom":-1,"Bathroom":-1,"Carparking":-1,"Finishing":"-1","Furnishing":null,"Tenure":"-1","PropertyAge":"-1","FloorLebel":"-1","PageNo":1,"PageSize":10,"OpenTab":"","MinLat":0,"MaxLat":0,"MinLng":0,"MaxLng":0,"SortBy":"-1","zoom":0,"like":false,"suggestionrequired":false,"latitude":0,"longitude":0,"LandTitle":null,"CompletionYear":null,"TotalLotsUnit":null,"RentType":null,"PreferredTenant":null}'

    def start_requests(self):
        for url in self.start_urls:
            yield Request(
                url=url,
                method='POST',
                headers=self.headers,
                body=self.payload,
                callback=self.parse_items
            )

    def parse_items(self, response):
        # response.text is already a decoded str, so print it directly (Python 3)
        print(response.text)

I used a plain Spider instead of CrawlSpider and left the Cookie header out of the headers.
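
When headers are copied from the browser's network tab, the same clean-up can be done programmatically. This is only a sketch: `clean_headers` and `MANAGED_HEADERS` are hypothetical names, and the choice to drop `Host` and `Content-Length` along with `Cookie` is an assumption based on the headers that are absent from the spider above.

```python
# Headers that are tied to one browser session or that the HTTP client
# normally sets itself (an assumption, adjust to the target site).
MANAGED_HEADERS = {"cookie", "host", "content-length"}


def clean_headers(browser_headers: dict) -> dict:
    """Return a copy without session-bound or client-managed headers."""
    return {
        name: value
        for name, value in browser_headers.items()
        if name.lower() not in MANAGED_HEADERS
    }


# A few headers as they might be copied from the browser.
copied = {
    "Accept": "application/json, text/javascript, */*; q=0.01",
    "Host": "www.propertyqueen.com.my",
    "Content-Type": "application/json; charset=UTF-8",
    "X-Requested-With": "XMLHttpRequest",
    "Cookie": "ASP.NET_SessionId=<session id>",  # placeholder value
}

print(clean_headers(copied))
```

The result can be passed as `headers=` to the request in `start_requests`.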

【Comments】:

  • I still get an empty JSON with these headers. No data comes back.
  • I have now included the complete code I used to test it; it returns a full JSON for me.
  • {"type":"state","zoom":0,"count":"0","result":"[]","htmlresult":"[]","agentListOnExpertise":"[]","propertySuggestion":"[]","criteriaState":"","criteriaCity":""} This is the output I get. The htmlresult key should contain data, but it does not.
  • Could you try running it from a different IP? I get the correct output: {"type":"state","zoom":0,"count":"41620","result":"[{\\"State\\":\\"Kedah\\",\\"StateCount\\":252, ..., "htmlresult":"[{\\"Id\\":86755,\\"PropertyName\\":\\"Vertiq\\".
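
Because the response nests JSON inside string values, it needs two decoding passes before an empty result can be told apart from a full one. A minimal sketch, assuming the response shape quoted in the comments above; `decode_search_response` is an illustrative helper, not part of either spider.

```python
import json


def decode_search_response(text: str) -> dict:
    """Decode the outer JSON, then the keys that hold JSON as strings."""
    data = json.loads(text)
    for key in ("result", "htmlresult"):
        # These values arrive as JSON-encoded strings, e.g. "[]".
        if isinstance(data.get(key), str):
            data[key] = json.loads(data[key])
    return data


# The empty response quoted in the comments above.
empty = (
    '{"type":"state","zoom":0,"count":"0","result":"[]","htmlresult":"[]",'
    '"agentListOnExpertise":"[]","propertySuggestion":"[]",'
    '"criteriaState":"","criteriaCity":""}'
)

decoded = decode_search_response(empty)
print(int(decoded["count"]), decoded["htmlresult"])  # prints: 0 []
```

Inside `parse_items`, the same helper could be called with `response.text`, and a `count` of 0 would flag the empty response described by the asker.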