【Question title】: Scrapy returns 400-error when trying to scrape Ajax call page
【Posted】: 2020-08-29 10:34:07
【Question】:

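I am trying to scrape https://wegotthiscovered.com/reviews/, which uses Ajax pagination. I have tried everything I could think of, but the request either returns nothing or returns HTTP status code 400. Can anyone help me solve this?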

import json
import scrapy
from ..items import xyzItem

class MySpider(scrapy.Spider):
    name = 'abc'
    data = {"id":"infinite_scroll_1","order":"","orderby":"","catnames":"reviews","postnotin":"900303,899404,898188,897386,896672,893944,895290,895136,892571,892412,891795,887847","timestampbefore":'1589354802'}
    headers = {"content-type": "application/json"}
    url = 'https://wegotthiscovered.com/wp-admin/admin-ajax.php'

    def start_requests(self):
        yield scrapy.Request(
            url=self.url,
            method='POST',
            body=json.dumps(self.data),
            headers=self.headers,
            meta={'index': 0}
        )

    def parse(self, response):
        items = xyzItem()
        i = 1
        movie_title = response.css('h4').css('::text').getall() 
        # movie_text = response.css('.summary').xpath('text()').getall() 
        movie_id = response.css('h4').css('::attr(href)').getall()   


        li = items['movie_title']
        for i in range(len(li)):
            li_split =  li[i].split(" ")
            #print(movietitle)
            #if 'Review:' in li_split or 'review:' in li_split or 'Review' in li_split or 'review' in li_split:
            outputs = DeccanchronicleItem()
            outputs['page_title'] = li[i]
            # outputs['review_content'] = items['movie_text'][i]
            outputs['review_link'] = items['movie_id'][i]
            yield outputs

        page = response.meta['index'] + 1
        self.data['index'] = page
        yield scrapy.Request(self.url, headers=self.headers, method='POST', body=json.dumps(self.data), meta={'index': page})

【Question discussion】:

    Tags: python python-3.x python-2.7 web-scraping scrapy


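    【Solution 1】:

    The main problem with your code is that you are not sending the right kind of request: your spider posts a raw JSON body, whereas the code below sends a form-encoded POST with scrapy.FormRequest, along with the x-requested-with and referer headers.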

    import json
    import scrapy
    from ..items import xyzItem

    class MySpider(scrapy.Spider):
        name = 'wegotthiscovered'
        data = {
            "id":"infinite_scroll_1",
            "order":"",
            "orderby":"",
            "catnames":"reviews",
            "postnotin":"900303,899404,898188,897386,896672,893944,895290,895136,892571,892412,891795,887847",
            "timestampbefore":'1589363845'
        }
        headers = {
            "content-type": "application/x-www-form-urlencoded; charset=UTF-8",
            "x-requested-with": "XMLHttpRequest",
            'referer': "https://wegotthiscovered.com/reviews/",
        }
        url = 'https://wegotthiscovered.com/wp-admin/admin-ajax.php'
        start_urls = ['https://wegotthiscovered.com/reviews/'] # I used this to get cookies BEFORE POST request
    
        def parse(self, response):
            yield scrapy.FormRequest(
                url=self.url,
                method='POST',
                callback=self.parse_search,
                formdata={
                    'page': '2',
                    'action': 'face3_infinite_scroll',
                    'attrs': json.dumps(self.data),
                },
                headers=self.headers,
                meta={'index': 0}
            )
    
        def parse_search(self, response):
            # NOTE: the body is JSON, so these selectors may match nothing until
            # the "html" part is extracted from it (see the note after this code).
            movie_title = response.css('h4').css('::text').getall()
            # movie_text = response.css('.summary').xpath('text()').getall()
            movie_id = response.css('h4').css('::attr(href)').getall()
            if not movie_title:
                return  # nothing parsed, so stop instead of paging forever

            # Pair each title with its link.
            for title, link in zip(movie_title, movie_id):
                # Assumes xyzItem declares page_title and review_link fields.
                outputs = xyzItem()
                outputs['page_title'] = title
                # outputs['review_content'] = movie_text[i]
                outputs['review_link'] = link
                yield outputs

            # Assumption: later pages use the same form fields with 'page' incremented.
            page = response.meta['index'] + 1
            yield scrapy.FormRequest(
                url=self.url,
                callback=self.parse_search,
                formdata={
                    'page': str(page + 2),
                    'action': 'face3_infinite_scroll',
                    'attrs': json.dumps(self.data),
                },
                headers=self.headers,
                meta={'index': page},
            )
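
To make the difference between the two request bodies concrete, here is a minimal standard-library sketch. The first helper produces the kind of body the spider in the question sends (one raw JSON document); the second produces roughly the form-encoded shape that the scrapy.FormRequest call above builds, where the JSON string travels inside the attrs field. The helper names build_json_body and build_form_body are hypothetical, and the postnotin list is shortened for readability.

```python
import json
from urllib.parse import parse_qs, urlencode

# Same attributes as in the spider above ("postnotin" shortened for readability).
ATTRS = {
    "id": "infinite_scroll_1",
    "order": "",
    "orderby": "",
    "catnames": "reviews",
    "postnotin": "900303,899404",
    "timestampbefore": "1589363845",
}


def build_json_body(attrs):
    """Body in the style of the question's spider: one raw JSON document."""
    return json.dumps(attrs)


def build_form_body(attrs, page):
    """Form-encoded body: three fields, with the JSON string nested under 'attrs'."""
    return urlencode({
        "page": str(page),
        "action": "face3_infinite_scroll",
        "attrs": json.dumps(attrs),
    })


if __name__ == "__main__":
    print(build_json_body(ATTRS))
    form_body = build_form_body(ATTRS, 2)
    print(form_body)
    # Decoding the form body shows the three separate fields again.
    print(sorted(parse_qs(form_body)))
```

Printing both bodies side by side makes it easy to compare them with what the browser sends in its network tab when the page loads more reviews.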
    

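    By the way, your parsing part will not work, because you need to handle a JSON response (and parse the "html" part from it).

    Update: everything works (the HTML contains the movie list):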
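
As a rough illustration of that last point, the sketch below decodes a JSON body and then parses the HTML fragment stored under its "html" key, using only the standard library. It assumes that "html" is a top-level key and that each review title is a link inside an h4 element; the sample payload is made up, and the names HeadingLinkParser and extract_reviews are hypothetical. Inside a spider, the same fragment could instead be passed to scrapy.Selector(text=...) so that the existing CSS selectors keep working.

```python
import json
from html.parser import HTMLParser


class HeadingLinkParser(HTMLParser):
    """Collect (title, href) pairs for links found inside <h4> elements."""

    def __init__(self):
        super().__init__()
        self.in_h4 = False
        self.current_href = None
        self.current_text = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "h4":
            self.in_h4 = True
        elif tag == "a" and self.in_h4:
            self.current_href = dict(attrs).get("href")
            self.current_text = []

    def handle_data(self, data):
        # Only keep text while we are inside a link that has an href.
        if self.current_href is not None:
            self.current_text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self.current_href is not None:
            title = "".join(self.current_text).strip()
            self.links.append((title, self.current_href))
            self.current_href = None
        elif tag == "h4":
            self.in_h4 = False


def extract_reviews(response_text):
    """Decode the JSON body and parse the HTML fragment stored under 'html'."""
    payload = json.loads(response_text)
    parser = HeadingLinkParser()
    parser.feed(payload.get("html", ""))
    parser.close()
    return parser.links


if __name__ == "__main__":
    # Made-up payload for illustration only; the real response may differ.
    sample = json.dumps(
        {"html": '<h4><a href="https://example.com/r/1">Some Review</a></h4>'}
    )
    print(extract_reviews(sample))
```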

    2020-05-16 00:20:23 [scrapy.core.engine] INFO: Spider opened
    2020-05-16 00:20:23 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
    2020-05-16 00:20:23 [wegotthiscovered] INFO: Spider opened: wegotthiscovered
    2020-05-16 00:20:23 [wegotthiscovered] INFO: Spider opened: wegotthiscovered
    2020-05-16 00:20:23 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
    2020-05-16 00:20:24 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://wegotthiscovered.com/reviews/> (referer: None)
    2020-05-16 00:20:30 [scrapy.core.engine] DEBUG: Crawled (200) <POST https://wegotthiscovered.com/wp-admin/admin-ajax.php> (referer: https://wegotthiscovered.com/reviews/)
    

    Either your IP is banned, or you are not running my code.

    【Discussion】:

    • After this, the same error keeps being thrown. Can you help me? 2020-05-15 21:41:45 [scrapy.core.engine] DEBUG: Crawled (400) wegotthiscovered.com/wp-admin/admin-ajax.php> (referer: wegotthiscovered.com/reviews) 2020-05-15 21:41:45 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response wegotthiscovered.com/wp-admin/admin-ajax.php>: HTTP status code is not handled or not allowed