【Question title】: Scrapy JSON returns the same contents
【Posted】: 2018-10-08 01:06:37
【Question description】:

I developed this Scrapy spider, which loops over 10 pages of a single site. The loop runs fine, and the log shows the correct list of URLs:

2018-10-08 07:59:54 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.lazada.vn/trang-diem/?page=8&ajax=true>
2018-10-08 07:59:54 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.lazada.vn/trang-diem/?page=9&ajax=true>

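But the results are always the same: every request returns the content of page 1. I tested the URLs in the shell, and they also work correctly from a browser. The problem only appears when running the Scrapy crawler. I have tried start_urls and the url approach, and the problem is always the same.

Any ideas?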

import scrapy
import json
import urllib
import time
import datetime
import re
from re import sub
from decimal import Decimal
#from prod.items import ProdItem
from staging.items import StagingItem
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

ts = time.time()
timestamp = datetime.datetime.fromtimestamp(ts).strftime('%Y-%m-%d')

class QuotesSpider(scrapy.Spider):
    name = "lazada2"

    def start_requests(self):
        # range(1, 10) yields pages 1 through 9
        for i in range(1, 10):
            url = 'https://www.lazada.vn/trang-diem/?page=%s&ajax=true' % i
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # Each response is a JSON document; the products are under mods.listItems
        data = json.loads(response.body)
        # Page number reported in the response (not used below)
        next_page = data['mainInfo']['page']
        for product in data['mods']['listItems']:
            item = StagingItem()
            item['collector_sku'] = product['name']
            # originalPrice is not present on every product
            item['collector_price_promo'] = product.get('originalPrice', '')
            item['collector_retailer'] = 'Lazada'
            item['collector_url'] = product['productUrl']
            item['collector_photo_url'] = product['image']
            item['collector_brand'] = product['brandName']
            item['collector_quantity'] = 'NA'
            item['collector_category'] = 'Makeup'
            item['collector_price'] = product['price']
            item['collector_timestamp'] = timestamp
            item['collector_local_id'] = ''
            item['collector_location_id'] = ''
            item['collector_location_name'] = ''
            item['collector_vendor_id'] = ''
            item['collector_vendor_name'] = ''
            yield item
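
To check whether the site really serves page 1 for every request, the page number reported in the JSON can be compared with the one in the requested URL. This is only a minimal sketch: it assumes that `mainInfo.page` holds the page the server actually served (the spider reads that field but the question does not say what it contains), and `requested_page`, `served_page` and `page_matches` are hypothetical helper names, not part of Scrapy. The sample body is hand-written, not real site output.

```python
import json
from urllib.parse import urlparse, parse_qs


def requested_page(url):
    """Return the `page` query parameter of a URL as an int, or None if absent."""
    values = parse_qs(urlparse(url).query).get("page")
    return int(values[0]) if values else None


def served_page(body):
    """Return the page number reported in the JSON body.

    Assumption: data['mainInfo']['page'] is the page the server served.
    """
    data = json.loads(body)
    return int(data["mainInfo"]["page"])


def page_matches(url, body):
    """True when the response reports the same page that was requested."""
    return requested_page(url) == served_page(body)


# Hand-written example: page 8 was requested, but the body reports page 1.
sample_url = "https://www.lazada.vn/trang-diem/?page=8&ajax=true"
sample_body = '{"mainInfo": {"page": 1}, "mods": {"listItems": []}}'
print(page_matches(sample_url, sample_body))
```

Inside `parse`, a call such as `page_matches(response.url, response.body)` could be logged with `self.logger.warning(...)` whenever it returns False, which would show directly in the crawl log which requests came back with the wrong page.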

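【Comments】:

  • What do you mean by "the results are always the same"? Are all the yielded items identical, or is the first page being scraped in place of every other page? I tested your spider and it works fine on my end. Do you have any active pipelines or settings?
  • Each page has 40 items, so I scrape 9 × 40 = 360 items in total, but I get the same content 9 times, identical to page 1. The pipeline writes to a MySQL database, and the inserts work correctly.
  • OK, so a request for page n > 1 returns the results of page 1. In other words, even when you request pages 2, 3, 4 and so on, the site returns page 1. A good guess is to disable cookies in the spider: go to settings.py, set COOKIES_ENABLED = False, and try running it again :)
  • I changed the setting and disabled cookies; same result.
  • You are right. Pagination appears to be controlled by cookies generated by JavaScript. You will have to reverse-engineer those cookie headers and replicate them in your spider.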
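
The cookie suggestions in the comments above map onto two Scrapy settings. The fragment below is a sketch of a settings.py excerpt, not the poster's actual configuration. `COOKIES_ENABLED` and `COOKIES_DEBUG` are real Scrapy settings; treating them as two alternative experiments is an assumption about how one might investigate.

```python
# settings.py (fragment)

# Option A, as suggested in the comments: turn the cookies middleware off.
# With it off, Scrapy neither stores cookies from responses nor sends any,
# so a `cookies=` argument passed to scrapy.Request is not sent either.
COOKIES_ENABLED = False

# Option B: keep cookies on and log every Cookie / Set-Cookie header, to see
# what the site sets between page requests. COOKIES_DEBUG has no effect
# while the cookies middleware is disabled, so use it instead of Option A.
# COOKIES_ENABLED = True
# COOKIES_DEBUG = True
```

Option A and any approach that passes cookies on the request are mutually exclusive, so a spider that replicates browser cookies needs `COOKIES_ENABLED` left on.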

Tags: json scrapy


【Solution 1】:

Use cookies and headers:

    # Replaces start_requests() in the question's spider
    def start_requests(self):
        for i in range(1, 10):
            # authority, scheme and Path mirror the browser's HTTP/2 pseudo-headers;
            # here they are sent as ordinary request headers
            headers = {
                "content-type": "application/json",
                "authority": "www.lazada.vn",
                "scheme": "https",
                "Accept-Language": "en-SG,en;q=0.9,en-US;q=0.8,zh-CN;q=0.7,zh;q=0.6,vi;q=0.5,fr;q=0.4",
                "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36",
                "Accept": "*/*",
                "Path": "/trang-diem/?page=%s" % i,
                "Referer": "https://www.lazada.vn/trang-diem/?page=%s&ajax=true" % i,
                "accept-encoding": "gzip, deflate, br"
            }
            # Cookie header string copied from the browser
            raw_cookie = "_uab_collina=153864259681792402093714; _bl_uid=qpj7jm4CuXhcUk26er9n7hnhyRqd; t_fv=1538642596635; t_uid=mbei2vPUviVx0oPB6KjX1uVgASJvw7dA; lzd_cid=07e3d81c-bb96-4608-be5d-542d35d39dff; lzd_sid=1d8bf18519bb7fd8fb661ac558726c4d; _tb_token_=58e7f715a30eb; cna=O5A8FGGivzcCAXNPwzeoH+5y; hng=VN|vi|VND|704; userLanguageML=vi; cto_lwid=c9ad6486-acac-465f-ab05-6e0b3744d1dc; _ga=GA1.2.1435138343.1538642600; _gid=GA1.2.19901051.1538642600; cto_axid=zGni0uxNaRyv441RxQNq7EZ_LS8xiGmL; JSESSIONID=85306FF3F7612F91677FC6ED978B42E1; isg=BJ6eL8eUSXz4CZ0YqjCefDlu7zTqVCYsGgm5Z0gmm-DyaztFsOyk6OZNZi9CoFrx"
            # scrapy.Request expects `cookies` as a mapping with one entry per cookie name
            cookies = dict(pair.strip().split("=", 1) for pair in raw_cookie.split(";"))
            body = "?ajax=true&page=%s" % i
            url = "https://www.lazada.vn/trang-diem/?ajax=true&page=%s" % i
            yield scrapy.Request(url=url, body=body, cookies=cookies, headers=headers, callback=self.parse)

【Comments】:

  • But this does not work either; it always returns the content of page 1.