【Question Title】: How to scrape anonymously using Scrapy, Tor, Privoxy and a custom User-Agent? (Windows 10)
【Posted】: 2017-12-21 16:15:19
【Question Description】:

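The answer to this question is hard to find because the information is scattered and the titles of related questions are sometimes misleading. The answer below gathers everything you need in one place.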

【Question Comments】:

    Tags: python-3.x scrapy tor privoxy


    【Solution 1】:

    Your spider should look like this.

    # based on https://doc.scrapy.org/en/latest/intro/tutorial.html
    
    import scrapy
    
    class QuotesSpider(scrapy.Spider):
        name = "quotes"
    
        def start_requests(self):
            urls = [
                'http://quotes.toscrape.com/page/1/',
                'http://quotes.toscrape.com/page/2/',
            ]
            for url in urls:
                print('\n\nurl:', url)
                # Use only one of the two yields below.
    
                # Normal crawl: the downloader middlewares process the request.
                yield scrapy.Request(url=url, callback=self.parse)
    
                # Alternative: check which IP the target site sees (has Tor changed it?).
                # yield scrapy.Request('http://icanhazip.com/', callback=self.is_tor_and_privoxy_used)
    
    
        def parse(self, response):
            # Save the page body, then print the proxy and User-Agent actually used.
            page = response.url.split("/")[-2]
            filename = 'quotes-%s.html' % page
            with open(filename, 'wb') as f:
                f.write(response.body)
            print('\n\nSpider: Start')
            print('Is proxy in response.meta?: ', response.meta)
            print('user_agent is: ', response.request.headers['User-Agent'])
            print('\n\nSpider: End')
            self.log('Saved file  ---  %s' % filename)
    
    
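        def is_tor_and_privoxy_used(self, response):
            # icanhazip.com answers with the IP address it sees for this request.
            print('\n\nSpider: Start')
            print("My IP is : " + response.text.strip())
            print("Is proxy in response.meta?: ", response.meta)  # request headers are not printed here
            print('\n\nSpider: End')
            # No file is written in this callback, so log the checked URL
            # (the original logged an undefined `filename`, which raises NameError).
            self.log('Checked IP via %s' % response.url)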
    

    You also need to add a few things to middlewares.py and settings.py. If you don't know how to do that, this will help you
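As a rough idea of what those additions might look like, here is a minimal sketch, not the configuration from the linked guide. It assumes Privoxy is listening on its default address `127.0.0.1:8118` and is already configured to forward to Tor. The class names `ProxyMiddleware` and `RandomUserAgentMiddleware`, the project name `myproject`, the User-Agent strings and the priority numbers are all hypothetical placeholders.

```python
# middlewares.py (sketch): Scrapy downloader middlewares are plain classes,
# so nothing needs to be imported from scrapy here.
import random

# Assumption: Privoxy runs locally on its default port and forwards to Tor.
PRIVOXY_URL = "http://127.0.0.1:8118"

# Hypothetical sample list; replace with the User-Agent strings you want to rotate.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:57.0) Gecko/20100101 Firefox/57.0",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/63.0.3239.84 Safari/537.36",
]


class ProxyMiddleware:
    """Send every request through the local Privoxy instance."""

    def process_request(self, request, spider):
        request.meta["proxy"] = PRIVOXY_URL
        return None  # let Scrapy continue processing the request


class RandomUserAgentMiddleware:
    """Pick a random User-Agent header for every request."""

    def process_request(self, request, spider):
        request.headers["User-Agent"] = random.choice(USER_AGENTS)
        return None


# settings.py (sketch): enable both middlewares.
# "myproject" and the priority values are illustrative.
DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.RandomUserAgentMiddleware": 400,
    "myproject.middlewares.ProxyMiddleware": 410,
}
```

With something like this in place, the `print` calls in `parse` should show a `proxy` key in `response.meta` and one of the listed User-Agent strings, and the `icanhazip.com` request should report the Tor exit node's IP rather than your own.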

    【Comments】:
