【Question title】: Setting up a proxy according to URL in Scrapy
【Posted】: 2017-05-08 12:42:46
【Question】:

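I have a list of URLs, some of which are .onion sites and others clear-net sites. I want Scrapy to pick the proxy based on the URL: a dedicated proxy for ordinary .com and .net sites, and a SOCKS5 proxy for .onion sites.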

from random import choice
import time
import pandas as pd
import scrapy

def random_dedicate_proxy():
    dedicated_ips = [proxy1, proxy2, proxy3]
    dedicated_proxies = [{'http': 'http://' + ip, 'https': 'https://' + ip} for ip in dedicated_ips]
    return choice(dedicated_proxies)

def proxy_selector(url):
    TOR_CLIENT = 'socks5h://127.0.0.1:9050'
    if '.onion' in url:
        proxy  = {'http': TOR_CLIENT, 'https': TOR_CLIENT}
    else:
        proxy = random_dedicate_proxy()
    return proxy

def get_urls_from_spreadsheet():
    fname = 'list_of_stuff.csv'
    url_df = pd.read_csv(fname,usecols=['URL'],keep_default_na=False)
    URL = url_df.URL.dropna()
    urls = [clean_url(url) for url in URL if url != '']
    return urls

class BasicSpider(scrapy.Spider):

    name = "basic"
    rotate_user_agent = True
    start_urls = get_urls_from_spreadsheet()


    def parse(self, response):
        item = StatusCehckerItem()
        item['url'] = response.url
        item['status_code'] = response.status
        item['time'] = time.time()
        response.meta['proxy'] = proxy_selector(response.url)
        return item

When I run this code, I get: DNSLookupError: DNS lookup failed: no results for hostname lookup: mqqrfjmfu2i73bjq.onion/.

【Comments】:

  • What proxy value are you passing in {'proxy': proxy} here?

Tags: python proxy scrapy tor socks


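【Solution 1】:

Make sure HTTPPROXY_ENABLED is set to True in the spider's settings. Then choose the proxy for each URL in your start_requests method and attach it to the request's meta.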

class BasicSpider(scrapy.Spider):

    custom_settings = {
        'HTTPPROXY_ENABLED': True # can also set this in the settings.py file
    }
    name = "basic"
    rotate_user_agent = True

    def start_requests(self):
        urls = get_urls_from_spreadsheet()
        for url in urls:
            proxy = proxy_selector(url)
            yield scrapy.Request(url=url, callback=self.parse, meta={'proxy': proxy})

    def parse(self, response):
        item = StatusCehckerItem()
        item['url'] = response.url
        item['status_code'] = response.status
        item['time'] = time.time()
        return item

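【Comments】:

  • Thanks for your answer. When I run your suggestion I get the error url = url.strip() AttributeError: 'dict' object has no attribute 'strip'. I think that is because my proxies are dicts with http and https entries. Is there a way to change this so Scrapy accepts them? I specified meta={'http_proxy': proxy, 'https_proxy': proxy}
  • AFAIK your meta needs to be a dict with the key proxy and a single URL as its value. So it should look like meta={'proxy': proxy}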
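
Following up on the last comment, here is a minimal sketch of a proxy_selector that returns one URL string instead of a dict, so it can be passed as meta={'proxy': ...}. The proxy addresses are hypothetical placeholders, not values from the question. The sketch also assumes that Scrapy's built-in proxy handling may not accept a socks5h:// URL directly, so it points .onion traffic at an HTTP proxy (for example Privoxy) that forwards to the Tor SOCKS port; that bridge and its address are assumptions. The host is checked with urlparse rather than a substring test, so a clear-net URL that merely contains ".onion" in its path is not routed through Tor.

```python
from random import choice
from urllib.parse import urlparse

# Hypothetical placeholder addresses; substitute your real proxy endpoints.
DEDICATED_PROXIES = [
    "http://203.0.113.10:8080",
    "http://203.0.113.11:8080",
]

# Assumption: an HTTP proxy (e.g. Privoxy) listening locally and
# forwarding requests to the Tor SOCKS port at 127.0.0.1:9050.
TOR_HTTP_PROXY = "http://127.0.0.1:8118"


def proxy_selector(url):
    """Return a single proxy URL string suitable for request.meta['proxy']."""
    # hostname is None for URLs without a scheme, so fall back to "".
    host = urlparse(url).hostname or ""
    if host.endswith(".onion"):
        return TOR_HTTP_PROXY
    return choice(DEDICATED_PROXIES)
```

With this version, the start_requests method from the answer can stay as written, since meta={'proxy': proxy_selector(url)} now carries a plain string.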