【Question title】:Scrapy, privoxy and Tor: SocketError: [Errno 61] Connection refused
【Posted】:2017-12-15 15:35:49
【Question】:

I am using Scrapy with Privoxy and Tor. This follows on from my previous question, Scrapy with Privoxy and Tor: how to renew IP. Here is the spider:

from scrapy.contrib.spiders import CrawlSpider
from scrapy.selector import Selector
from scrapy.http import Request

class YourCrawler(CrawlSpider):
    name = "****"
    start_urls = [
    'https://****.com/listviews/titles.php',
    ]
    allowed_domains = ["****.com"]

    def parse(self, response):
        # go to the urls in the list
        s = Selector(response)
        page_list_urls = s.xpath('//*[@id="tab7"]/article/header/h2/a/@href').extract()
        for url in page_list_urls:
            yield Request(response.urljoin(url), callback=self.parse_following_urls, dont_filter=True)

        # Then go to the next page (link selected below) and begin again
        next_page = response.css('ul.pagin li.presente ~ li a::attr(href)').extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield Request(next_page, callback=self.parse)

    # For the urls in the list, go inside, and in div#main, take the div.ficha > div.caracteristicas > ul > li
    def parse_following_urls(self, response):
        #Parsing rules go here
        for each_book in response.css('main#main'):
            yield {
                'editor': each_book.css('header.datos1 > ul > li > h5 > a::text').extract(),
            }

In settings.py I have user-agent rotation and Privoxy configured:

DOWNLOADER_MIDDLEWARES = {
        #user agent
        'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware' : None,
        '****.comm.rotate_useragent.RotateUserAgentMiddleware' :400,
        #privoxy
        'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
        '****.middlewares.ProxyMiddleware': 100
    }

In middlewares.py I added:

from stem import Signal
from stem.control import Controller

def _set_new_ip():
    with Controller.from_port(port=9051) as controller:
        controller.authenticate(password='tor_password')
        controller.signal(Signal.NEWNYM)

class ProxyMiddleware(object):
    def process_request(self, request, spider):
        _set_new_ip()
        request.meta['proxy'] = 'http://127.0.0.1:8118'
        spider.log('Proxy : %s' % request.meta['proxy'])

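If I remove the def _set_new_ip(): function from middlewares.py (and the call to it in class ProxyMiddleware(object):), the spider works. But I want the spider to use a new IP for every request, which is why I added it. The problem is that every time I try to run the spider it fails with SocketError: [Errno 61] Connection refused and this traceback: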

Traceback (most recent call last):
  File "/usr/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 1386, in _inlineCallbacks
    result = g.send(result)
  File "/usr/local/lib/python2.7/site-packages/scrapy/core/downloader/middleware.py", line 37, in process_request
    response = yield method(request=request, spider=spider)
  File "/Users/nikita/scrapy/***/***/middlewares.py", line 71, in process_request
    _set_new_ip()
  File "/Users/nikita/scrapy/***/***/middlewares.py", line 65, in _set_new_ip
    with Controller.from_port(port=9051) as controller:
  File "/usr/local/lib/python2.7/site-packages/stem/control.py", line 998, in from_port
    control_port = stem.socket.ControlPort(address, port)
  File "/usr/local/lib/python2.7/site-packages/stem/socket.py", line 372, in __init__
    self.connect()
  File "/usr/local/lib/python2.7/site-packages/stem/socket.py", line 243, in connect
    self._socket = self._make_socket()
  File "/usr/local/lib/python2.7/site-packages/stem/socket.py", line 401, in _make_socket
    raise stem.SocketError(exc)
SocketError: [Errno 61] Connection refused
2017-07-11 15:50:28 [scrapy.core.engine] INFO: Closing spider (finished)

Maybe the problem is the port used in with Controller.from_port(port=9051) as controller:, but I'm not sure. If anyone has a good idea...
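Errno 61 is ECONNREFUSED on macOS, meaning nothing accepted the TCP connection on that port, so one way to test the guess above is to probe the ports directly. The following is a minimal sketch using only the standard library and written for Python 3. The helper name `port_is_open` is hypothetical, and 9050, 9051 and 8118 are the conventional default ports assumed here; your torrc or Privoxy config may use different ones.

```python
import socket


def port_is_open(host, port, timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        # create_connection raises OSError (e.g. ConnectionRefusedError) on failure
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Assumed default ports: Tor SOCKS 9050, Tor control 9051, Privoxy 8118
    for name, port in [("Tor SOCKS", 9050), ("Tor control", 9051), ("Privoxy", 8118)]:
        state = "open" if port_is_open("127.0.0.1", port) else "refused/closed"
        print("%-12s 127.0.0.1:%d -> %s" % (name, port, state))
```

If 9050 is open but 9051 is not, Tor is probably running without its control port enabled. That is normally switched on with a ControlPort 9051 line in torrc, together with HashedControlPassword or cookie authentication.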

Edit ---

OK, if I open a browser and go to http://127.0.0.1:8118/, I get:

503 
This is Privoxy 3.0.26 on localhost (127.0.0.1), port 8118, enabled
Forwarding failure
Privoxy was unable to socks5-forward your request http://127.0.0.1:8118/ through localhost: SOCKS5 request failed

Just try again to see if this is a temporary problem, or check your forwarding settings and make sure that all forwarding servers are working correctly and listening where they are supposed to be listening.

So it may be related to the SOCKS5 configuration... does anyone know?

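【Comments】:

  • Take a look here for how to connect to Tor using stem.
  • OK, on that site they discuss the authenticate() function. In the example they give, they first create control_socket = stem.socket.ControlPort(port = 9051) and then call stem.connection.authenticate(control_socket). Should I put both of them inside the ProxyMiddleware class?
  • OK, I understand that I have to call the connect() function somewhere, but where? I tried a few options and none of them worked...
  • I found something and have updated the question.
  • Are you sure that Tor is running, and that Privoxy is set up correctly for Tor and working?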

Tags: python web-scraping scrapy tor


【Solution 1】:

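My guesses are:

  1. Tor is not running. To check, run ps (e.g. ps -ax | grep tor) and netstat (e.g. on Mac: netstat -an | grep 'your tor portnumber'; on Linux, replace -an with -tulnp) in a terminal and see whether Tor is actually running and listening.
  2. Your forwarding settings are not set up correctly. Judging by the 503 error message, the forwarding rule appears to be misconfigured (assuming Tor is running). In Privoxy's config file, make sure the line forward-socks5t / 127.0.0.1:9050 . is uncommented.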
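The second check can be automated by scanning the Privoxy config for an active (uncommented) SOCKS forward line. This is only a sketch: `active_socks_forwards` is a hypothetical helper, and the sample text stands in for a real config file.

```python
def active_socks_forwards(config_text):
    """Return the uncommented forward-socks* directives in a Privoxy config."""
    active = []
    for raw in config_text.splitlines():
        line = raw.strip()
        # Skip blank lines and lines commented out with '#'
        if not line or line.startswith("#"):
            continue
        parts = line.split()
        # Matches forward-socks4, forward-socks4a, forward-socks5, forward-socks5t
        if parts[0].startswith("forward-socks"):
            active.append(line)
    return active


# Sample config: the first forward line is commented out, the second is active
sample = """
# forward-socks5t / 127.0.0.1:9050 .
forward-socks5t / 127.0.0.1:9050 .
listen-address 127.0.0.1:8118
"""
print(active_socks_forwards(sample))
```

To run it against a real install, pass in the contents of your Privoxy config file. Its location varies by platform (for example /etc/privoxy/config on many Linux systems, which is an assumption to verify locally). An empty result would suggest that Privoxy is not forwarding to Tor at all.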

【Comments】:

    【Solution 2】:
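    from stem import Signal
    from stem.control import Controller

    class ProxyMiddleware(object):
        def process_request(self, request, spider):
            # Nested helper that asks Tor for a new identity via the control port.
            # Note: nothing in this snippet actually calls it.
            def _set_new_ip():
                with Controller.from_port(port=9051) as controller:
                    controller.authenticate(password='PASSWORDHERE')
                    controller.signal(Signal.NEWNYM)
            # Route the request through the local Privoxy instance
            request.meta['proxy'] = 'http://127.0.0.1:8118'
            spider.log('Proxy : %s' % request.meta['proxy'])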
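Requesting a new identity on every single request, as the question does, may not behave as intended even once the control port is reachable, because Tor rate-limits NEWNYM signals. Below is a sketch of a throttled variant under stated assumptions: the names `signal_newnym` and `ThrottledRenewer`, the 10-second interval and the password placeholder are all hypothetical, and stem is imported lazily so that the throttling logic can be exercised without it.

```python
import time


def signal_newnym(port=9051, password="PASSWORDHERE"):
    """Ask Tor for a new identity via the control port (requires stem)."""
    # Imported lazily so the rest of this sketch runs without stem installed
    from stem import Signal
    from stem.control import Controller

    with Controller.from_port(port=port) as controller:
        controller.authenticate(password=password)
        controller.signal(Signal.NEWNYM)


class ThrottledRenewer(object):
    """Call `renew` at most once every `min_interval` seconds."""

    def __init__(self, renew=signal_newnym, min_interval=10.0, clock=time.time):
        self.renew = renew
        self.min_interval = min_interval  # assumed value, tune as needed
        self.clock = clock
        self._last = None

    def maybe_renew(self):
        now = self.clock()
        if self._last is not None and now - self._last < self.min_interval:
            return False  # too soon, keep the current identity
        self.renew()
        self._last = now
        return True


class ProxyMiddleware(object):
    def __init__(self):
        self.renewer = ThrottledRenewer()

    def process_request(self, request, spider):
        # Renew the Tor identity only if enough time has passed
        self.renewer.maybe_renew()
        request.meta['proxy'] = 'http://127.0.0.1:8118'
```

The `renew` and `clock` parameters exist only to make the throttle easy to test with stand-ins; in a spider the defaults would be used.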
    

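    【Comments】:

    • As it's currently written, your answer is unclear. Please edit to add additional details that will help others understand how this addresses the question asked. You can find more information on how to write good answers in the help center.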