Scraping web-page data with urllib with headers and a proxy: answers

【Question title】: Scraping web-page data with urllib with headers and proxy
【Posted】: 2016-05-05 18:25:49
【Question】:

I can already fetch the web-page data, but now I want to fetch it through a proxy. How do I do that?

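import urllib.request
import lxml.html as lh  # assumption: "lh" in the original is lxml.html

def get_main_html():
    # URL and headers are defined elsewhere in the asker's script
    request = urllib.request.Request(URL, headers=headers)
    doc = lh.parse(urllib.request.urlopen(request))
    return doc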

【Comments】:

Tags: python proxy web-scraping urllib http-proxy

【Solution 1】:

You can use SocksiPy:

# Python 2 example (urllib2 and the print statement)
import ftplib
import telnetlib
import urllib2
import socks

# Set the proxy information
socks.setdefaultproxy(socks.PROXY_TYPE_SOCKS5, 'localhost', 9050)

# Route an FTP session through the SOCKS proxy
socks.wrapmodule(ftplib)
ftp = ftplib.FTP('cdimage.ubuntu.com')
ftp.login('anonymous', 'support@aol.com')
print ftp.dir('cdimage')
ftp.close()

# Route a telnet connection through the SOCKS proxy
socks.wrapmodule(telnetlib)
tn = telnetlib.Telnet('achaea.com')
print tn.read_very_eager()
tn.close()

# Route an HTTP request through the SOCKS proxy
socks.wrapmodule(urllib2)
print urllib2.urlopen('http://www.whatismyip.com/automation/n09230945.asp').read()

In your case:

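import urllib.request
import lxml.html as lh  # assumption: "lh" in the question is lxml.html
import socks

# Set the proxy information
socks.setdefaultproxy(socks.PROXY_TYPE_SOCKS5, 'localhost', 9050)
# Wrap urllib.request rather than the bare urllib package:
# wrapmodule() needs a module that itself imports socket
socks.wrapmodule(urllib.request)

def get_main_html():
    # URL and headers are defined elsewhere in the asker's script
    request = urllib.request.Request(URL, headers=headers)
    doc = lh.parse(urllib.request.urlopen(request))
    return doc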

【Comments】:

【Solution 2】:

Use:

import urllib  # Python 2 only: urllib.urlopen() and its proxies argument are gone in Python 3

proxies = {'http': 'http://myproxy.example.com:1234'}
print "Using HTTP proxy %s" % proxies['http']
urllib.urlopen("http://yoursite", proxies=proxies)
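
The question uses Python 3's urllib.request, where urlopen() takes no proxies argument. A rough Python 3 equivalent, offered as a sketch rather than a definitive implementation, is to build an opener around a ProxyHandler and send the Request (with its headers) through that opener. The proxy address and the User-Agent value are placeholders, and build_proxy_opener and fetch are hypothetical helper names introduced only for illustration.

```python
import urllib.request

# Placeholder values (assumptions): substitute your own proxy and headers.
PROXY = "http://myproxy.example.com:1234"
HEADERS = {"User-Agent": "Mozilla/5.0"}


def build_proxy_opener(proxy_url):
    """Return an opener that routes http and https requests through proxy_url."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


def fetch(url, opener, headers):
    """Fetch url through the given opener, sending the given request headers."""
    request = urllib.request.Request(url, headers=headers)
    with opener.open(request, timeout=30) as response:
        return response.read()


opener = build_proxy_opener(PROXY)
# html = fetch("http://yoursite", opener, HEADERS)  # needs a reachable proxy
```

Passing the opener around explicitly keeps the proxy local to these calls. Calling urllib.request.install_opener(opener) instead would make plain urllib.request.urlopen() use the proxy everywhere in the process.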

【Comments】:

【Solution 3】:

From the documentation:

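urllib will auto-detect your proxy settings and use them. This happens through the ProxyHandler, which is part of the normal handler chain when a proxy setting is detected. Normally that is a good thing, but there are occasions when it is not helpful. One way to turn it off is to set up our own ProxyHandler with no proxies defined. This is done using steps similar to setting up a Basic Authentication handler.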

See https://docs.python.org/3/howto/urllib2.html#proxies
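
The "ProxyHandler with no proxies defined" step from that passage can be sketched as follows, assuming Python 3. The variable names are illustrative. The getproxies() call is included only to show what urllib would otherwise pick up from the environment.

```python
import urllib.request

# What urllib would auto-detect (for example from http_proxy / https_proxy).
detected = urllib.request.getproxies()

# An empty mapping defines no proxies, so requests made through this
# opener go out directly instead of using the detected settings.
no_proxy_handler = urllib.request.ProxyHandler({})
opener = urllib.request.build_opener(no_proxy_handler)

# Optional: make this opener the default for urllib.request.urlopen().
urllib.request.install_opener(opener)
```

To force a specific proxy instead of disabling detection, pass a mapping such as {'http': 'http://myproxy.example.com:1234'} to ProxyHandler in place of the empty one.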

【Comments】: