【Question title】:Scraping web-page data with urllib with headers and proxy
【Posted】:2016-05-05 18:25:49
【Question】:
I can already fetch the web-page data, but now I want to fetch it through a proxy. How can I do that?
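import urllib.request
import lxml.html as lh  # assumption: lh is lxml.html (its import is not shown in the original)

def get_main_html():
    # URL and headers are defined elsewhere
    request = urllib.request.Request(URL, headers=headers)
    doc = lh.parse(urllib.request.urlopen(request))
    return doc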
Tags:
python
proxy
web-scraping
urllib
http-proxy
【Solution 1】:
You can use SocksiPy:
import ftplib
import telnetlib  # note: removed from the standard library in Python 3.13
import urllib.request
import socks

# Set the proxy information
socks.setdefaultproxy(socks.PROXY_TYPE_SOCKS5, 'localhost', 9050)

# Route an FTP session through the SOCKS proxy
socks.wrapmodule(ftplib)
ftp = ftplib.FTP('cdimage.ubuntu.com')
ftp.login('anonymous', 'support@aol.com')
ftp.dir('cdimage')  # dir() prints the listing itself and returns None
ftp.close()

# Route a telnet connection through the SOCKS proxy
socks.wrapmodule(telnetlib)
tn = telnetlib.Telnet('achaea.com')
print(tn.read_very_eager())
tn.close()

# Route an HTTP request through the SOCKS proxy
socks.wrapmodule(urllib.request)
print(urllib.request.urlopen('http://www.whatismyip.com/automation/n09230945.asp').read())
In your case:
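import urllib.request
import lxml.html as lh  # assumption: lh is lxml.html, as in the question
import socks

# Set the proxy information
socks.setdefaultproxy(socks.PROXY_TYPE_SOCKS5, 'localhost', 9050)
# Wrap urllib.request, which imports socket; the bare urllib package does not
socks.wrapmodule(urllib.request)

def get_main_html():
    # URL and headers are defined elsewhere, as in the question
    request = urllib.request.Request(URL, headers=headers)
    doc = lh.parse(urllib.request.urlopen(request))
    return doc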
【Solution 2】:
Use:
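import urllib.request
proxies = {'http': 'http://myproxy.example.com:1234'}
print("Using HTTP proxy %s" % proxies['http'])
# In Python 3, urlopen() has no proxies argument; pass them to a ProxyHandler instead
opener = urllib.request.build_opener(urllib.request.ProxyHandler(proxies))
opener.open("http://yoursite")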
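Since the question asks for both custom headers and a proxy, here is a minimal sketch that combines the two with the Python 3 standard library only. The helper names `make_proxy_opener` and `make_request`, the `User-Agent` value, and the proxy address are illustrative assumptions, not part of the original answers. The network call is left commented out so the snippet can be read and run without a live proxy.

```python
import urllib.request


def make_proxy_opener(proxies):
    """Return an opener that routes requests through the given proxies."""
    return urllib.request.build_opener(urllib.request.ProxyHandler(proxies))


def make_request(url, headers):
    """Build a Request object that carries custom headers."""
    return urllib.request.Request(url, headers=headers)


# Hypothetical values, for illustration only
PROXIES = {'http': 'http://myproxy.example.com:1234'}
HEADERS = {'User-Agent': 'Mozilla/5.0'}

opener = make_proxy_opener(PROXIES)
request = make_request('http://yoursite', HEADERS)

# Uncomment to perform the request through the proxy:
# response = opener.open(request)
# html = response.read()
```

Using a dedicated opener keeps the proxy setting local to the calls made through it. If every `urllib.request.urlopen` call should go through the proxy, `urllib.request.install_opener(opener)` makes the opener the global default.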