I spent two days writing a crawler, and ran into plenty of problems.
1. My regular expressions didn't cover all the cases. A stat could appear as a float, an integer, or be missing entirely, and a sloppy pattern either matched nothing or matched the wrong value.
Because I didn't verify the output in time, the whole "attack growth" field was never captured — a serious slip, but too late to fix now.
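A pattern that accepts both integers and decimals would have avoided this. A minimal sketch (the pattern name and sample strings are illustrative, not the original scraper's):

```python
import re

# Matches an integer or a decimal: one or more digits,
# optionally followed by a dot and more digits.
NUMBER = re.compile(r'\d+(?:\.\d+)?')

for text in ['attack 57', 'attack 3.375', 'no stat here']:
    m = NUMBER.search(text)
    print(m.group() if m else 'no match')
```

Checking `search()` for `None` also covers the "missing entirely" case instead of crashing on it.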
2. Python versions matter: 2.7 and 3.3 differ a lot, with changes to the internal string encoding and a reworked urllib package, so most sample code online was of little use and slowed the whole project down.
By the time I noticed, the code was essentially finished and too late to rewrite.
3. I didn't research beforehand. LOL data is available on many sites, but instead of surveying several I picked 178 at random as the target and ran into what was probably an anti-crawler system. Checking a few more sites first might have halved the effort.
4. I had no awareness of anti-crawler measures, and my understanding of HTTP was too shallow.
Worth improving next time: send browser-style request headers to disguise the crawler.
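The improvement suggested above might look like this sketch (the target URL and the User-Agent string are just examples; the real request is commented out):

```python
import urllib.request

url = 'http://www.178.com/'
# Pretend to be a regular browser; many simple anti-crawler checks
# only inspect the User-Agent header.
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36'}
req = urllib.request.Request(url, headers=headers)
# html = urllib.request.urlopen(req).read()  # actual network call
print(req.get_header('User-agent'))
```

Note that `Request` stores header names capitalized, so they are read back as `'User-agent'`.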
http://www.cnblogs.com/txw1958/archive/2011/12/21/2295698.html
Common urllib functions in Python 3.3
1. The simplest request
import urllib.request
response = urllib.request.urlopen('http://python.org/')
html = response.read()
2. Using Request
import urllib.request
req = urllib.request.Request('http://python.org/')
response = urllib.request.urlopen(req)
the_page = response.read()
3. Sending data
#! /usr/bin/env python3
import urllib.parse
import urllib.request
url = 'http://localhost/login.php'
values = {
    'act' : 'login',
    'login[email]' : 'yzhang@i9i8.com',
    'login[password]' : '123456'
}
# urlencode() returns a str; in Python 3 the request body must be bytes
data = urllib.parse.urlencode(values).encode('utf-8')
req = urllib.request.Request(url, data)
req.add_header('Referer', 'http://www.python.org/')
response = urllib.request.urlopen(req)
the_page = response.read()
print(the_page.decode("utf8"))
4. Sending data and headers
#! /usr/bin/env python3
import urllib.parse
import urllib.request
url = 'http://localhost/login.php'
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
values = {
    'act' : 'login',
    'login[email]' : 'yzhang@i9i8.com',
    'login[password]' : '123456'
}
headers = { 'User-Agent' : user_agent }
# encode the body to bytes, as urlopen requires in Python 3
data = urllib.parse.urlencode(values).encode('utf-8')
req = urllib.request.Request(url, data, headers)
response = urllib.request.urlopen(req)
the_page = response.read()
print(the_page.decode("utf8"))
5. HTTP errors
#! /usr/bin/env python3
import urllib.error
import urllib.request
req = urllib.request.Request('http://www.python.org/fish.html')
try:
    urllib.request.urlopen(req)
except urllib.error.HTTPError as e:
    print(e.code)
    print(e.read().decode("utf8"))
6. Exception handling, approach 1
#! /usr/bin/env python3
from urllib.request import Request, urlopen
from urllib.error import URLError, HTTPError
req = Request("http://twitter.com/")
try:
    response = urlopen(req)
except HTTPError as e:
    # HTTPError is a subclass of URLError, so it must be caught first
    print("The server couldn't fulfill the request.")
    print('Error code: ', e.code)
except URLError as e:
    print('We failed to reach a server.')
    print('Reason: ', e.reason)
else:
    print("good!")
    print(response.read().decode("utf8"))
7. Exception handling, approach 2
#! /usr/bin/env python3
from urllib.request import Request, urlopen
from urllib.error import URLError
req = Request("http://twitter.com/")
try:
    response = urlopen(req)
except URLError as e:
    if hasattr(e, 'reason'):
        # note: in Python 3, HTTPError also has a .reason attribute,
        # so HTTP errors take this branch as well
        print('We failed to reach a server.')
        print('Reason: ', e.reason)
    elif hasattr(e, 'code'):
        print("The server couldn't fulfill the request.")
        print('Error code: ', e.code)
else:
    print("good!")
    print(response.read().decode("utf8"))
8. HTTP authentication
#! /usr/bin/env python3
import urllib.request
# create a password manager
password_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
# Add the username and password.
# If we knew the realm, we could use it instead of None.
top_level_url = "https://cms.tetx.com/"
password_mgr.add_password(None, top_level_url, 'yzhang', 'cccddd')
handler = urllib.request.HTTPBasicAuthHandler(password_mgr)
# create "opener" (OpenerDirector instance)
opener = urllib.request.build_opener(handler)
# use the opener to fetch a URL
a_url = "https://cms.tetx.com/"
x = opener.open(a_url)
print(x.read())
# Install the opener.
# Now all calls to urllib.request.urlopen use our opener.
urllib.request.install_opener(opener)
a = urllib.request.urlopen(a_url).read().decode('utf8')
print(a)
9. Using a proxy
#! /usr/bin/env python3
import urllib.request
# ProxyHandler keys are URL schemes ('http', 'https'); urllib has no
# built-in SOCKS support, so a 'sock5' key would simply be ignored
proxy_support = urllib.request.ProxyHandler({'http': 'http://localhost:1080'})
opener = urllib.request.build_opener(proxy_support)
urllib.request.install_opener(opener)
a = urllib.request.urlopen("http://g.cn").read().decode("utf8")
print(a)
10. Timeouts
#! /usr/bin/env python3
import socket
import urllib.request
# timeout in seconds
timeout = 2
socket.setdefaulttimeout(timeout)
# this call to urllib.request.urlopen now uses the default timeout
# we have set in the socket module
req = urllib.request.Request('http://twitter.com/')
a = urllib.request.urlopen(req).read()
print(a)
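As an alternative sketch: `urlopen` also accepts a per-call `timeout` argument, which avoids changing the process-wide socket default (the URL is the same example as above; the call is wrapped so a network failure just prints):

```python
import urllib.request

req = urllib.request.Request('http://twitter.com/')
a = None
try:
    # timeout applies to this call only; the global socket
    # default set via setdefaulttimeout() is left untouched
    a = urllib.request.urlopen(req, timeout=2).read()
except OSError as e:  # URLError and socket.timeout are OSError subclasses
    print('request failed:', e)
```

This is usually preferable in larger programs, where a global default can surprise unrelated socket code.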