【Question title】: Getting garbled output when crawling with urllib2 (Python 2.7)
【Posted】: 2014-12-09 03:49:05
【Question】:

I used urllib2, but the response looks like this:

'\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x00\xed\x9d{s\xe2\xb8\x9a\x87\xff\xeeSu\xbe\x83\x0e\xc5\xce\xf4L\x11\xe2\x0b\xd7\x99\xee\xec*`\xc0\x1d_86\x84N\xb6\xb6\xa6\x1cp\x82\xa7\tfl\x93t\xce\xa7_\xc9@ \xc4\x18\xc5\t\xf1\xa8\xad\xae\x99t\xdb\xb1\xe5\xd7\x92~\xef\xa3W7\x7f\xfaWSo\xf4.\xba\x12\x18\x07\xb7\x13\xd0\xed\x9f*r\x03\xe4\x8e\x8e\x8f\x07b\xe3\xf8\xb8\xd9k\x82\xaf\x9d\x9e\xaa\x00\xbe\xc8\x81\x9egM}\'p\xdc\xa959>\x96\xb4\x1c\xc8\x8d\ 

My code is:

import urllib
import urllib2

url = "http://fsr.merckresponsibility.com/fsr/service.do?"
params = {"page": 2, "sort": "name", "descending": "asc", "letter": "all", "keytype": "", "keywords": "", "rows": 80}
params = urllib.urlencode(params)
header = {"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
          "Accept-Encoding": "gzip, deflate, sdch",
          "Accept-Language": "en-US,en;q=0.8,zh-CN;q=0.6,zh;q=0.4",
          "Connection": "keep-alive",
          "Cache-Control": "max-age=0",
          "Host": "fsr.merckresponsibility.com",
          "Cookie": "JSESSIONID=5D0AB9801BC9B522B043FC10C1705AF1.st3024;unique_visitor=60.254.142.39.1418022044678466; BIGipServerDMZ-04-Shared-HTTP=2926383",
          "Referer": "http://fsr.merckresponsibility.com/fsr/service.do",
          "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko)Chrome/39.0.2171.71 Safari/537.36"}

req = urllib2.Request(url + params, headers=header)
response = urllib2.urlopen(req).read()

Can anyone tell me what went wrong?

【Comments】:

    Tags: python python-2.7 urllib2 urllib


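    【Answer 1】:

    The code sends the header "Accept-Encoding": "gzip, deflate, sdch", which tells the server it may compress the response, and the server answers with gzip-encoded content. The leading bytes \x1f\x8b in the output are the gzip magic number, so what looks like garbage is the compressed page.

    Removing that header will solve your problem.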
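As a rough sketch of the "remove the header" option, the snippet below drops any Accept-Encoding entry from a header dictionary before it is handed to urllib2.Request. The helper name `without_accept_encoding` and the shortened `header` dictionary are illustrative assumptions, not part of the original code.

```python
def without_accept_encoding(headers):
    """Return a copy of headers with any Accept-Encoding entry removed."""
    # HTTP header names are case-insensitive, so compare in lower case.
    return dict((name, value) for name, value in headers.items()
                if name.lower() != "accept-encoding")


# Shortened version of the header dictionary from the question.
header = {
    "Accept-Encoding": "gzip, deflate, sdch",
    "Connection": "keep-alive",
    "Host": "fsr.merckresponsibility.com",
}
clean_header = without_accept_encoding(header)
```

`clean_header` could then be passed as `headers=clean_header` when building the request, leaving the original dictionary untouched.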


    If you do want to use gzip encoding, you need to decompress the response with the gzip module:

    ...
    response = urllib2.urlopen(req).read()
    
    import gzip
    import StringIO

    # Wrap the raw bytes in a file-like object so GzipFile can read them.
    f = StringIO.StringIO(response)
    zf = gzip.GzipFile(fileobj=f)
    response = zf.read()  # decompressed page content
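A slightly more defensive variant, offered only as a sketch: decompress only when the body actually starts with the gzip magic bytes, so the same code still works if the server sends an uncompressed page. It uses `zlib.decompress` with gzip framing rather than the `gzip.GzipFile` approach above, and `decode_body` is a hypothetical helper name. The sample payload is built locally so the example runs without a network call.

```python
import zlib

GZIP_MAGIC = b"\x1f\x8b"


def decode_body(body):
    """Return body decompressed if it looks like gzip, else unchanged."""
    if body[:2] == GZIP_MAGIC:
        # wbits = 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
        return zlib.decompress(body, 16 + zlib.MAX_WBITS)
    return body


# Build a small gzip payload locally to stand in for the server response.
compressor = zlib.compressobj(6, zlib.DEFLATED, 16 + zlib.MAX_WBITS)
sample = compressor.compress(b"<html>hello</html>") + compressor.flush()

page = decode_body(sample)
```

In the question's code, the bytes returned by `urllib2.urlopen(req).read()` would take the place of `sample`.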
    

