【Question title】: Downloading a webpage using urllib2 results in garbled junk? (only sometimes)
【Posted】: 2011-04-22 05:35:36
【Question】:

How come when I request this webpage, I get HTML text:

http://itunes.apple.com/us/app/mobile/id381057839

But when I request this webpage, I get garbled junk?

http://itunes.apple.com/us/app/mobile/id375562663

I use the same download() function in Python for both, shown here:

import socket
import urllib2

def download(source_url):
    try:
        socket.setdefaulttimeout(10)
        agent = "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2.10) Gecko/20100914 AlexaToolbar/alxf-1.54 Firefox/3.6.10 GTB7.1"
        ree = urllib2.Request(source_url)
        ree.add_header('User-Agent', agent)
        ree.add_header("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8")
        ree.add_header("Accept-Language", "en-us,en;q=0.5")
        ree.add_header("Accept-Charset", "ISO-8859-1,utf-8;q=0.7,*;q=0.7")
        ree.add_header("Accept-Encoding", "gzip,deflate")
        ree.add_header("Host", "itunes.apple.com")
        resp = urllib2.urlopen(ree)
        htmlSource = resp.read()
        return htmlSource
    except Exception, e:
        print e

【Question comments】:

    Tags: python http api rest urllib2


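    【Solution 1】:

    Solved. It was a compression issue: the response body was gzip-compressed and has to be decompressed before it can be read as HTML.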

    import gzip
    import random
    import socket
    import StringIO
    import urllib2

    def download(source_url):
        try:
            socket.setdefaulttimeout(10)
            agents = ['Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)','Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.1)','Microsoft Internet Explorer/4.0b1 (Windows 95)','Opera/8.00 (Windows NT 5.1; U; en)']
            ree = urllib2.Request(source_url)
            ree.add_header('User-Agent', random.choice(agents))
            # Request gzip only; the body is assumed to be gzip-compressed below.
            ree.add_header('Accept-encoding', 'gzip')
            opener = urllib2.build_opener()
            h = opener.open(ree).read()

            # Wrap the compressed bytes in a file-like object and decompress them.
            compressedstream = StringIO.StringIO(h)
            gzipper = gzip.GzipFile(fileobj=compressedstream)
            data = gzipper.read()
            return data

        except Exception, e:
            print e
            return ""
    

    【Discussion】:

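    • Well, maybe you should check whether the resource is actually compressed, unless you are sure you will always get a compressed response.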
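
One way to act on that comment is to look at the response's Content-Encoding header and decompress only when the server says the body is compressed. The following is a minimal sketch, not the original poster's code. It assumes Python 3, where `urllib.request` replaces `urllib2`. The helper name `decode_body` and the `timeout` parameter are hypothetical, introduced here for illustration.

```python
import gzip
import urllib.request
import zlib


def decode_body(raw, content_encoding):
    """Decompress raw bytes according to a Content-Encoding header value.

    Hypothetical helper: returns the bytes unchanged when the header is
    missing or names an encoding this sketch does not handle.
    """
    encoding = (content_encoding or "").strip().lower()
    if encoding == "gzip":
        return gzip.decompress(raw)
    if encoding == "deflate":
        try:
            # Most servers send zlib-wrapped deflate data.
            return zlib.decompress(raw)
        except zlib.error:
            # Some servers send a raw deflate stream without the zlib header.
            return zlib.decompress(raw, -zlib.MAX_WBITS)
    return raw


def download(source_url, timeout=10):
    """Fetch a URL and return the body, decompressed only if needed."""
    request = urllib.request.Request(
        source_url, headers={"Accept-Encoding": "gzip"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return decode_body(resp.read(), resp.headers.get("Content-Encoding"))
```

With this shape, a server that ignores the Accept-Encoding header and sends plain HTML still yields usable output, because the body is passed through untouched instead of failing inside the gzip reader.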