【Title】: urllib2, mechanize not returning same as browser - what else to spoof?
【Posted】: 2012-06-11 16:08:11
【Question】:

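I am trying to write a script (purely for learning purposes) that translates a given word using several different dictionaries. I finished two of them, using urllib2 and BeautifulSoup to fetch and parse the translations, and then moved on to Google Translate.

I quickly found that it returns a 403 Forbidden error. Adding a User-Agent gets a translation, but only a single-word one. For example, if you go to http://translate.google.com/?text=test&sl=en&tl=es in a browser, you get the translation (in a class named "hps") plus lists of verbs, nouns and adjectives. With the script below, however, the HTML is different: only the main translation is returned, inside a

span id=result_box

and the verbs, nouns, etc. are nowhere to be found.

While working on this I did some googling and realized there is now an API, and it is not free. I don't intend to publish any final script or use it to violate any TOS; at this point I am mostly interested in why the browser and urllib etc. get different results.

To that end I tried plain urllib2 with a User-Agent, and mechanize, as shown below. So my question is: apart from the User-Agent, what else differs between the browser and the Python script? I tried using Firebug, but nothing stood out (although I am a noob). Thanks!

EDIT: The request headers from Firebug and from my script are below.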

import mechanize
import re
import cookielib

# Browser
br = mechanize.Browser()

# Cookie Jar
cj = cookielib.LWPCookieJar()
br.set_cookiejar(cj)

# Browser options
br.set_handle_equiv(True)
br.set_handle_redirect(True)
br.set_handle_referer(True)
br.set_handle_robots(False)

# Follows refresh 0 but not hangs on refresh > 0
br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)

# Want debugging messages?
br.set_debug_http(True)
br.set_debug_redirects(True)
br.set_debug_responses(True)

# User-Agent (this is cheating, ok?)
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]

# Open some site, let's pick a random one, the first that pops in mind:
r = br.open('http://translate.google.com/?text=test&sl=en&tl=es')
html = r.read()
match = re.findall(r'verb', html)

print match

Firebug:

GET /?text=test&sl=en&tl=es HTTP/1.1

Accept  text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Charset  ISO-8859-1,utf-8;q=0.7,*;q=0.7
Accept-Encoding gzip, deflate
Accept-Language en-us,en;q=0.5
Connection  keep-alive
Cookie  PREF=ID=298b435815ef8553:U=e7dad4baf65f083b:FF=0:LD=en:CR=2:TM=1327516863:LM=1339428154:S=maktYFZEHXXpMDFg; NID=60=U229h4lzOnjpHyidbhgYecCx72Myp_-XHgupW-R_mWtpuOveDdIOO1uLBq-6ltn-ER15ppJryR7yYOYEhkCfUCl45qNz5aymBQ1CGDHS4UcHu2oIDYAHut0ctnlL76eDW3n7kjOWoz5wNH6NMw
Host    translate.google.com
User-Agent  Mozilla/5.0 (Windows NT 6.1; WOW64; rv:9.0) Gecko/20100101 Firefox/9.0

Script:

send: 'GET /?text=test&sl=en&tl=es HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: translate.google.com\r\nConnection: close\r\nUser-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1\r\n\r\n'
reply: 'HTTP/1.1 200 OK\r\n'
header: Date: Mon, 11 Jun 2012 16:13:42 GMT
header: Expires: Fri, 01 Jan 1990 00:00:00 GMT
header: Cache-Control: no-cache, must-revalidate
header: Pragma: no-cache
header: X-Frame-Options: SAMEORIGIN
header: Content-Type: text/html; charset=UTF-8
header: Content-Language: zh
header: Set-Cookie: PREF=ID=6dd42f2264250d7c:TM=1333431222:LM=1339454222:S=k6JXSoGGaAMNmPEo; expires=Wed, 11-Jun-2014 16:13:42 GMT; path=/; domain=.google.com
header: Set-Cookie: NID=60=f8czmR413h3sKUGJUUM4PLKl2O7SUtqfW5hss5O54sRKoErf9wIEU4Wu2WCuHzWTJQ3p1Rj7dQv1B4BBmSMY1tmfus7UZGCYFIKaXoKwklZ9tZsr5vds8vvvFjRdZyevn; expires=Tue, 11-Dec-2012 16:13:42 GMT; path=/; domain=.google.com; HttpOnly
header: P3P: CP="This is not a P3P policy! See http://www.google.com/support/accounts/bin/answer.py?hl=en&answer=151657 for more info."
header: X-Content-Type-Options: nosniff
header: Server: HTTP server (unknown)
header: X-XSS-Protection: 1; mode=block
header: Connection: close
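
To make the comparison concrete, the two request header sets shown above can be diffed mechanically. This is only a minimal sketch: the header values are copied from the Firebug dump and the script's `send:` line, the Cookie value is abbreviated, and `diff_headers` is a hypothetical helper name, not part of urllib2 or mechanize. It compares header names case-sensitively, which is enough for these two dumps.

```python
# Request headers copied from the two dumps above.
# The Cookie value is abbreviated here; only its presence matters for the diff.
browser_headers = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Charset": "ISO-8859-1,utf-8;q=0.7,*;q=0.7",
    "Accept-Encoding": "gzip, deflate",
    "Accept-Language": "en-us,en;q=0.5",
    "Connection": "keep-alive",
    "Cookie": "PREF=...; NID=...",
    "Host": "translate.google.com",
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:9.0) "
                  "Gecko/20100101 Firefox/9.0",
}

script_headers = {
    "Accept-Encoding": "identity",
    "Host": "translate.google.com",
    "Connection": "close",
    "User-Agent": "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) "
                  "Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1",
}


def diff_headers(reference, candidate):
    """Return (missing, changed) header names, each sorted alphabetically.

    missing: names present in `reference` but absent from `candidate`.
    changed: names present in both, but with different values.
    """
    missing = sorted(set(reference) - set(candidate))
    changed = sorted(
        name
        for name in set(reference) & set(candidate)
        if reference[name] != candidate[name]
    )
    return missing, changed


missing, changed = diff_headers(browser_headers, script_headers)
print("only sent by the browser:", missing)
print("sent by both, different value:", changed)
```

A diff like this only covers the initial page request; it says nothing about any further requests the browser makes after the page has loaded.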

【Comments】:

    Tags: python mechanize urllib2


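    【Solution 1】:

    The verbs and adjectives cannot be found because they are loaded through an AJAX call. Your mechanize browser has no JavaScript engine, so it cannot make any AJAX requests. However, if you look at your browser's inspector (or a similar tool), you can see the headers, URL and parameters of that call. All that is left is to mimic the call.

    I curled it and got a JSON response: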

    thrustmaster@thrustmaster:~/Temp$ curl 'http://translate.google.com/translate_a/t?client=t&text=test&hl=en&sl=en&tl=es&multires=1&ssel=0&tsel=0&sc=1' -H 'User-Agent: blah'
    [[["prueba","test","",""]],[["noun",["prueba","ensayo","test","examen","análisis","criterio","toque","ejercicio","tanteo"],[["prueba",["test","proof","evidence","trial","event","race"]],["ensayo",["test","trial","essay","assay","testing","rehearsal"]],["test",["test"]],["examen",["examination","review","exam","test","inspection","quiz"]],["análisis",["analysis","test","review","assay","breakdown"]],["criterio",["criterion","judgment","standard","test","view","yardstick"]],["toque",["touch","stroke","test","knock","blast","chime"]],["ejercicio",["exercise","practice","drill","practicing","test","prosecution"]],["tanteo",["score","scoring","trial","test","try","calculation"]]]],["adjective",["de prueba"],[["de prueba",["test","testing","trial","probationary","corrective"]]]],["verb",["probar","comprobar","ensayar","examinar","poner a prueba","experimentar","someter a prueba","interrogar","hacer investigaciones","justificar","graduar"],[["probar",["test","try","prove","taste","try out","sample"]],["comprobar",["check","test","prove","ascertain","make sure","substantiate"]],["ensayar",["test","try","rehearse","try out","assay","essay"]],["examinar",["examine","consider","review","look at","explore","test"]],["poner a prueba",["test","try","try out","prove","tempt","put through his paces"]],["experimentar",["experience","experiment","undergo","experiment with","feel","test"]],["someter a prueba",["test","try out","touch"]],["interrogar",["question","interrogate","examine","cross-examine","ask","test"]],["hacer investigaciones",["test"]],["justificar",["justify","warrant","substantiate","prove","make good","test"]],["graduar",["graduate","grade","calibrate","time","test"]]]]],"en",,[["prueba",[5],1,0,1000,0,1,0]],[["test",4,,,""],["test",5,[["prueba",1000,1,0],["prueba de",0,1,0],["ensayo",0,1,0],["de prueba",0,1,0],["test",0,1,0]],[[0,4]],"test"]],,,[["en"]],5]
    

    So in your script you would probably have to fetch the response from the following URL instead:

    http://translate.google.com/translate_a/t?client=t&text=test&hl=en&sl=en&tl=es&multires=1&ssel=0&tsel=0&sc=1
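
As a rough illustration of what "mimicking the call" could look like, here is a minimal sketch in Python 3 (`urllib.request` is the Python 3 counterpart of `urllib2`). Everything in it rests on assumptions: the query parameters are simply the ones from the curl command above, the endpoint is undocumented and may no longer answer this way, and the response layout (main translation first, then one entry per part of speech) is inferred only from the sample output shown above. The function names are hypothetical. Note that the response is not strict JSON, because empty array slots such as `"en",,[` are left blank, so the sketch fills them with `null` before parsing.

```python
import json
import urllib.request
from urllib.parse import urlencode


def build_translate_url(text, source="en", target="es"):
    """Build the URL with the same parameters as the curl command above."""
    params = [
        ("client", "t"), ("text", text), ("hl", "en"),
        ("sl", source), ("tl", target),
        ("multires", "1"), ("ssel", "0"), ("tsel", "0"), ("sc", "1"),
    ]
    return "http://translate.google.com/translate_a/t?" + urlencode(params)


def fetch_raw(text, source="en", target="es"):
    """Fetch the raw response body (not called here; the endpoint may be gone)."""
    request = urllib.request.Request(
        build_translate_url(text, source, target),
        headers={"User-Agent": "blah"},  # same placeholder as the curl call
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def fill_empty_slots(raw):
    """Insert `null` into empty array slots (e.g. `1,,2`), skipping strings."""
    out = []
    in_string = False
    escaped = False
    prev = None  # last non-space character seen outside a string
    for ch in raw:
        if in_string:
            out.append(ch)
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
                prev = ch
            continue
        if ch in (",", "]") and prev in ("[", ",") and (prev, ch) != ("[", "]"):
            out.append("null")
        out.append(ch)
        if ch == '"':
            in_string = True
        elif not ch.isspace():
            prev = ch
    return "".join(out)


def parse_response(raw):
    """Parse the almost-JSON response into nested Python lists."""
    return json.loads(fill_empty_slots(raw))


def summarize(data):
    """Return (main translation, {part of speech: [translations]})."""
    main = data[0][0][0]
    by_pos = {entry[0]: entry[1] for entry in (data[1] or [])}
    return main, by_pos


# Shortened sample with the same shape as the response shown above.
SAMPLE = (
    '[[["prueba","test","",""]],'
    '[["noun",["prueba","ensayo"],[["prueba",["test","proof"]]]],'
    '["verb",["probar"],[["probar",["test","try"]]]]],'
    '"en",,[["en"]],5]'
)

if __name__ == "__main__":
    print(summarize(parse_response(SAMPLE)))
```

The parsing is kept separate from the fetching so it can be tried against a saved response without touching the network.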

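    PS:

    As you said, if you plan to actually use this script, it could be a TOS issue. Using the API is always the better option. The HTML you rely on can change at any time.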

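    【Comments】:

    • Thanks a lot! While we're at it, is there any simple way to load a page from Python the way a browser does, AJAX and all?
    • I once used it as a testing tool for a web application, but it felt like it is mainly meant for GUI testing.
    • @Thrustmaster Those official APIs are paid. That is unacceptable. I don't run a business; I'm a casual user who uses the CLI rather than a web browser. But who cares about minorities, right...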