
【Question title】: Use mechanize and beautiful soup on a webpage that did not enable javascript
【Posted】: 2015-08-17 22:18:00
【Question】:

I am trying to scrape a webpage, but it requires me to log in first. I am new to web scraping, so please bear with my code:

import urllib
import urllib2
from bs4 import BeautifulSoup
import mechanize

browser = mechanize.Browser()
browser.addheaders = [('User-agent', 'Mozilla/5.0')]
browser.set_handle_robots(False)
browser.open('https://mywebsite.com')
# browser.select_form(name = 'form2')
# browser.form['Account Name'] = 'username'
# browser.form['Password'] = 'mypassword'
# browser.submit()

soup = BeautifulSoup(browser.response().read())
print soup

But I get this error:

<html><head><script language="javascript">
<!--.
    .
    .
</script>
<noscript>
<title>No JavaScript Error</title>
<body>
<h3 align="center">Your Browser does not support JavaScript, or it is disabled.<br/>To run this application, you must enable JavaScript!!</h3>
</body></noscript></head></html>

【Comments】:

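I don't think this is an error: many pages include a noscript version of their markup. It probably gets replaced by the script if the script runs.
What can I do to work around this?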

Tags: javascript python html beautifulsoup mechanize

【Solution 1】:

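Try using the following User-Agent header. The server may not recognize the one you are sending, which could lead it to assume you don't have JavaScript enabled: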

Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.124 Safari/537.36

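Note: some websites have anti-scraping protection, and you have to solve a JavaScript puzzle before you get the actual content. You can use Js2Py or any other JavaScript runtime for that. Scraping such sites is much harder, but fortunately few websites use this kind of system.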
