How to scrape search results that are returned by JavaScript (using Python): answers

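【Question title】: how to scrape search results if returned in javascript (using python)
【Posted】: 2014-05-02 16:28:28
【Question】:

The website I want to scrape fills in its search results with JavaScript.

Can I simply call the script somehow and use its results? (Without pagination, of course.) I would rather not run the whole thing just to scrape the generated, formatted HTML, but the raw source is blank.

Take a look: http://kozbeszerzes.ceu.hu/searchresults.xhtml?q=1998&page=0

The source that comes back is simply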

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="/templates/base_template.xsl"?>
<content>
  <head>
    <SCRIPT type="text/javascript" src="/js/searchResultsView.js"></SCRIPT>    
  </head>
    <whitebox>
    <div id = "hits"></div>  
  </whitebox>
</content>

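I would prefer a simple Python tool.

【Comments on the question】:

I am only just looking into this myself, but try PhantomJS and Selenium WebDriver. I will do my best to give you an answer.

Tags: javascript python web-scraping

【Solution 1】:

I downloaded Selenium and ChromeDriver.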

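from selenium import webdriver
from selenium.webdriver.common.by import By

# Start Chrome through ChromeDriver and load the search results page.
driver = webdriver.Chrome()
driver.get('http://kozbeszerzes.ceu.hu/searchresults.xhtml?q=1998&page=0')

# Each hit is an element with class "result"; print its link text and URL.
# Selenium 4 removed the find_element(s)_by_* helpers, hence find_elements(By..., ...).
for e in driver.find_elements(By.CLASS_NAME, 'result'):
    link = e.find_element(By.TAG_NAME, 'a')
    print(link.text, link.get_attribute('href'))

driver.quit()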

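If you use Chrome, you can press F12 to inspect the page's elements, which is very useful.

【Comments】:

【Solution 2】:

Indeed, you can do this with Python. You need python-ghost or Selenium. I prefer the latter combined with PhantomJS: it is lighter, simpler to install, and easy to use.

Install PhantomJS with npm (the Node package manager):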

apt-get install nodejs
npm install phantomjs

Install Selenium:

pip install selenium

Fetch the results page like this, then parse it as usual with BeautifulSoup (or another library):

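from bs4 import BeautifulSoup as bs
from selenium import webdriver

# Note: PhantomJS support was removed in Selenium 4, so this needs an older Selenium release.
client = webdriver.PhantomJS()
client.get("http://foo")
# page_source holds the HTML of the page as currently loaded in the browser.
soup = bs(client.page_source, "html.parser")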
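Once the rendered HTML is available as a string (for example from `client.page_source`), pulling out the hits is ordinary parsing. The sketch below is one possible way to do it, and it swaps BeautifulSoup for the standard library's `html.parser` so that it runs without extra installs. It assumes each hit sits in an element with class `result`, as in Solution 1; the `sample` markup, the class `ResultLinkParser`, and the helper `extract_result_links` are made up for illustration and are not taken from the real site.

```python
from html.parser import HTMLParser


class ResultLinkParser(HTMLParser):
    """Collect (text, href) pairs for <a> tags nested inside class="result" elements."""

    # Void elements have no closing tag, so they are ignored for depth tracking.
    VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}

    def __init__(self):
        super().__init__()
        self.links = []      # finished (text, href) pairs
        self._depth = 0      # nesting depth inside the current "result" element
        self._href = None    # href of the link being read, or None
        self._text = []      # text fragments of the link being read

    def handle_starttag(self, tag, attrs):
        if tag in self.VOID_TAGS:
            return
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        # Enter a result element, or go one level deeper inside one.
        if self._depth or "result" in classes:
            self._depth += 1
        # Start recording a link (anchors without an href are skipped).
        if self._depth and tag == "a" and self._href is None:
            self._href = attrs.get("href")
            self._text = []

    def handle_endtag(self, tag):
        if tag in self.VOID_TAGS:
            return
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None
        if self._depth:
            self._depth -= 1

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)


def extract_result_links(html):
    """Return the (text, href) pairs found inside class="result" elements."""
    parser = ResultLinkParser()
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical rendered markup, only to show the call; the real page may differ.
sample = (
    '<div id="hits">'
    '<div class="result"><a href="/entity/1">First hit</a></div>'
    '<p><a href="/about">About</a></p>'
    '<div class="result"><span><a href="/entity/2">Second <b>hit</b></a></span></div>'
    '</div>'
)

for text, href in extract_result_links(sample):
    print(text, href)
```

With BeautifulSoup installed, roughly the same selection is usually written as a CSS selector on the `soup` object; the standard-library version is shown only to keep the example self-contained.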

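【Comments】:

【Solution 3】:

In short: you cannot do this with Python alone.

As you said, the page is populated by JavaScript (jQuery), which adds the content on the fly.

You could try running the script locally with Node.js and, at some point, dumping the DOM as HTML. Either way, you will need to dig into the JS code.

【Comments】:

Thanks. Could you then help me (or help me rephrase the question) with how to run the right JavaScript, for example through an AppleScript call ("tell application Google Chrome to execute ....js", but how exactly)? If you look at the .js file, I would be happy with what it returns in "resp"; without pagination I only need to run it once per year for 1998-2014.
Node.js is a JS interpreter; you can install it and use it to run JS scripts. Take a look: it is no harder to use than the Python shell/interpreter.
Will do. I am not sure how to specify the arguments for this remote function, which was built to handle queries from the page that includes it. Thanks!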
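Whichever way the page ends up being rendered, the asker's plan of one query per year from 1998 to 2014 comes down to generating one URL per year. The following is a minimal sketch of that step, assuming the `q` and `page` query parameters behave as in the example URL from the question; the helper name `build_search_urls` is hypothetical.

```python
from urllib.parse import urlencode

# Base address taken from the example URL in the question.
BASE_URL = "http://kozbeszerzes.ceu.hu/searchresults.xhtml"


def build_search_urls(first_year=1998, last_year=2014, page=0):
    """Return one search URL per year, both end years included."""
    return [
        f"{BASE_URL}?{urlencode({'q': year, 'page': page})}"
        for year in range(first_year, last_year + 1)
    ]


urls = build_search_urls()
for url in urls:
    print(url)
```

Each URL can then be passed to `driver.get(...)` or `client.get(...)` as in the Selenium answers above.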