BeautifulSoup: scraping a table by class attribute -- why don't I get any data? Answers

【Question title】: BeautifulSoup: scraping a table by class attribute -- why don't I get any data?
【Posted】: 2014-07-24 12:26:51
【Question】:

I am trying to use BeautifulSoup to scrape the ticker symbols listed here. So far I have tried the following:

import urllib
import BeautifulSoup
import re

url  = r'https://investor.vanguard.com/mutual-funds/vanguard-mutual-funds-list'
html = urllib.urlopen(url).read()
soup = BeautifulSoup.BeautifulSoup(html)

table = soup.findAll('td', attrs = {'class': re.compile(r'\bticker left\b')})

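However, this returns nothing. Can someone explain why I can't get all the td tags with this class attribute? The HTML suggests it should be possible, and fairly painless. For example: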

<td class="ticker left">VUSXX              </td>

Thanks.

【Question comments】:

Tags: python beautifulsoup screen-scraping

【Solution 1】:

Continuing from my comment above... you can use the following URL, which returns the required data (found with the Firefox extension Live HTTP Headers):

https://api.vanguard.com/rs/ire/02/ind/mf/month-end.jsonp?callback=callback
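A response from a `.jsonp` endpoint like this one is normally JSON wrapped in a function call, so the wrapper has to be removed before `json` can decode it. The following is only a minimal sketch of that step: `parse_jsonp` and `fetch_jsonp` are hypothetical helper names, and the sample payload and its field names are made up for illustration, since the real structure of the Vanguard response is not shown in this thread.

```python
import json
import re
import urllib.request


def parse_jsonp(text):
    """Strip a JSONP wrapper such as callback({...}); and decode the JSON inside."""
    match = re.match(r"^\s*[\w$.]+\s*\((.*)\)\s*;?\s*$", text, re.DOTALL)
    if match is None:
        raise ValueError("response does not look like JSONP")
    return json.loads(match.group(1))


def fetch_jsonp(url):
    """Download a JSONP document and return the decoded payload (not called here)."""
    with urllib.request.urlopen(url) as response:
        return parse_jsonp(response.read().decode("utf-8"))


# Made-up payload: the real field names may differ.
sample = 'callback({"fund": [{"ticker": "VUSXX"}]});'
data = parse_jsonp(sample)
print(data["fund"][0]["ticker"])
```

With network access, `fetch_jsonp` could be pointed at the URL above; the decoded result would then need to be inspected to find where the ticker symbols live.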

You can also use Selenium with the Firefox browser.

1) Install Selenium IDE http://docs.seleniumhq.org/download/

2) Install the Selenium Python module https://pypi.python.org/pypi/selenium

Then you can use the following script. It opens a Firefox browser and fetches the result.

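from selenium import webdriver
from bs4 import BeautifulSoup  # use bs4 from now on

# Launch Firefox and load the page so that its JavaScript can run
browser = webdriver.Firefox()
browser.get('https://investor.vanguard.com/mutual-funds/vanguard-mutual-funds-list')

# page_source is the DOM as currently rendered by the browser
html = browser.page_source
browser.quit()

# Name the parser explicitly to avoid bs4's "no parser specified" warning
soup = BeautifulSoup(html, 'html.parser')

mydata = soup.find_all('tr')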

You can then find what you need in mydata.

【Comments】:

【Solution 2】:

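That is because the page you are reading is loaded dynamically via AJAX. The HTML that your script downloads and hands to Beautiful Soup therefore does not include the data that AJAX loads afterwards. You can use Mechanize (a browser in Python) together with BeautifulSoup to do this.

Alternatively, you can copy the HTML of the page after the AJAX call has completed and then parse it with BeautifulSoup.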
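As a rough illustration of the second option, the sketch below parses a small HTML string standing in for the markup copied from the browser once the table has loaded. It assumes the bs4 package is installed. Only the `VUSXX` cell comes from the question; the rest of the sample table, including the second ticker and the fund names, is made up.

```python
from bs4 import BeautifulSoup

# Made-up stand-in for HTML copied from the browser after the AJAX call finished.
saved_html = """
<table>
  <tr><td class="ticker left">VUSXX              </td><td class="name">Fund A</td></tr>
  <tr><td class="ticker left">ABCDX</td><td class="name">Fund B</td></tr>
</table>
"""

soup = BeautifulSoup(saved_html, "html.parser")

# class_="ticker" matches any td whose class list contains "ticker"
cells = soup.find_all("td", class_="ticker")

# get_text(strip=True) drops the trailing padding seen in the question's example
tickers = [cell.get_text(strip=True) for cell in cells]
print(tickers)
```

In practice the string would be read from the saved file instead of being written inline.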

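【Comments】:

Can you explain the latter point? Do you mean saving the data to a text file? Also, how do you know it is loaded dynamically via AJAX?
Yes, save the data to a text file. I simply printed the html variable from your code and it contained no table data; then I visited the page and saw that the data loads after the loading GIF finishes.
This URL, api.vanguard.com/rs/ire/02/ind/mf/…, contains all the required data (everything), in what I believe is JSON format. I used the Firefox extension Live HTTP Headers to capture the activity and found the URL above.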
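The check described in the comments above (print the downloaded HTML and see whether the table data is there) can also be automated. This is only a sketch: `has_ticker_cells` is a hypothetical helper, and both HTML strings are invented stand-ins, one for the raw download that still shows a loading image and one for the page after the data has arrived.

```python
import re

# Matches an opening <td ...> tag whose class attribute contains the word "ticker"
TICKER_CELL = re.compile(r'<td[^>]*\bclass="[^"]*\bticker\b[^"]*"', re.IGNORECASE)


def has_ticker_cells(html):
    """Return True if the HTML text contains at least one ticker cell."""
    return TICKER_CELL.search(html) is not None


# Invented examples of what each stage might look like.
static_html = '<html><body><div id="fundList"><img src="loading.gif"></div></body></html>'
rendered_html = '<table><tr><td class="ticker left">VUSXX              </td></tr></table>'

print(has_ticker_cells(static_html))    # False
print(has_ticker_cells(rendered_html))  # True
```

If the check fails on the downloaded HTML but the cells are visible in the browser, the data is most likely being filled in by JavaScript after the initial page load.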