Why is BeautifulSoup not finding a specific table class? [closed]

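【Question title】: Why is BeautifulSoup not finding a specific table class? [closed]
【Posted】: 2014-03-06 11:18:29
【Question】:

I'm using Beautiful Soup to try to scrape the commodities table from Oil-Price.net. I can find the first div, the table, the table body, and the rows of the table body. But there is a column in one of the rows that I cannot find with Beautiful Soup. When I tell Python to print all the tables in that particular row, it doesn't show the one I want. Here is my code: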

from urllib2 import urlopen
from bs4 import BeautifulSoup

html = urlopen('http://oil-price.net').read()
soup = BeautifulSoup(html)

div = soup.find("div",{"id":"cntPos"})
table1 = div.find("table",{"class":"cntTb"})
tb1_body = table1.find("tbody")
tb1_rows = tb1_body.find_all("tr")
tb1_row = tb1_rows[1]
td = tb1_row.find("td",{"class":"cntBoxGreyLnk"})
print td

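All it prints is None. I even tried printing every row to see whether I could find the column manually, and found nothing. It shows the others, but not the one I want.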

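【Comments】:

"But there is a column in one of the rows that I cannot find": which column, which row?
@KaranGoel: A URL is included, and I was able to reproduce the problem.
Possible duplicate of "Missing parts on Beautiful Soup results"

Tags: python web-scraping beautifulsoup

【Answer 1】:

The page uses broken HTML, and different parsers try to repair it in different ways. Install the lxml parser, which parses this page better: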

>>> BeautifulSoup(html, 'html.parser').find("div",{"id":"cntPos"}).find("table",{"class":"cntTb"}).tbody.find_all("tr")[1].find("td",{"class":"cntBoxGreyLnk"}) is None
True
>>> BeautifulSoup(html, 'lxml').find("div",{"id":"cntPos"}).find("table",{"class":"cntTb"}).tbody.find_all("tr")[1].find("td",{"class":"cntBoxGreyLnk"}) is None
False

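This does not mean that lxml will handle all broken HTML better than the other parser options. See also html5lib, a pure-Python implementation of the WHATWG HTML spec, which therefore comes closer to how current browser implementations handle broken HTML.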
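
To see how the installed parsers treat a given piece of markup, a small harness along these lines may help. It is only a sketch: `BROKEN_HTML` is a made-up sample (not taken from oil-price.net), `cell_found_by_parser` is a hypothetical helper name, and on a sample this small the parsers may well agree with each other. It is written for Python 3, and parsers that are not installed are skipped.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Deliberately malformed markup (hypothetical sample, not from the real page):
# a stray </div> appears before the cell we are looking for.
BROKEN_HTML = """
<table class="cntTb"><tr><td>first</td></div>
<td class="cntBoxGreyLnk">wanted</td></tr></table>
"""

def cell_found_by_parser(html, parsers=("html.parser", "lxml", "html5lib")):
    """Map each installed parser name to whether it finds the target cell."""
    results = {}
    for name in parsers:
        try:
            soup = BeautifulSoup(html, name)
        except FeatureNotFound:
            # lxml and html5lib are optional third-party packages.
            continue
        results[name] = soup.find("td", {"class": "cntBoxGreyLnk"}) is not None
    return results

for name, found in cell_found_by_parser(BROKEN_HTML).items():
    print(name, "found the cell" if found else "did not find the cell")
```

Running the same function on the real page's HTML is the quickest way to check which parser suits it.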

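【Comments】:

BeautifulSoup(html, 'html5lib') solved my problem.
@Barny: Yes; again, different parsers can handle broken HTML in different ways. html5lib implements the choices that most modern browsers follow very closely, but because it is implemented in pure Python it is also slower.

【Answer 2】:

Look at the page source: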

<td class="cntBoxGreyLnk" rowspan="2" valign="top">
    <script type="text/javascript" src="http://www.oil-price.net/COMMODITIES/gen.php?lang=en"></script>
    <noscript> To get live <a href="http://www.oil-price.net/dashboard.php?lang=en#COMMODITIES">gold, oil and commodity price</a>, please enable Javascript.</noscript>

The data you want is loaded into the page dynamically; you cannot get it with BeautifulSoup because it does not exist in the HTML.
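
A quick way to confirm this for a given cell is to list its external scripts and check whether it contains a table at all. The sketch below assumes Python 3 and the built-in `html.parser`; `CELL_HTML` is the snippet above with a closing `</td>` added for the example, and `external_scripts` and `has_table` are hypothetical helper names.

```python
from bs4 import BeautifulSoup

# The cell as it appears in the downloaded page source (copied from above;
# the closing </td> is added here so the sample is complete).
CELL_HTML = """
<td class="cntBoxGreyLnk" rowspan="2" valign="top">
    <script type="text/javascript" src="http://www.oil-price.net/COMMODITIES/gen.php?lang=en"></script>
    <noscript> To get live <a href="http://www.oil-price.net/dashboard.php?lang=en#COMMODITIES">gold, oil and commodity price</a>, please enable Javascript.</noscript>
</td>
"""

def external_scripts(html):
    """Return the src URLs of all external <script> tags in html."""
    soup = BeautifulSoup(html, "html.parser")
    return [tag["src"] for tag in soup.find_all("script", src=True)]

def has_table(html):
    """Return True if the markup contains a <table> element."""
    soup = BeautifulSoup(html, "html.parser")
    return soup.find("table") is not None

print(external_scripts(CELL_HTML))  # the script that would write the table
print(has_table(CELL_HTML))         # no table exists in the static markup
```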

If you look at the linked script URL, http://www.oil-price.net/COMMODITIES/gen.php?lang=en, you will see a bunch of javascript like this:

document.writeln('<table summary=\"Crude oil and commodity prices (c) http://oil-price.net\" style=\"font-family: Lucida Sans Unicode, Lucida Grande, Sans-Serif; font-size: 12px; background: #fff; border-collapse: collapse; text-align: left; border-color: #6678b1; border-width: 1px 1px 1px 1px; border-style: solid;\">');
document.writeln('<thead>');
/* ... */
document.writeln('<tr>');
document.writeln('<td style=\"font-size: 12px; font-weight: bold; border-bottom: 1px solid #ccc; color: #1869bd; padding: 2px 6px; white-space: nowrap;\">');
document.writeln('<a href=\"http://oil-price.net/dashboard.php?lang=en#COMMODITIES\"  style=\"color: #1869bd; text-decoration:none\">Heating Oil<\/a>');
document.writeln('<\/td>');
document.writeln('<td style=\"font-size: 12px; font-weight: normal; border-bottom: 1px solid #ccc; color: #000000; padding: 2px 6px; white-space: nowrap;\">');
document.writeln('3.05');
document.writeln('<\/td>');
document.writeln('<td style=\"font-size: 12px; font-weight: normal; border-bottom: 1px solid #ccc; color: green;    padding: 2px 6px; white-space: nowrap;\">');
document.writeln('+1.81%');
document.writeln('<\/td><\/tr>');

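When the page loads, this javascript runs and dynamically writes in the values you are looking for. (By the way: this is a thoroughly obsolete, deprecated, and generally horrible way of doing things; I can only assume someone thought it was an extra bit of security. They should be punished for their temerity!)

Now, this code is pretty simple; you could probably get the html data with a regular expression. But (a) there are some escape codes that could cause problems, (b) there is no guarantee that they won't obfuscate their code in the future, and (c) where is the fun in that?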
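
For comparison, the regular-expression route might look roughly like the sketch below. It assumes every call has the form document.writeln('...'); with a single-quoted argument, and it handles only escapes that stand for the escaped character itself (such as \" and \/). `WRITELN_RE`, `extract_written_html`, and `SAMPLE_JS` are names invented for this example, and the sample is a shortened imitation of the script, not the real thing.

```python
import re

# Capture the single-quoted argument of each document.writeln('...'); call.
# Backslash escapes inside the argument are allowed by the (\\.) alternative.
WRITELN_RE = re.compile(r"document\.writeln\('((?:[^'\\]|\\.)*)'\);")

def extract_written_html(js_source):
    """Join the string arguments of every document.writeln() call."""
    lines = []
    for match in WRITELN_RE.finditer(js_source):
        # Drop the backslash from escapes such as \" and \/ .
        # Escapes like \n are NOT interpreted; that is a known limitation.
        lines.append(re.sub(r"\\(.)", r"\1", match.group(1)))
    return "\n".join(lines)

# Shortened, hypothetical imitation of the script served by the site.
SAMPLE_JS = r"""
document.writeln('<tr>');
document.writeln('<td style=\"color: #000000;\">');
document.writeln('3.05');
document.writeln('<\/td><\/tr>');
"""

print(extract_written_html(SAMPLE_JS))
```

This breaks as soon as the script is obfuscated or builds its strings differently, which is point (b) above.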

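The PyV8 module provides a way to execute javascript code directly from Python, and even lets us write Python code that javascript can call! We will take advantage of this to get the data in an obfuscation-proof way: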

import PyV8
import requests
from bs4 import BeautifulSoup

SCRIPT = "http://www.oil-price.net/COMMODITIES/gen.php?lang=en"

class Document:
    def __init__(self):
        self.lines = []

    def writeln(self, s):
        self.lines.append(s)

    @property
    def content(self):
        return '\n'.join(self.lines)

class DOM(PyV8.JSClass):
    def __init__(self):
        self.document = Document()

def main():
    # Create a javascript context which contains
    #   a document object having a writeln method.
    # This allows us to capture the calls to document.writeln()
    dom  = DOM()
    ctxt = PyV8.JSContext(dom)
    ctxt.enter()

    # Grab the javascript and execute it
    js = requests.get(SCRIPT).content
    ctxt.eval(js)

    # The result is the HTML code you are looking for
    html = dom.document.content

    # html is now "<table> ... </table>" containing the data you are after;
    # you can go ahead and finish parsing it with BeautifulSoup
    tbl = BeautifulSoup(html, 'html.parser')
    for row in tbl.find_all('tr'):
        print(' / '.join(td.text.strip() for td in row.find_all('td')))

if __name__ == "__main__":
    main()

This results in:

Crude Oil / 99.88 / +2.04%
Natural Gas / 4.78 / -3.27%
Gasoline / 2.75 / +2.40%
Heating Oil / 3.05 / +1.81%
Gold / 1263.30 / +0.45%
Silver / 19.92 / +0.06%
Copper / 3.27 / +0.37%

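which is the data you wanted.

Edit: I really can't dumb it down any further; this is the bare minimum of code that does the job. But maybe I can explain better how it works (it really isn't as scary as it looks!):

The PyV8 module wraps Google's V8 javascript interpreter in a way that Python can interact with. You will need to go to https://code.google.com/p/pyv8/downloads/list to download and install the appropriate version before you can run my code.

The javascript language by itself knows nothing about how to interact with the outside world; it has no built-in input or output methods. That is not very useful. To get around this, we can pass in a "context object" that contains information about the outside world and how to interact with it. When javascript runs in a web browser, it gets a context object that provides all sorts of information about the browser and the current web page, and how to interact with them.

The javascript code at http://www.oil-price.net/COMMODITIES/gen.php?lang=en assumes it will run in a browser, where the context has a "document" object representing the web page, and that object has a "writeln" method that appends text to the end of the current web page. As the page loads, the script is loaded and run; it writes text (which happens to be valid HTML) into the page; this gets rendered as part of the page and ends up as the Commodities table you want. You can't get the table with BeautifulSoup because the table does not exist until the javascript runs, and BeautifulSoup does not load or run javascript.

We want to run the javascript; to do that, we need a fake browser context that has a "document" object with a "writeln" method. Then we need to store the information passed to "writeln", and we need a way to get it back when the script has finished. My DOM class is the fake browser context; when it is instantiated (i.e. when we create one), it gives itself a Document object called document, which has a writeln method. When document.writeln is called, it appends the line of text to document.lines, and at any time we can read document.content to get back all the text written so far.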
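
The capturing part can be tried on its own, without PyV8, by calling writeln by hand the way the downloaded script would. This is only an illustration in plain Python 3: the three lines written here are made up for the example and merely imitate what the real script emits.

```python
class Document:
    """Stand-in for the browser's document object; records writeln() calls."""

    def __init__(self):
        self.lines = []

    def writeln(self, s):
        # Store the text instead of rendering it into a page.
        self.lines.append(s)

    @property
    def content(self):
        # Everything written so far, one writeln() call per line.
        return "\n".join(self.lines)

# Simulate by hand what the script would do when it is executed.
document = Document()
document.writeln("<table>")
document.writeln("<tr><td>Heating Oil</td><td>3.05</td></tr>")
document.writeln("</table>")

print(document.content)
```

In the real program the javascript interpreter makes these calls instead, which is why the object has to be exposed to it through the context.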

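Now: action! In the main function, we create a fake browser context, set it as the interpreter's current context, and start the interpreter. We grab the javascript code and tell the interpreter to evaluate (i.e. run) it. (Source-code obfuscation, which would mess up static analysis, does not affect us, because the code has to produce good output when it runs, and we are actually running it!) Once the code has finished, we get the final output from the context's document; this is the table html you couldn't get. We pass that back to BeautifulSoup to extract the data, then print the data.

Hope that helps!

【Comments】:

I hate to say it, but is there any way you could dumb this down for a newbie? I haven't gotten into classes/objects/methods yet.
The problem is not dynamic loading; the table is right there. But the HTML is broken, and the standard html.parser parser cannot repair the damage.
@MartijnPieters: Did you actually try this? If you load the page oil-price.net in a browser and then use "view source", the table is there, because the javascript has already run. If you load it with urllib3 or requests, it is not, because it hasn't. Please check for yourself before "correcting" me: import urllib2, html = urllib2.urlopen('http://oil-price.net').read().splitlines(), print('\n'.join(html[588:594])).
@HughBothwell: I did bother to try it, yes. You can see that in my answer above. There may be more content being loaded, but the OP's original problem was caused by the HTML not being parsed correctly.
@MartijnPieters: The problem is dynamic loading, because your answer gets a table with no data.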