Cannot find an element with requests/BeautifulSoup

【Question title】: Cannot find an element with requests/BeautifulSoup
【Posted】: 2013-11-17 18:18:34
【Question】:

I wrote a web scraper with requests and BeautifulSoup, but there is one element in the DOM that I cannot find.

Here is what I did:

import requests
from bs4 import BeautifulSoup
r = requests.get('http://www.decitre.fr/rechercher/result/?q=victor+hugo&search-scope=3')
soup = BeautifulSoup(r.text)

The element I cannot find is the "old-price" (the struck-through one), which I can see when I inspect the DOM with the browser's developer tools.

soup.find_all(class_='old-price') # returns [], no matter if I specify "span"

I also cannot find the string "old-price" in the soup or in the response from requests:

'old-price' in soup.text # False
'old-price' in r.text # False

I do not see it either when I fetch the source with wget.

I can get its parent div, but I cannot find the price child inside it:

commands = soup.find_all(class_='product_commande')
commands[0].find_all('old-price') # []

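So I have no idea what is going on. What am I missing?

Am I using requests/BeautifulSoup incorrectly? (I am not sure whether r.text returns the full HTML.)
Is that part of the HTML generated by JavaScript code? If so, how can I tell, and is there a way to get the complete HTML?

Thanks a lot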

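【Question comments】:

The old-price element is probably generated after some JavaScript code has run.
This is a dynamically loaded JavaScript element, so you could try loading the site with Python Ghost [jeanphix.me/Ghost.py/] and then parse its content with BeautifulSoup (or with Ghost itself via JS-Query).
Looks like Ghost is the best option, thanks. «Executing javascripts inside webkit frame is one of the most interesting features provided by Ghost». I will try it as soon as possible.

Tags: python web-scraping beautifulsoup python-requests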
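One way to check the JavaScript hypothesis from the comments above is to list the class names that actually appear in the markup as served, before any script runs. The sketch below is only an illustration: it uses the standard library's html.parser instead of BeautifulSoup so that it runs without extra packages, and the names ClassCollector, static_classes, and sample are hypothetical, with sample standing in for r.text from the question.

```python
from html.parser import HTMLParser


class ClassCollector(HTMLParser):
    """Collect every CSS class name found in the static markup."""

    def __init__(self):
        super().__init__()
        self.classes = set()

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        for name, value in attrs:
            if name == "class" and value:
                self.classes.update(value.split())


def static_classes(html):
    """Return the set of class names present in the HTML as served."""
    collector = ClassCollector()
    collector.feed(html)
    collector.close()
    return collector.classes


# Hypothetical sample standing in for r.text from the question.
sample = '<div class="product_commande"><span class="price">12,00</span></div>'
found = static_classes(sample)
print("old-price" in found)
print("product_commande" in found)
```

If a class is visible in the browser's developer tools but missing from this set, the element was most likely added after page load, and a tool that executes JavaScript (such as Ghost.py, as suggested above) is needed to see it.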

【Solution 1】:

In my case, I was passing invalid HTML to Beautiful Soup, which caused it to ignore everything after the invalid tag at the start of the document:

<!--?xml version="1.0" encoding="iso-8859-1"?-->

Note that I am also using Ghost.py. Here is how I removed the tag.

from bs4 import BeautifulSoup

# Remove the invalid XML tag from the start of the page content.
# `ghost` is the Ghost.py object that has already loaded the page.
ghostContent = ghost.content
invalidCode = '<!--?xml version="1.0" encoding="iso-8859-1"?-->'
if ghostContent.startswith(invalidCode):
    ghostContent = ghostContent[len(invalidCode):]

doc = BeautifulSoup(ghostContent)

# Test to see if we can find the expected text
if 'Application Search Results' in doc.text:
    print('YES!')
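The prefix check above only matches one exact string. As a rough sketch, assuming the offending tag is always an XML declaration that has been turned into an HTML comment at the very start of the document, the same cleanup can be written with a regular expression so that other versions or encodings are handled too. The function name strip_commented_xml_declaration is hypothetical and not part of Ghost.py or Beautiful Soup.

```python
import re

# Matches a commented-out XML declaration such as
# <!--?xml version="1.0" encoding="iso-8859-1"?-->
# only when it appears at the start of the text (leading whitespace allowed).
_COMMENTED_XML_DECL = re.compile(r'\A\s*<!--\?xml.*?\?-->', re.DOTALL)


def strip_commented_xml_declaration(html):
    """Return html without a leading commented-out XML declaration."""
    return _COMMENTED_XML_DECL.sub('', html, count=1)


# Hypothetical page content standing in for ghost.content.
page = '<!--?xml version="1.0" encoding="iso-8859-1"?--><html><body>ok</body></html>'
cleaned = strip_commented_xml_declaration(page)
print(cleaned)
```

The cleaned string can then be passed to BeautifulSoup as in the answer. Text that does not start with such a declaration is returned unchanged.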

【Comments】: