Is there any way to parse the DOM tree of website content? [duplicate]

【Question title】: Is there any way to parse DOM tree for website content? [duplicate]
【Posted】: 2015-11-03 07:09:49
【Question description】:

There are packages for parsing a DOM tree from XML content, for example https://docs.python.org/2/library/xml.dom.minidom.html.

But I am not targeting XML, only the HTML content of web pages. I tried htmldom:

from htmldom import htmldom
dom = htmldom.HtmlDom( "http://www.yahoo.com" ).createDom()
# Find all the links on the page and print each one's "href" value
a = dom.find( "a" )
for link in a:
    print( link.attr( "href" ) )

But this gives me the following error:

Error while reading url: http://www.yahoo.com
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/htmldom/htmldom.py", line 333, in createDom
    raise Exception
Exception

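Note that I have already looked at BeautifulSoup, but it is not what I want. BeautifulSoup only works on the static HTML of a page; if the page content is loaded dynamically with JavaScript, it fails. I also do not want to locate elements with getElementByClassName and similar methods. I want something like dom.children(0).children(1).

So is there any way, for example with a headless browser or Selenium, to parse the entire DOM tree structure and reach the target element through children and grandchildren?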

【Question discussion】:

Tags: python selenium web-scraping phantomjs

【Solution 1】:

The Python Selenium API gives you everything you are likely to need. You can start from

html = driver.find_element_by_tag_name("html")

or

body = driver.find_element_by_tag_name("body")

and then continue from there with

body.find_element_by_xpath('./*[' + str(x) + ']')

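which is equivalent to "body.children(x-1)". Note that the XPath has to be relative to the element (it starts with "."); a path that starts with "/" is resolved from the document root even when it is called on an element. You do not need BeautifulSoup or any other DOM traversal framework on top of this, but you can of course take the page source and have it parsed by another library such as BeautifulSoup: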

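from bs4 import BeautifulSoup
soup = BeautifulSoup(driver.page_source, "html.parser")
soup.html.contents[0]  # ... (.contents is an indexable list; .children is only an iterator)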
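Returning to the Selenium-only route, the index-based lookup can be wrapped in a small helper so that a chain such as dom.children(0).children(1) becomes a single call. This is only a sketch: child_xpath and descend are hypothetical names, not part of Selenium. It assumes the generic find_element(by, value) method on a WebElement, with "xpath" as the locator strategy string (the value of By.XPATH). The find_element_by_* shortcuts used above were removed in Selenium 4.3, so the generic form is the one that works on current releases.

```python
def child_xpath(*indices):
    """Build a relative XPath for a chain of 0-based child indices.

    child_xpath(0, 1) mirrors dom.children(0).children(1).
    """
    if any(i < 0 for i in indices):
        raise ValueError("child indices must be non-negative")
    # XPath positions are 1-based, hence i + 1. The leading "." keeps the
    # path relative to the element the search starts from.
    return "." + "".join("/*[{}]".format(i + 1) for i in indices)


def descend(element, *indices):
    """Follow child indices down from a Selenium WebElement.

    Works with any object that exposes find_element(by, value).
    """
    if not indices:
        return element
    return element.find_element("xpath", child_xpath(*indices))


print(child_xpath(0, 1))  # prints ./*[1]/*[2]
```

With a live driver, usage would look like body = driver.find_element("tag name", "body") followed by target = descend(body, 0, 1). Selenium raises NoSuchElementException if the path does not exist.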

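【Discussion】:

【Solution 2】:

Yes, but it is not simple enough to include the code in an SO post. You are on the right track, though.

Basically, you will need to use a headless renderer of your choice (e.g. Selenium) to download all the resources and execute the JavaScript. There is really no point in reinventing the wheel there.

Then you need to write the HTML from the headless renderer out to a file on the page-ready event (every headless browser I have used offers this). At that point you can use BeautifulSoup on that file to navigate the DOM. BeautifulSoup does support child-based traversal: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#going-down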
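The child-based traversal mentioned here might look roughly like the sketch below. It assumes the bs4 package is installed and uses a short inline string, rendered_html, as a stand-in for the file the headless browser would have written. The helper names element_children and descend_children are hypothetical. Text nodes (including the whitespace between tags) are skipped so that the indices count elements only, as children(i) would in a browser DOM.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag


def element_children(tag):
    """Return only the element children of a tag, skipping text nodes."""
    return [child for child in tag.children if isinstance(child, Tag)]


def descend_children(tag, *indices):
    """Follow 0-based element indices, like .children(i).children(j)."""
    for index in indices:
        tag = element_children(tag)[index]
    return tag


# Stand-in for the rendered HTML saved by the headless browser.
rendered_html = """
<html>
  <head><title>demo</title></head>
  <body>
    <div>
      <p>first</p>
      <p>second</p>
    </div>
  </body>
</html>
"""

soup = BeautifulSoup(rendered_html, "html.parser")
# html -> body (index 1) -> div (index 0) -> second p (index 1)
target = descend_children(soup.html, 1, 0, 1)
print(target.get_text())  # prints second
```

An out-of-range index raises IndexError, which is probably the behaviour you want when the page structure differs from what the script expects.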

【Discussion】: