Going through HTML DOM in Python: answers

【Question title】: Going through HTML DOM in Python
【Posted】: 2015-03-12 03:18:44
【Question】:

I'm looking to write a Python script (using 3.4.3) that fetches an HTML page from a URL and can walk the DOM to find a specific element.

This is what I have so far:

#!/usr/bin/env python
import urllib.request

def getSite(url):
    return urllib.request.urlopen(url)

if __name__ == '__main__':
    content = getSite('http://www.google.com').read()
    print(content)

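When I print content, it prints the entire HTML page, which is close to what I want... but I would like to be able to navigate the DOM rather than treat the page as one giant string.

I'm fairly new to Python, but I have experience with several other languages (mainly Java, C#, C++, C, PHP, JS). I've done something similar in Java before and wanted to try it in Python.

Any help is appreciated. Cheers!

【Comments】:

You should use something like BeautifulSoup for this.
Nearly a duplicate of Parsing HTML Python.
You could also use lxml.

Tags: python html dom httprequest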

【Solution 1】:

There are many different modules you can use, for example lxml or BeautifulSoup.

Here is an lxml example:

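import urllib.request

import lxml.html

# Fetch the page as raw bytes and parse it into an element tree
mysite = urllib.request.urlopen('http://www.google.com').read()
lxml_mysite = lxml.html.fromstring(mysite)

# xpath() returns a list of matches; take the first <meta name="description">
description = lxml_mysite.xpath("//meta[@name='description']")[0]
text = description.get('content')  # value of the tag's content attribute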

>>> print(text)
"Search the world's information, including webpages, images, videos and more. Google has many special features to help you find exactly what you're looking for."

And a BeautifulSoup example:

import urllib.request

from bs4 import BeautifulSoup

mysite = urllib.request.urlopen('http://www.google.com').read()
# Name the parser explicitly; 'html.parser' is the one bundled with Python
soup_mysite = BeautifulSoup(mysite, 'html.parser')

description = soup_mysite.find("meta", {"name": "description"})  # first matching meta tag
text = description['content']  # value of the content attribute

>>> print(text)
u"Search the world's information, including webpages, images, videos and more. Google has many special features to help you find exactly what you're looking for."

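Note that BeautifulSoup returns a unicode string here, whereas lxml does not. (The u prefix in the output above is Python 2 notation; under Python 3 both libraries return str.) Depending on your needs, this can be helpful or a nuisance.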
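Under Python 3, the distinction that tends to matter more is between the bytes returned by urlopen(...).read() and the str you usually want to work with. A minimal sketch of decoding the body follows; the decode_body helper and the sample bytes are hypothetical, introduced only for illustration, and UTF-8 is an assumed fallback rather than something the answer specifies.

```python
def decode_body(raw, charset=None):
    """Decode response bytes to str, falling back to UTF-8 (an assumption)."""
    return raw.decode(charset or "utf-8", errors="replace")


# Hypothetical bytes standing in for urllib.request.urlopen(url).read()
raw = "<title>caf\u00e9</title>".encode("utf-8")
text = decode_body(raw)

# raw is bytes, text is str
print(type(raw).__name__, type(text).__name__)
```

With a live response object, the charset declared by the server can be read with response.headers.get_content_charset() and passed as the second argument. Both lxml and BeautifulSoup also accept the raw bytes directly, as the examples above do.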

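【Comments】:

It seems that trying to use BeautifulSoup gives me an error because I'm on Python 3.4.3.
'File "find.py", line 3, in <module> from bs4 import BeautifulSoup File "C:\Users\Jake\Desktop\bs4\__init__.py", line 175 except Exception, e: ^ SyntaxError: invalid syntax' I looked it up, and this seems to be related to the fact that it is a 2.x library?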
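The line quoted in the traceback, except Exception, e:, is Python 2 syntax that Python 3.4 rejects; the Python 3 spelling is except Exception as e:. The traceback also shows bs4 being loaded from a folder on the Desktop, which suggests (this is a guess, not something the thread confirms) that an unpacked Python 2 source copy sits next to the script and is picked up instead of an installed package. A small sketch for checking where a module would be imported from; module_origin is a hypothetical helper name.

```python
import importlib.util


def module_origin(name):
    """Return the file a top-level module would be imported from, or None."""
    spec = importlib.util.find_spec(name)
    return spec.origin if spec is not None else None


# A standard-library package, for comparison
print(module_origin("json"))
# Prints None if bs4 cannot be found on the import path
print(module_origin("bs4"))
```

If the path printed for bs4 points into the script's own directory, removing that folder and installing the package for the Python 3 interpreter (for example with pip install beautifulsoup4) is one thing to try.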
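Can someone tell me why people recommend BeautifulSoup or lxml rather than the native HTML parser?
@Shatu: In general, modules such as BeautifulSoup and lxml perform better.
@Shatu: Speed, memory usage, and so on. I'm not sure how they behave with malformed data, though.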
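For comparison, here is a rough sketch of the same two tasks (reading the meta description and listing link targets) using only the standard library, assuming "native HTML parser" refers to html.parser. The class name MetaAndLinkCollector and the sample markup are hypothetical. Note that html.parser is event-based: it reports tags as it meets them and does not build a tree you can navigate, which is part of what the third-party libraries add.

```python
from html.parser import HTMLParser


class MetaAndLinkCollector(HTMLParser):
    """Collects the meta description and all link targets from a page."""

    def __init__(self):
        super().__init__()
        self.description = None
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("name") == "description":
            self.description = attrs.get("content")
        elif tag == "a" and "href" in attrs:
            self.links.append(attrs["href"])


# Hypothetical sample markup, used here instead of a live request
SAMPLE_HTML = """
<html><head>
<meta name="description" content="An example page">
</head><body>
<a href="/about">About</a> <a href="http://example.com/">Example</a>
</body></html>
"""

collector = MetaAndLinkCollector()
collector.feed(SAMPLE_HTML)
collector.close()
print(collector.description)
print(collector.links)
```

Anything beyond flat collection, such as "the third link inside this div", means tracking nesting state by hand in the handler methods, whereas lxml and BeautifulSoup expose that structure directly.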

【Solution 2】:

Take a look at the BeautifulSoup module.

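import urllib.request

from bs4 import BeautifulSoup

# Python 3: urlopen lives in urllib.request (urllib.urlopen is Python 2 only)
html = urllib.request.urlopen("http://google.com").read()
soup = BeautifulSoup(html, "html.parser")

# Print the href attribute of every <a> tag on the page
for link in soup.find_all('a'):
    print(link.get('href'))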

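【Comments】:

Hi, this may well solve the problem... but it would be good if you could edit your answer and add some explanation of how and why it works :) Don't forget that there are lots of newbies on Stack Overflow who could learn a thing or two from your expertise; what is obvious to you may not be obvious to them.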
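By way of explanation: in the loop in Solution 2, find_all('a') returns every <a> tag in the parsed document, and link.get('href') returns that tag's href attribute, or None when the tag has none. The values come back exactly as written in the page, so they may be relative. A minimal sketch of cleaning them up with the standard library follows; absolute_links and the sample values are hypothetical, chosen only to illustrate the kinds of strings the loop can print.

```python
from urllib.parse import urljoin


def absolute_links(base_url, hrefs):
    """Resolve href values against base_url, skipping missing ones."""
    return [urljoin(base_url, href) for href in hrefs if href]


# Hypothetical values of the kind link.get('href') can return
hrefs = ["/intl/en/about.html", None, "http://example.com/page", "#top"]
links = absolute_links("http://google.com/", hrefs)

for url in links:
    print(url)
```

In the answer's code, the hrefs list could be built as [link.get('href') for link in soup.find_all('a')] and passed to the helper together with the URL that was fetched.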