Parsing HTML File using Python: the starting point [duplicate]

【Question title】: Parsing HTML File using Python: the starting point [duplicate]
【Posted】: 2012-05-11 15:53:08
【Question】:

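I have an HTML file in the following format. I want to parse it with Python, but I know nothing about using the xml module. Your suggestions are very welcome.

Note: Sorry again for my ignorance. The question is not specific. However, since I have been frustrated by this kind of parsing script, I do want a concrete answer to use as a starting point, like the ones the answerers have described (thank you all). I hope you understand.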

<html>
    <head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
    <title>Weibo Landscape: Historical Archive of 800 Verified Accounts</title>
    </head>
    <body>
<div><br>
related 1-th-weibo:<br>
mid:3365546399651413<br>
score:-5.76427445942 <br>
uid:1893278624 <br>
link:<a href="http://weibo.com/1893278624/xrv9ZEuLX"  target="_blank">source</a> <br>
time:Thu Oct 06 17:10:59 +0800 2011 <br>
content: Zuccotti Park。 <br>
<br></div>
<div><br>
related 2-th-weibo:<br>
mid:3366839418074456<br>
score:-5.80535767804 <br>
uid:1813080181 <br>
link:<a href="http://weibo.com/1813080181/xs2NvxSxa"  target="_blank">source</a> <br>
time:Mon Oct 10 06:48:53 +0800 2011 <br>
content:rt the tweet <br>
rtMid:3366833975690765 <br>
rtUid:1893801487 <br>
rtContent:#ows#here is the content and the link http://t.cn/aFLBgr <br>
<br></div>

    </body>
    </html>

Possible duplicate:
Extracting text from HTML file using Python

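【Comments】:

There are many questions about parsing HTML with Python. Please spend a few minutes searching around. In the question linked above, see the HTMLParser example.
Sure. I have searched, and that is not what I want. I want the result to be more structured, not just converted to text.
That was just one example; there are several Q&As about HTML parsing: stackoverflow.com/search?q=python%20html%20parse
@FrankWANG: Have you decided what you want to extract? What have you tried? If you are looking for a starting point, there are many other Q&As to get you set up. Your question is currently too broad, and you do not appear to have made any effort yourself.
@MattH, thanks for the reminder. I tried writing a parser with the xml module and the lxml module.

Tags: python html xml parsing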

【Solution 1】:

I did this as an exercise. If it is still useful, it should get you on the right track.

# -*- coding: utf-8 -*-

from BeautifulSoup import BeautifulSoup


html = '''<html>
    <head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
    <title>Weibo Landscape: Historical Archive of 800 Verified Accounts</title>
    </head>
    <body>
<div><br>
related 1-th-weibo:<br>
mid:3365546399651413<br>
score:-5.76427445942 <br>
uid:1893278624 <br>
link:<a href="http://weibo.com/1893278624/xrv9ZEuLX"  target="_blank">source</a> <br>
time:Thu Oct 06 17:10:59 +0800 2011 <br>
content: Zuccotti Park。 <br>
<br></div>
<div><br>
related 2-th-weibo:<br>
mid:3366839418074456<br>
score:-5.80535767804 <br>
uid:1813080181 <br>
link:<a href="http://weibo.com/1813080181/xs2NvxSxa"  target="_blank">source</a> <br>
time:Mon Oct 10 06:48:53 +0800 2011 <br>
content:rt the tweet <br>
rtMid:3366833975690765 <br>
rtUid:1893801487 <br>
rtContent:#ows#here is the content and the link http://t.cn/aFLBgr <br>
<br></div>

    </body>
    </html>'''

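data = []
soup = BeautifulSoup(html)  # BeautifulSoup 3 API, running under Python 2
divs = soup.findAll('div')
for div in divs:
    # Serialize the <div>, drop the <br /> tags, and split it into lines.
    div_string = str(div)
    div_string = div_string.replace('<br />', '')
    div_list = div_string.split('\n')
    # Discard the first and last lines, which hold the <div> and </div> tags.
    div_list = div_list[1:-1]
    record = []
    for item in div_list:
        # Split on the first ':' only, so values such as URLs stay intact.
        record.append( tuple(item.split(':', 1)) )
    data.append(record)

# Print each record's (key, value) tuples, followed by a separator line.
for record in data:
    for field in record:
        print field
    print '--------------'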

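With your sample data, you get the following output. With a little further processing, it should be easy to massage into whatever structure you want.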

('related 1-th-weibo', '')
('mid', '3365546399651413')
('score', '-5.76427445942 ')
('uid', '1893278624 ')
('link', '<a href="http://weibo.com/1893278624/xrv9ZEuLX" target="_blank">source</a> ')
('time', 'Thu Oct 06 17:10:59 +0800 2011 ')
('content', ' Zuccotti Park\xe3\x80\x82 ')
--------------
('related 2-th-weibo', '')
('mid', '3366839418074456')
('score', '-5.80535767804 ')
('uid', '1813080181 ')
('link', '<a href="http://weibo.com/1813080181/xs2NvxSxa" target="_blank">source</a> ')
('time', 'Mon Oct 10 06:48:53 +0800 2011 ')
('content', 'rt the tweet ')
('rtMid', '3366833975690765 ')
('rtUid', '1893801487 ')
('rtContent', '#ows#here is the content and the link http://t.cn/aFLBgr ')
--------------
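
The further processing could look something like the sketch below. It is only an illustration, written for Python 3 with the standard library, and it assumes records shaped like the (key, value) tuples printed above. The names `to_dict` and `HREF_RE` are made up for this example and are not part of the answer's code. The choice to convert `score` to a float and `time` to a `datetime` is also an assumption about what a useful structure would be.

```python
import re
from datetime import datetime

# Sample record copied from the output above (one record is enough to illustrate).
data = [
    [
        ('related 1-th-weibo', ''),
        ('mid', '3365546399651413'),
        ('score', '-5.76427445942 '),
        ('uid', '1893278624 '),
        ('link', '<a href="http://weibo.com/1893278624/xrv9ZEuLX" target="_blank">source</a> '),
        ('time', 'Thu Oct 06 17:10:59 +0800 2011 '),
        ('content', ' Zuccotti Park。 '),
    ],
]

# Hypothetical helper pattern: pulls the URL out of the serialized <a> tag.
HREF_RE = re.compile(r'href="([^"]*)"')


def to_dict(record):
    """Turn one list of (key, value) tuples into a cleaned-up dict."""
    result = {}
    for field in record:
        if len(field) != 2:
            continue  # the line had no ':' separator
        key, value = field
        value = value.strip()
        if not value:
            continue  # skip header lines such as 'related 1-th-weibo'
        if key == 'link':
            match = HREF_RE.search(value)
            value = match.group(1) if match else value
        elif key == 'score':
            value = float(value)
        elif key == 'time':
            # Assumes the default (English) locale for day and month names.
            value = datetime.strptime(value, '%a %b %d %H:%M:%S %z %Y')
        result[key] = value
    return result


records = [to_dict(record) for record in data]
for record in records:
    print(record)
```

Each record becomes a plain dict, so a field can be read as `records[0]['link']` or the whole list can be passed to something like `json.dumps` (with the `datetime` converted to a string first).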

【Comments】:

This is a great answer. Thank you.

【Solution 2】:

I suggest you take a look at the Python library BeautifulSoup. It helps you navigate and search HTML data.
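
As a rough illustration of what navigating and searching can look like, here is a minimal sketch. It assumes the current `beautifulsoup4` package (imported as `bs4`) on Python 3 rather than the older BeautifulSoup 3, and it uses a shortened copy of the question's markup. The dict layout and the variable names are assumptions made for this example only.

```python
from bs4 import BeautifulSoup

# Shortened copy of the markup from the question.
html = '''<html><body>
<div><br>
related 1-th-weibo:<br>
mid:3365546399651413<br>
uid:1893278624 <br>
link:<a href="http://weibo.com/1893278624/xrv9ZEuLX"  target="_blank">source</a> <br>
content: Zuccotti Park <br>
<br></div>
</body></html>'''

soup = BeautifulSoup(html, 'html.parser')

records = []
for div in soup.find_all('div'):
    record = {}
    # Searching: the first <a> inside this <div>, if there is one.
    anchor = div.find('a')
    if anchor is not None:
        record['link'] = anchor.get('href')
    # Navigating: walk the text fragments that sit between the <br> tags.
    for text in div.stripped_strings:
        key, sep, value = text.partition(':')
        value = value.strip()
        if sep and value:
            # Fragments with no ':' or no value (e.g. 'link:', 'source') are skipped.
            record[key] = value
    records.append(record)

for record in records:
    print(record)
```

Reading the `href` from the tag object avoids handling the serialized `<a ...>` string by hand, which is the main convenience over splitting the raw text.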

【Comments】: