【Posted】: 2020-08-16 21:49:30
【Question】:
I am trying to parse an HTML document and build a dictionary keyed by CSS class name, where each value is the list of tags that share that class.
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN">
<html>
<head>
<title>My first styled page</title>
</head>
<body>
<!-- Site navigation menu -->
<ul class="navbar">
<li class="one"><a href="index.html">Home page</a>
<li class="one"><a href="musings.html">Musings</a>
<li class="two"><a href="town.html">My town</a>
<li class="two"><a href="links.html">Links</a>
</ul>
<!-- Main content -->
<div class = "three">
<h1>My first styled page</h1>
<p>Welcome to my styled page!
</div>
<div class = "four">
<p>It lacks images, but at least it has style.
And it has links, even if they don't go
anywhere…
</p>
</div>
<div class = "five">
<p>There should be more here, but I don't know
what yet.
</div>
<!-- Sign and date the page, it's only polite! -->
<address>Made 5 April 2004<br>
by myself.
</address>
</body>
</html>
I would like the output to be a Python dictionary like the following:
{"one":[<li class="one"><a href="index.html">Home page</a>,
<li class="one"><a href="musings.html">Musings</a>],
"two":[
<li class="two"><a href="town.html">My town</a>,
<li class="two"><a href="links.html">Links</a>],
"three":[<div class = "three">
<h1>My first styled page</h1>
<p>Welcome to my styled page!
</div>],
"four":[<div class = "four">
<p>It lacks images, but at least it has style.
And it has links, even if they don't go
anywhere…
</p>
</div>],
"five":[<div class = "five">
<p>There should be more here, but I don't know
what yet.
</div>]
}
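I explored the Beautiful Soup Python package and its API to get the result above, but could not find a specific function that produces it. The CSS classes are not known in advance, so I need to discover them before I can group the tags.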
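For reference, the grouping described above could be sketched roughly as follows. This is a minimal illustration, not a confirmed solution: it assumes the `bs4` package is installed and uses the built-in `html.parser` backend. The helper name `group_tags_by_class` and the shortened, well-formed `sample` markup are hypothetical and not part of the original page.

```python
from collections import defaultdict

from bs4 import BeautifulSoup

# Shortened, well-formed sample (an assumption made for this sketch).
sample = """
<ul class="navbar">
  <li class="one"><a href="index.html">Home page</a></li>
  <li class="one"><a href="musings.html">Musings</a></li>
  <li class="two"><a href="town.html">My town</a></li>
</ul>
<div class="three"><h1>My first styled page</h1></div>
"""


def group_tags_by_class(markup):
    """Return {class_name: [Tag, ...]} for every class found in the markup."""
    soup = BeautifulSoup(markup, "html.parser")
    groups = defaultdict(list)
    # class_=True matches any tag that carries a class attribute,
    # so the class names do not need to be known beforehand.
    for tag in soup.find_all(class_=True):
        # Beautiful Soup exposes "class" as a list, since a tag may have several.
        for name in tag.get("class", []):
            groups[name].append(tag)
    return dict(groups)


groups = group_tags_by_class(sample)
for name, tags in groups.items():
    print(name, len(tags))
```

A tag with several classes ends up in the list for each of them. With lenient markup such as the page above, where `<li>` and `<p>` are left unclosed, the parser choice affects how tags nest, so the stored tags may contain more than expected.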
【Comments】:
Tags: python-3.x web-scraping beautifulsoup