Python get list of html tags with same CSS class: answers

【Question title】: Python get list of html tags with same CSS class
【Posted】: 2020-08-16 21:49:30
【Question】:

I am trying to parse an HTML document and build a dictionary that maps each CSS class name to the list of tags that share that class.

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN">

<html>
   <head>
      <title>My first styled page</title>
   </head>
   <body>
      <!-- Site navigation menu -->
      <ul class="navbar">
         <li class="one"><a href="index.html">Home page</a>
         <li class="one"><a href="musings.html">Musings</a>
         <li class="two"><a href="town.html">My town</a>
         <li class="two"><a href="links.html">Links</a>
      </ul>
      <!-- Main content -->
      <div class = "three">
         <h1>My first styled page</h1>
         <p>Welcome to my styled page!
      </div>
      <div class = "four">
         <p>It lacks images, but at least it has style.
            And it has links, even if they don't go
            anywhere&hellip;
         </p>
      </div>
      <div class = "five">
         <p>There should be more here, but I don't know
            what yet.
      </div>
      <!-- Sign and date the page, it's only polite! -->
      <address>Made 5 April 2004<br>
         by myself.
      </address>
   </body>
</html>

I would like the output to be a Python dictionary like the following:

{"one":[<li class="one"><a href="index.html">Home page</a>,
         <li class="one"><a href="musings.html">Musings</a>],
"two":[
         <li class="two"><a href="town.html">My town</a>,
         <li class="two"><a href="links.html">Links</a>],
"three":[<div class = "three">
         <h1>My first styled page</h1>
         <p>Welcome to my styled page!
      </div>],
"four":[<div class = "four">
         <p>It lacks images, but at least it has style.
            And it has links, even if they don't go
            anywhere&hellip;
         </p>
      </div>],
"five":[<div class = "five">
         <p>There should be more here, but I don't know
            what yet.
      </div>]
}

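I explored the Beautiful Soup Python package and its API to get the result above, but could not find a specific API/function that produces it. The difficulty is that I would need to know the CSS classes before grouping the tags, and in this case they are not known in advance.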

【Comments】:

Tags: python-3.x web-scraping beautifulsoup

【Solution 1】:

If the variable txt contains the HTML code from your question, then this script:

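from pprint import pprint

from bs4 import BeautifulSoup

# txt holds the HTML document from the question
soup = BeautifulSoup(txt, 'lxml')

# map each class name to the list of tags that carry it
out = {}
for tag in soup.find_all(lambda t: 'class' in t.attrs):
    for c in tag['class']:  # a tag may have several classes
        out.setdefault(c, []).append(tag)

# we don't want 'navbar' class:
del out['navbar']

pprint(out)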

prints:

{'five': [<div class="five">
<p>There should be more here, but I don't know
            what yet.
      </p></div>],
 'four': [<div class="four">
<p>It lacks images, but at least it has style.
            And it has links, even if they don't go
            anywhere…
         </p>
</div>],
 'one': [<li class="one"><a href="index.html">Home page</a>
</li>,
         <li class="one"><a href="musings.html">Musings</a>
</li>],
 'three': [<div class="three">
<h1>My first styled page</h1>
<p>Welcome to my styled page!
      </p></div>],
 'two': [<li class="two"><a href="town.html">My town</a>
</li>,
         <li class="two"><a href="links.html">Links</a>
</li>]}
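
The same grouping can be wrapped in a small reusable function. This is only a sketch: the function name `group_tags_by_class`, its `exclude` and `parser` parameters, and the `sample` markup are hypothetical and not part of the answer above. It assumes Beautiful Soup is installed and defaults to the built-in `html.parser` so that lxml is not required. Note that different parsers repair unclosed tags (such as the `<li>` elements in the question) differently, so the sample here uses well-formed HTML.

```python
from collections import defaultdict

from bs4 import BeautifulSoup


def group_tags_by_class(html, exclude=(), parser="html.parser"):
    """Map each CSS class name found in `html` to the tags that carry it.

    `exclude` is a collection of class names to leave out of the result.
    """
    soup = BeautifulSoup(html, parser)
    groups = defaultdict(list)
    # Visit every tag that has a class attribute; the class names
    # do not need to be known in advance.
    for tag in soup.find_all(lambda t: t.has_attr("class")):
        # Beautiful Soup exposes the class attribute as a list of names,
        # so a tag with several classes lands in several groups.
        for name in tag.get("class", []):
            if name not in exclude:
                groups[name].append(tag)
    return dict(groups)


# Hypothetical, well-formed sample input.
sample = """
<ul class="navbar">
  <li class="one"><a href="index.html">Home page</a></li>
  <li class="one"><a href="musings.html">Musings</a></li>
  <li class="two big"><a href="town.html">My town</a></li>
</ul>
<div class="three"><p>Welcome!</p></div>
"""

groups = group_tags_by_class(sample, exclude={"navbar"})
for name, tags in sorted(groups.items()):
    print(name, len(tags), [t.name for t in tags])
```

The values are Beautiful Soup `Tag` objects, so they can be queried further; calling `str(tag)` on one gives its HTML markup if plain strings are needed.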

【Comments】: