【Question Title】:Python get list of html tags with same CSS class
【Posted】:2020-08-16 21:49:30
【Question】:

I am trying to parse an HTML document and build a dictionary that maps each CSS class name to the list of tags that have that class.

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN">

<html>
   <head>
      <title>My first styled page</title>
   </head>
   <body>
      <!-- Site navigation menu -->
      <ul class="navbar">
         <li class="one"><a href="index.html">Home page</a>
         <li class="one"><a href="musings.html">Musings</a>
         <li class="two"><a href="town.html">My town</a>
         <li class="two"><a href="links.html">Links</a>
      </ul>
      <!-- Main content -->
      <div class = "three">
         <h1>My first styled page</h1>
         <p>Welcome to my styled page!
      </div>
      <div class = "four">
         <p>It lacks images, but at least it has style.
            And it has links, even if they don't go
            anywhere&hellip;
         </p>
      </div>
      <div class = "five">
         <p>There should be more here, but I don't know
            what yet.
      </div>
      <!-- Sign and date the page, it's only polite! -->
      <address>Made 5 April 2004<br>
         by myself.
      </address>
   </body>
</html>

I want the output to be a Python dictionary like the following:

{"one":[<li class="one"><a href="index.html">Home page</a>,
         <li class="one"><a href="musings.html">Musings</a>],
"two":[
         <li class="two"><a href="town.html">My town</a>,
         <li class="two"><a href="links.html">Links</a>],
"three":[<div class = "three">
         <h1>My first styled page</h1>
         <p>Welcome to my styled page!
      </div>],
"four":[<div class = "four">
         <p>It lacks images, but at least it has style.
            And it has links, even if they don't go
            anywhere&hellip;
         </p>
      </div>],
"five":[<div class = "five">
         <p>There should be more here, but I don't know
            what yet.
      </div>]
}

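I explored the Beautiful Soup Python package and its API to get the result above, but could not find any specific API or function that produces it. To group the tags I would first need to know the CSS classes, and in this case they are not known in advance.

【Comments on the question】:

    Tags: python-3.x web-scraping beautifulsoup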


    【Solution 1】:

    If the variable txt contains the HTML code from your question, then this script:

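    from pprint import pprint

    from bs4 import BeautifulSoup

    soup = BeautifulSoup(txt, 'lxml')

    # map each CSS class name to the list of tags that carry it
    out = {}
    for tag in soup.find_all(lambda t: 'class' in t.attrs):
        # tag['class'] is a list, because a tag can have several classes
        for c in tag['class']:
            out.setdefault(c, []).append(tag)

    # we don't want the 'navbar' class; pop() avoids a KeyError if it is absent
    out.pop('navbar', None)

    pprint(out)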
    

    Prints:

    {'five': [<div class="five">
    <p>There should be more here, but I don't know
                what yet.
          </p></div>],
     'four': [<div class="four">
    <p>It lacks images, but at least it has style.
                And it has links, even if they don't go
                anywhere…
             </p>
    </div>],
     'one': [<li class="one"><a href="index.html">Home page</a>
    </li>,
             <li class="one"><a href="musings.html">Musings</a>
    </li>],
     'three': [<div class="three">
    <h1>My first styled page</h1>
    <p>Welcome to my styled page!
          </p></div>],
     'two': [<li class="two"><a href="town.html">My town</a>
    </li>,
             <li class="two"><a href="links.html">Links</a>
    </li>]}
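
    The same grouping can also be wrapped in a small reusable function. The following is only a sketch under a few assumptions: the function name group_tags_by_class, its exclude parameter, and the shortened sample markup are hypothetical and not part of the original answer. It uses the built-in html.parser instead of lxml, so the whitespace in the printed tags may differ from the output above.

```python
from collections import defaultdict

from bs4 import BeautifulSoup


def group_tags_by_class(html, exclude=()):
    """Map each CSS class name to the list of tags that carry it.

    Hypothetical helper: `exclude` is an illustrative parameter holding
    class names that should be left out of the result.
    """
    soup = BeautifulSoup(html, "html.parser")
    groups = defaultdict(list)
    # class_=True matches every tag that has a class attribute
    for tag in soup.find_all(class_=True):
        # tag["class"] is a list, since class is a multi-valued attribute
        for name in tag["class"]:
            if name not in exclude:
                groups[name].append(tag)
    return dict(groups)


# Shortened sample markup (an assumption, not the full page from the question)
sample = """
<ul class="navbar">
  <li class="one"><a href="index.html">Home page</a></li>
  <li class="one"><a href="musings.html">Musings</a></li>
  <li class="two"><a href="town.html">My town</a></li>
</ul>
<div class="three big"><p>Welcome</p></div>
"""

groups = group_tags_by_class(sample, exclude={"navbar"})
for name, tags in groups.items():
    # print the class name and the names of the tags grouped under it
    print(name, [t.name for t in tags])
```

    A tag with several classes, such as the div with class="three big" in the sample, is listed under each of its classes.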
    

    【Comments】:
