【Question title】: How would I parse this HTML using BeautifulSoup?
【Posted】: 2021-02-11 15:21:44
【Question】:

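I am trying to scrape the top 100 songs chart from Acharts.co using Python and the BeautifulSoup module. So far I have managed to get the song title for a given chart position, but I am stuck on getting the artist names.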

import requests
from bs4 import BeautifulSoup

url = "https://acharts.co/canada_singles_top_100/2021/05"

headers = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9", 
    "Accept-Encoding": "gzip, deflate, br", 
    "Accept-Language": "en,de;q=0.9,en-US;q=0.8,fr-FR;q=0.7,fr;q=0.6,es;q=0.5",  
    "authority": "acharts.co", 
    "Upgrade-Insecure-Requests": "1",
    "User-Agent": "Mozilla/5.0 (Windows NT 6.3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 YaBrowser/17.6.1.749 Yowser/2.5 Safari/537.36"
}    

response = requests.get(url, headers=headers)
response.encoding = 'utf-8'

soup = BeautifulSoup(response.text, 'html.parser')
for item in soup.select("td"):
    if item['class'][0] == 'cPrinciple':
        song = item.a.span.get_text()
        print(song)

This is the part of the HTML I want to parse:

<td class="cPrinciple" itemprop="item" itemscope itemtype="http://schema.org/MusicRecording">
    <a href="https://acharts.co/song/156580" itemprop="url"><span itemprop="name">Mood</span></a>



    <br />
    <span class="Sub">
            <span itemprop="byArtist" itemscope itemtype="http://schema.org/MusicGroup">
                <meta itemprop="url" content="https://acharts.co/artist/24kgoldn" />
                <span itemprop="name">24Kgoldn</span>
            </span> and 
            <span itemprop="byArtist" itemscope itemtype="http://schema.org/MusicGroup">
                <meta itemprop="url" content="https://acharts.co/artist/iann_dior" />
                <span itemprop="name">Iann Dior</span>
            </span>
    </span>

So in the snippet above, how would I extract "Mood" (the song name), "24Kgoldn" (artist #1), and "Iann Dior" (artist #2)? Thanks in advance.

【Comments】:

    Tags: python html web-scraping beautifulsoup


    【Solution 1】:

    You can do it like this:

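    soup = BeautifulSoup(response.text, 'html.parser')
    # Select only the cells that hold a chart entry; this also skips <td> tags without a class.
    for item in soup.select("td.cPrinciple"):
        # Reset per row so a cell without a "Sub" span cannot reuse stale artists.
        artists = []
        e = item.find("span", {"class": "Sub"})
        if e is not None:
            # The inner <span itemprop="name"> elements hold the artist names.
            results = e.find_all("span", {"itemprop": "name"})
            artists = [x.text for x in results]
        song = item.a.span.get_text()
        print(artists)
        print(song)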
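To try this approach without hitting the site, the sketch below runs the same lookups against a copy of the markup from the question. The wrapping `<table><tr>` and the closing `</td>` are added here so the fragment parses as a cell, and `parse_entry` is a hypothetical helper name, not something from the original answer.

```python
from bs4 import BeautifulSoup

# Copy of the markup from the question; the <table>/<tr> wrapper and the
# closing </td> are assumptions added so the fragment is a complete cell.
HTML = """
<table><tr>
<td class="cPrinciple" itemprop="item" itemscope itemtype="http://schema.org/MusicRecording">
    <a href="https://acharts.co/song/156580" itemprop="url"><span itemprop="name">Mood</span></a>
    <br />
    <span class="Sub">
        <span itemprop="byArtist" itemscope itemtype="http://schema.org/MusicGroup">
            <meta itemprop="url" content="https://acharts.co/artist/24kgoldn" />
            <span itemprop="name">24Kgoldn</span>
        </span> and
        <span itemprop="byArtist" itemscope itemtype="http://schema.org/MusicGroup">
            <meta itemprop="url" content="https://acharts.co/artist/iann_dior" />
            <span itemprop="name">Iann Dior</span>
        </span>
    </span>
</td>
</tr></table>
"""


def parse_entry(cell):
    """Return (song, artists) for one td.cPrinciple cell (hypothetical helper)."""
    # The song title is the <span> inside the first link of the cell.
    song = cell.a.span.get_text(strip=True)
    # Artist names live in <span itemprop="name"> under <span class="Sub">.
    sub = cell.find("span", {"class": "Sub"})
    artists = []
    if sub is not None:
        artists = [s.get_text(strip=True)
                   for s in sub.find_all("span", {"itemprop": "name"})]
    return song, artists


soup = BeautifulSoup(HTML, "html.parser")
entries = [parse_entry(td) for td in soup.select("td.cPrinciple")]
print(entries)
```

Returning a tuple per cell keeps each song paired with its own artists, which is easier to reuse than printing the two values separately.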
    

    【Comments】:

      【Solution 2】:

      A more compact way (using a list comprehension):

      import requests as rq
      from bs4 import BeautifulSoup as bs
      
      headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 YaBrowser/17.6.1.749 Yowser/2.5 Safari/537.36"}
      url = "https://acharts.co/canada_singles_top_100/2021/05"
      resp = rq.get(url, headers=headers)
      # Name the parser explicitly; otherwise bs4 guesses one and emits a warning.
      soup = bs(resp.content, "html.parser")
      
      tbody = soup.find_all("tbody")[0]
      
      # One list per table row; spans whose text contains a newline are wrappers, not names.
      rows = [
          [span.text for span in row.find_all("span", attrs={"itemprop": True}) if "\n" not in span.text]
          for row in tbody.find_all("tr")
      ]
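As a possible variation on the row-based approach, the sketch below selects the `itemprop="name"` spans directly with a CSS attribute selector, so the outer `byArtist` wrappers never need to be filtered out by their newline text. The inline HTML is a simplified stand-in for one chart row, and treating the first name in a row as the song title is an assumption based on the element order in the question's markup.

```python
from bs4 import BeautifulSoup

# Simplified stand-in for one chart row (assumption: the real rows follow
# the structure shown in the question, inside a <tbody>).
HTML = """
<table><tbody>
<tr><td class="cPrinciple">
  <a href="#"><span itemprop="name">Mood</span></a>
  <span class="Sub">
    <span itemprop="byArtist">
      <span itemprop="name">24Kgoldn</span>
    </span> and
    <span itemprop="byArtist">
      <span itemprop="name">Iann Dior</span>
    </span>
  </span>
</td></tr>
</tbody></table>
"""

soup = BeautifulSoup(HTML, "html.parser")
tbody = soup.find("tbody")

chart = []
for row in tbody.find_all("tr"):
    # select() returns matches in document order: song title first, then artists.
    names = [s.get_text(strip=True) for s in row.select('span[itemprop="name"]')]
    if names:  # skip rows without any name span, such as header rows
        chart.append({"song": names[0], "artists": names[1:]})

print(chart)
```

Splitting each row into a `song` and an `artists` list makes the result self-describing, at the cost of relying on the element order noted above.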
      

      【Comments】:
