【Question title】: Python BeautifulSoup Scraper - apply function to each <li> element in <ol> on page
【Posted】: 2021-10-26 22:18:41
【Question】:

We are scraping Billboard's Hot 100 list at https://www.billboard.com/charts/hot-100/2021-10-30 and have some decent code, but are struggling to finish it:

from bs4 import BeautifulSoup 
import requests
import pandas as pd

def CleanBullet(bullet):
    # Read every field from the single <li> passed in as `bullet`
    this_rank = bullet.find("span", class_="chart-element__rank").get_text().strip('\n').strip('\n').strip('Rising')
    this_song = bullet.find("span", class_="chart-element__information__song").get_text().strip('\n')
    this_artist = bullet.find("span", class_="chart-element__information__artist").get_text().strip('\n')
    this_last_week = bullet.find("span", class_="text--last").get_text().strip(' Last Week')
    this_peak = bullet.find("span", class_="text--peak").get_text().strip(' Peak Rank')
    this_weeks_on = bullet.find("span", class_="text--week").get_text().strip(' Weeks on Chart')

    data={
        'rank': this_rank,
        'song': this_song,
        'artist': this_artist,
        'last_week': this_last_week,
        'peak': this_peak,
        'weeks_on': this_weeks_on
    }
    # Return a one-row DataFrame for this <li>
    return pd.DataFrame([data])


base_url = "https://www.billboard.com/charts/hot-100/2021-10-30"
response = requests.get(base_url)
web_page = response.text
soup = BeautifulSoup(web_page, "html.parser")    
full_table = soup.find("ol", class_="chart-list__elements").find_all("li")

df1 = CleanBullet(full_table[0])
df1

How can we:

  • apply this function to each of the 100 elements in full_table, producing a dataframe with 100 rows?
  • remove the \n from the rank column, since strip('\n') does not seem to work...

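【Comments】:

  • strip() with no arguments removes all leading and trailing whitespace (including newlines)
  • strip() in the example above did not remove the \n newline the way I need it to.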
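The behaviour discussed in these comments can be reproduced in isolation. The sketch below assumes the rank span's text looks like "\n1\nRising\n" (a guess at what get_text() returned, not something the thread confirms). It shows that str.strip(chars) treats its argument as a set of characters to remove from both ends, so stripping 'Rising' exposes a newline that the earlier strip('\n') calls never saw.

```python
import re

# Assumed raw text of the rank <span>; the real page content may differ.
raw = "\n1\nRising\n"

# strip('\n') only removes newlines at the two ends of the string.
step1 = raw.strip("\n")

# strip('Rising') removes any of the characters R, i, s, n, g from both ends.
# It stops at the inner newline, which is now the last character.
step2 = step1.strip("Rising")

# A final whitespace strip removes the newline that was left behind.
step3 = step2.strip()

# Alternative that does not depend on the order of strip calls:
# take the first run of digits.
rank = re.search(r"\d+", raw).group()

print(repr(step1), repr(step2), repr(step3), repr(rank))
```

The same character-set behaviour is why strip(' Weeks on Chart') happens to work on text such as "3 Weeks on Chart": digits are not in the set, so only the label is removed.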

Tags: python pandas beautifulsoup


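【Solution 1】:

I would probably just extract all the data from the JavaScript object that already holds this information. Use a list comprehension to generate a list of dictionaries and convert it to a df.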

import requests
import pandas as pd
import re
import json
import html

r = requests.get('https://www.billboard.com/charts/hot-100/2021-10-30')
data = json.loads(html.unescape(re.search(r'data-charts="(.*)"',
                  r.text).group(1)))
df = pd.DataFrame(
    [{'rank': i['rank'],
     'song': i['title'],
     'artist': i['artist_name'],
     'last_week': str(i['history']['last_week']).split('.')[0],
     'peak': i['history']['peak_rank'],
     'weeks_on': i['history']['weeks_on_chart']} for i in data]
)
# df.to_csv('top100.csv', index = False)
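To try the extraction step without hitting the network, here is a minimal sketch on an inline string. The sample entries, the surrounding div, and the data-other attribute are made up for illustration; only the data-charts attribute name and the field names come from the answer above. It uses [^"]* rather than .* so the match stops at the attribute's closing quote, which matters if other attributes follow on the same line.

```python
import html
import json
import re

import pandas as pd

# Hypothetical chart entries; field names mirror those used in the answer.
entries = [
    {"rank": 1, "title": "Song A", "artist_name": "Artist A",
     "history": {"last_week": None, "peak_rank": 1, "weeks_on_chart": 1}},
    {"rank": 2, "title": "Song B", "artist_name": "Artist B",
     "history": {"last_week": 1.0, "peak_rank": 1, "weeks_on_chart": 15}},
]

# Embed the JSON in an HTML attribute: double quotes become &quot;.
attribute_value = html.escape(json.dumps(entries), quote=True)
page = f'<div id="charts" data-charts="{attribute_value}" data-other="x"></div>'

# Capture everything up to the attribute's closing quote, then undo the
# HTML escaping before parsing the JSON.
match = re.search(r'data-charts="([^"]*)"', page)
data = json.loads(html.unescape(match.group(1)))

df = pd.DataFrame(
    [{"rank": i["rank"],
      "song": i["title"],
      "artist": i["artist_name"],
      # str(...).split('.')[0] turns 1.0 into "1" and None into "None"
      "last_week": str(i["history"]["last_week"]).split(".")[0],
      "peak": i["history"]["peak_rank"],
      "weeks_on": i["history"]["weeks_on_chart"]} for i in data]
)
print(df)
```

Note that a missing last_week ends up as the string "None" with this conversion, so a later numeric cast would need to handle that value.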

Side note:

I am curious why the following, which drops the list brackets and passes a generator expression, still works:

df = pd.DataFrame(
        {'rank': i['rank'],
         'song': i['title'],
         'artist': i['artist_name'],
         'last_week': str(i['history']['last_week']).split('.')[0],
         'peak': i['history']['peak_rank'],
         'weeks_on': i['history']['weeks_on_chart']} for i in data
    )

I assume some very inefficient copying is going on.
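A small check of the side note, using made-up records rather than the chart data: the DataFrame constructor is documented as accepting an iterable, and my understanding (not verified against the pandas source here) is that a generator is simply consumed into a list before the frame is built, so the cost should be close to that of the list version.

```python
import pandas as pd

# Hypothetical records standing in for the parsed chart entries.
records = [
    {"rank": 1, "song": "Song A"},
    {"rank": 2, "song": "Song B"},
]

# List comprehension: the constructor receives a list of dicts.
df_from_list = pd.DataFrame([dict(r) for r in records])

# Generator expression: the constructor receives a one-shot iterable.
df_from_gen = pd.DataFrame(dict(r) for r in records)

# Both calls produce the same frame.
print(df_from_list.equals(df_from_gen))
```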

【Discussion】:

    【Solution 2】:

    How about this?

    from bs4 import BeautifulSoup 
    import requests
    import pandas as pd
    
    def CleanBullet(bullets):
        # Collect one dict per <li>, then build the DataFrame in a single call
        # (DataFrame.append was removed in pandas 2.0).
        rows = []
        for bullet in bullets:
            # str.strip(chars) removes any of the listed characters from both ends,
            # so the last strip('\n') clears the newline exposed by strip('Rising').
            this_rank = bullet.find("span", class_="chart-element__rank").get_text().strip('\n').strip('Rising').strip('\n')
            this_song = bullet.find("span", class_="chart-element__information__song").get_text().strip('\n')
            this_artist = bullet.find("span", class_="chart-element__information__artist").get_text().strip('\n')
            this_last_week = bullet.find("span", class_="text--last").get_text().strip(' Last Week')
            this_peak = bullet.find("span", class_="text--peak").get_text().strip(' Peak Rank')
            this_weeks_on = bullet.find("span", class_="text--week").get_text().strip(' Weeks on Chart')

            rows.append({
                'rank': this_rank,
                'song': this_song,
                'artist': this_artist,
                'last_week': this_last_week,
                'peak': this_peak,
                'weeks_on': this_weeks_on
            })
        return pd.DataFrame(rows)
    
    
    base_url = "https://www.billboard.com/charts/hot-100/2021-10-30"
    # verify=False disables TLS certificate verification; drop it unless your network requires it
    response = requests.get(base_url, verify=False)
    web_page = response.text
    soup = BeautifulSoup(web_page, "html.parser")    
    full_table = soup.find("ol", class_="chart-list__elements").find_all("li")
    
    df1 = CleanBullet(full_table)
    
    print(df1)
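For the original request (apply one function to each <li> and get one row per element), a per-item function plus a list comprehension is another option. This is only a sketch: the inline HTML is a hypothetical stand-in that reuses the class names from the question, and first_int and clean_bullet are illustrative helper names, not part of either answer.

```python
import re

import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical markup; only the class names are taken from the question.
SAMPLE_HTML = """
<ol class="chart-list__elements">
  <li>
    <span class="chart-element__rank">
1
Rising
</span>
    <span class="chart-element__information__song">Song A</span>
    <span class="chart-element__information__artist">Artist A</span>
    <span class="text--last">- Last Week</span>
    <span class="text--peak">1 Peak Rank</span>
    <span class="text--week">1 Weeks on Chart</span>
  </li>
  <li>
    <span class="chart-element__rank">2</span>
    <span class="chart-element__information__song">Song B</span>
    <span class="chart-element__information__artist">Artist B</span>
    <span class="text--last">1 Last Week</span>
    <span class="text--peak">1 Peak Rank</span>
    <span class="text--week">15 Weeks on Chart</span>
  </li>
</ol>
"""


def first_int(text):
    """Return the first run of digits in text as an int, or None if absent."""
    match = re.search(r"\d+", text)
    return int(match.group()) if match else None


def clean_bullet(bullet):
    """Turn one <li> element into a plain dict (one future DataFrame row)."""
    def text_of(css_class):
        return bullet.find("span", class_=css_class).get_text()

    return {
        "rank": first_int(text_of("chart-element__rank")),
        "song": text_of("chart-element__information__song").strip(),
        "artist": text_of("chart-element__information__artist").strip(),
        "last_week": first_int(text_of("text--last")),
        "peak": first_int(text_of("text--peak")),
        "weeks_on": first_int(text_of("text--week")),
    }


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
bullets = soup.find("ol", class_="chart-list__elements").find_all("li")

# One dict per <li>, one DataFrame built at the end.
df = pd.DataFrame([clean_bullet(b) for b in bullets])
print(df)
```

Returning dicts instead of one-row DataFrames keeps the per-item function cheap and avoids repeated concatenation. Extracting digits with a regular expression also sidesteps the order-dependent strip chain, at the cost of turning entries without a number (such as "-") into missing values.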
    

    【Discussion】:
