Python BeautifulSoup Scraper - apply function to each <li> element in <ol> on page

【Question title】：Python BeautifulSoup Scraper - apply function to each <li> element in <ol> on page
【Posted】：2021-10-26 22:18:41
【Question】：

We are scraping Billboard's Hot 100 chart at https://www.billboard.com/charts/hot-100/2021-10-30 and have some decent code, but are struggling to finish it:

from bs4 import BeautifulSoup 
import requests
import pandas as pd

def CleanBullet(bullet):
    # Read from the <li> passed in as `bullet`; the original body used an undefined `all_bullets[0]`.
    this_rank = bullet.find("span", class_="chart-element__rank").get_text().strip('\n').strip('\n').strip('Rising')
    this_song = bullet.find("span", class_="chart-element__information__song").get_text().strip('\n')
    this_artist = bullet.find("span", class_="chart-element__information__artist").get_text().strip('\n')
    this_last_week = bullet.find("span", class_="text--last").get_text().strip(' Last Week')
    this_peak = bullet.find("span", class_="text--peak").get_text().strip(' Peak Rank')
    this_weeks_on = bullet.find("span", class_="text--week").get_text().strip(' Weeks on Chart')

    data={
        'rank': this_rank,
        'song': this_song,
        'artist': this_artist,
        'last_week': this_last_week,
        'peak': this_peak,
        'weeks_on': this_weeks_on
    }
    # Build a one-row frame directly; DataFrame.append was removed in pandas 2.0.
    return pd.DataFrame([data])


base_url = "https://www.billboard.com/charts/hot-100/2021-10-30"
response = requests.get(base_url)
web_page = response.text
soup = BeautifulSoup(web_page, "html.parser")    
full_table = soup.find("ol", class_="chart-list__elements").find_all("li")

df1 = CleanBullet(full_table[0])
df1

How can we:

Apply this function to each of the 100 elements in full_table, producing a dataframe with 100 rows?
Remove the \n from the rank column, since strip('\n') doesn't seem to work...

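【Comments】：

strip() with no arguments removes all leading and trailing whitespace (including newlines).
strip() did not remove the \n newlines in the example above the way I need.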
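
A small sketch of what is probably happening with the rank column, assuming the text returned by get_text() looks roughly like the hypothetical raw_rank string below (the real page text was not shown). str.strip(chars) only removes characters from the two ends of the string, and it treats its argument as a set of characters rather than as a word, so stripping 'Rising' after the newlines exposes new trailing newlines.

```python
# Hypothetical text resembling what get_text() might return for a rank cell.
raw_rank = "\n1\n\nRising\n"

# strip(chars) removes any of the given characters, from both ends only.
step1 = raw_rank.strip("\n")      # outer newlines are gone, inner ones remain
step2 = step1.strip("Rising")     # trailing 'Rising' is gone, newlines are exposed again

# Splitting on whitespace and taking the first token avoids the ordering problem.
first_token = raw_rank.split()[0]

print(repr(step1), repr(step2), repr(first_token))
```

Under that assumption, either a final strip() with no arguments or the split() approach leaves just the rank.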

Tags： python pandas beautifulsoup

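【Solution 1】：

I would probably just extract all the data from the JavaScript object that already contains this information. Use a list comprehension to build a list of dictionaries and convert it to a df.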

import requests
import pandas as pd
import re
import json
import html

r = requests.get('https://www.billboard.com/charts/hot-100/2021-10-30')
data = json.loads(html.unescape(re.search(r'data-charts="(.*)"',
                  r.text).group(1)))
df = pd.DataFrame(
    [{'rank': i['rank'],
     'song': i['title'],
     'artist': i['artist_name'],
     'last_week': str(i['history']['last_week']).split('.')[0],
     'peak': i['history']['peak_rank'],
     'weeks_on': i['history']['weeks_on_chart']} for i in data]
)
# df.to_csv('top100.csv', index = False)
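
The same extraction idea can be tried offline. This is only a sketch: sample_page is a hypothetical fragment, and the assumption is that the data-charts attribute holds HTML-escaped JSON, as the code above implies. The pattern here uses [^"]* instead of (.*), because a greedy (.*) would run on to the last double quote on the line if other attributes follow.

```python
import html
import json
import re

# Hypothetical fragment: the attribute value is HTML-escaped JSON.
sample_page = (
    '<div id="charts" data-charts="[{&quot;rank&quot;:1,'
    '&quot;title&quot;:&quot;Song A&quot;,'
    '&quot;artist_name&quot;:&quot;Artist A&quot;}]" data-other="x"></div>'
)

# [^"]* stops at the first closing quote of the attribute value.
match = re.search(r'data-charts="([^"]*)"', sample_page)

# Unescape the HTML entities first, then parse the result as JSON.
records = json.loads(html.unescape(match.group(1)))

# One dictionary per chart entry, ready to hand to pd.DataFrame.
rows = [
    {"rank": r["rank"], "song": r["title"], "artist": r["artist_name"]}
    for r in records
]
print(rows)
```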

Side note:

I am curious why the following still works:

df = pd.DataFrame(
        {'rank': i['rank'],
         'song': i['title'],
         'artist': i['artist_name'],
         'last_week': str(i['history']['last_week']).split('.')[0],
         'peak': i['history']['peak_rank'],
         'weeks_on': i['history']['weeks_on_chart']} for i in data
    )

I assume some kind of very inefficient copying is going on.
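
One likely explanation, as far as I understand pandas: without the square brackets the argument is a generator expression, and the DataFrame constructor appears to turn a non-sequence iterable into a list before building the frame. If so, the cost is one extra list(...) call rather than repeated copying. The sketch below uses hypothetical records in place of the parsed chart data and only checks that both spellings give the same frame.

```python
import pandas as pd

# Hypothetical records standing in for the parsed chart data.
data = [{"rank": 1, "title": "Song A"}, {"rank": 2, "title": "Song B"}]

# List comprehension: the constructor receives a list of dicts.
from_list = pd.DataFrame([{"rank": i["rank"], "song": i["title"]} for i in data])

# Generator expression: the constructor receives a lazy iterable of dicts.
from_generator = pd.DataFrame({"rank": i["rank"], "song": i["title"]} for i in data)

print(from_list.equals(from_generator))
```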

【Comments】：

【Solution 2】：

How about this?

from bs4 import BeautifulSoup 
import requests
import pandas as pd

def CleanBullet(bullets):
    # Collect one dict per <li> and build the DataFrame once at the end
    # (DataFrame.append was removed in pandas 2.0).
    rows = []
    for bullet in bullets:
        # str.strip(chars) removes a set of characters from both ends, so the final
        # strip('\n') clears the newlines exposed once 'Rising' has been removed.
        this_rank = bullet.find("span", class_="chart-element__rank").get_text().strip('\n').strip('Rising').strip('\n')
        this_song = bullet.find("span", class_="chart-element__information__song").get_text().strip('\n')
        this_artist = bullet.find("span", class_="chart-element__information__artist").get_text().strip('\n')
        this_last_week = bullet.find("span", class_="text--last").get_text().strip(' Last Week')
        this_peak = bullet.find("span", class_="text--peak").get_text().strip(' Peak Rank')
        this_weeks_on = bullet.find("span", class_="text--week").get_text().strip(' Weeks on Chart')

        rows.append({
            'rank': this_rank,
            'song': this_song,
            'artist': this_artist,
            'last_week': this_last_week,
            'peak': this_peak,
            'weeks_on': this_weeks_on
        })
    return pd.DataFrame(rows)


base_url = "https://www.billboard.com/charts/hot-100/2021-10-30"
response = requests.get(base_url,  verify = False)
web_page = response.text
soup = BeautifulSoup(web_page, "html.parser")    
full_table = soup.find("ol", class_="chart-list__elements").find_all("li")

df1 = CleanBullet(full_table)

print(df1)
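
An alternative shape, sketched here on hypothetical markup rather than the live page: keep the function working on a single <li>, return a plain dict, and apply it with a list comprehension. The names clean_bullet and sample_html are made up for illustration, the class names are copied from the code above, and only two fields are shown to keep it short. get_text(strip=True) is used in place of the chained strip calls.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that reuses the class names from the code above.
sample_html = """
<ol class="chart-list__elements">
  <li><span class="chart-element__rank"> 1 </span>
      <span class="chart-element__information__song">Song A</span></li>
  <li><span class="chart-element__rank"> 2 </span>
      <span class="chart-element__information__song">Song B</span></li>
</ol>
"""

def clean_bullet(bullet):
    """Turn one <li> element into a dict of cleaned fields."""
    return {
        # strip=True trims the whitespace around each piece of text.
        "rank": bullet.find("span", class_="chart-element__rank").get_text(strip=True),
        "song": bullet.find("span", class_="chart-element__information__song").get_text(strip=True),
    }

soup = BeautifulSoup(sample_html, "html.parser")
items = soup.find("ol", class_="chart-list__elements").find_all("li")

# Apply the function to every <li>; the result is a list of dicts.
rows = [clean_bullet(li) for li in items]
print(rows)
```

Passing rows to pd.DataFrame(rows) would then give one row per <li>.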

【Comments】：