Difficulties with web scraping: answer

【Question title】: Difficulties with web scraping
【Posted】: 2022-01-21 17:21:21
【Question description】:

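I just came across an article called The 500 Greatest Songs of All Time and thought, "oh, that's cool, I bet they also made a Spotify/Apple Music list I can follow". Well... they didn't.

In short, I would like to know whether it is possible to 1) scrape the website to extract the songs, and 2) then do some kind of bulk upload to Spotify to create the list.

The titles and artists of the songs are structured in the website as follows: Website screenshot. I have already tried scraping the site with the importxml() formula in Google Sheets, without success.

I understand that the scraping part is easier than the rest, and since I am new to programming I would be happy to achieve even part of this. I believe the task can be done easily in Python.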

【Discussion】:

Tags: html web-scraping xpath

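【Solution 1】:

I feel that explaining everything would go beyond the scope here, so I tried to comment the code well enough.

1. Scrape the songs

I used Python 3 and Selenium; their website didn't block it. Be sure to adjust the path to your chromedriver if necessary, as well as the output path of the .txt file at the bottom. Once the script has finished and you have the .txt file, you can close the browser.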

import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service

s = Service(r'/Users/main/Desktop/chromedriver')
driver = webdriver.Chrome(service=s)

# just setting some vars; I used XPath because that's what I know
top_500 = 'https://www.rollingstone.com/music/music-lists/best-songs-of-all-time-1224767/'
cookie_button_xpath = "//button[@id='onetrust-accept-btn-handler']"
div_containing_links_xpath = "//div[@id='pmc-gallery-list-nav-bar-render']//child::a"
song_names_xpath = "//article[@class='c-gallery-vertical-album']/child::h2"

links = []
songs = []


driver.get(top_500)


# accept cookies, give time to load
time.sleep(3)
cookie_btn = driver.find_element(By.XPATH, cookie_button_xpath)
cookie_btn.click()
time.sleep(1)


# extracting all the links since there are only 50 songs per page
links_to_next_pages = driver.find_elements(By.XPATH, div_containing_links_xpath)

for element in links_to_next_pages:
    l = element.get_attribute('href')
    links.append(l)


# extracting the songs, then going to the next page and so on until we hit 500
counter = 1         # we're starting with 1 here since links[0] is the page we are already on

while True:
    song_elements = driver.find_elements(By.XPATH, song_names_xpath)

    for element in song_elements:
        songs.append(element.text)

    # stop at 500 songs, or when we run out of pages (avoids an IndexError or an endless loop)
    if len(songs) >= 500 or counter >= len(links):
        break

    driver.get(links[counter])
    counter += 1

    time.sleep(2)


# verify that there are no duplicates; if there were, something would be off
if len(songs) != len(set(songs)):
    print('found duplicates, something is off')
else:
    print('seems fine')


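# write one song per line; UTF-8 so accented names survive
with open('/Users/main/Desktop/output_songs.txt', 'w', encoding='utf-8') as file:
    file.writelines(line + '\n' for line in songs)

driver.quit()   # close the browser once the file is written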
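
The duplicate check in the script only says that something is off, not what. As a minimal sketch (the helper name `find_duplicates` and the sample entries are made up for illustration), the repeated entries could be listed like this:

```python
from collections import Counter


def find_duplicates(entries):
    """Return the entries that occur more than once, in first-seen order."""
    counts = Counter(entries)
    return [entry for entry, count in counts.items() if count > 1]


# made-up sample data in the "Artist, 'Title'" shape the scraper produces
sample = ["Artist A, 'Song 1'", "Artist B, 'Song 2'", "Artist A, 'Song 1'"]
print(find_duplicates(sample))
```

Calling `find_duplicates(songs)` right before writing the file would show which lines were scraped twice, which usually points at a page that was loaded more than once.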

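2. Prepare Spotify

Go to the Spotify Developer Dashboard and create an account (using your Spotify account). Then create an app and name it whatever you like.
In your app, click Settings and whitelist http://localhost:8888/callback
In your app, click "Users and Access" and add your Spotify account
Leave the tab open, we'll come back to it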

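3. Prepare your environment

You need Node.js, so make sure it is installed on your machine
Download this from Spotify's GitHub
Unzip it, cd into the folder, and run npm install
Go into the authorization_code folder and open app.js in an editor
Find var scope and append 'playlist-modify-public' to the string so that your app can access your Spotify playlists, see here
Now go back to your app in the Spotify Developer Dashboard; the client ID and client secret need to be copied into var client_id and var client_secret respectively (in the app.js file). var redirect_uri will be http://localhost:8888/callback - don't forget to save your changes.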

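4. Run the Spotify side

cd into the authorization_code folder and run app.js with node app.js (this is basically a server running on your PC)
Now, if it is running properly, go to http://localhost:8888 and authorize your Spotify account there
Copy the complete token from there, including the part that overflows; use inspect element to get it
In the Python script below, adjust the user_id and auth variables as well as the path to output_songs.txt (in the with open call) and run it. Songs that could not be found are printed at the end; search for those on Google. They are usually on Spotify too, but Google seems to have the better search algorithm (surprised Pikachu face).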
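
Before running the full script, it can help to confirm that the pasted token works and to look up the user name that user_id expects. The following is only a sketch: `bearer_header` and `fetch_user_id` are illustrative names, it uses the standard library instead of requests, and it assumes the token is accepted by the Web API's current-user endpoint (`/v1/me`), whose response carries the user name in its `id` field.

```python
import json
import urllib.request


def bearer_header(token):
    """Build the Authorization header, dropping whitespace and line breaks
    that can sneak in when the token is copied from the page."""
    return {"Authorization": "Bearer " + "".join(token.split())}


def fetch_user_id(token):
    """Ask Spotify which account the token belongs to.

    Raises urllib.error.HTTPError if the token is rejected (e.g. expired).
    """
    request = urllib.request.Request(
        "https://api.spotify.com/v1/me", headers=bearer_header(token)
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)["id"]
```

`fetch_user_id('YOUR_AUTH_KEY_FROM_LOCALHOST')` would then return the value to put into user_id, and an HTTP error at this point means a fresh token is needed.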

import requests
import re
import json

# this is NOT your display name, it's your user name!!
user_id = 'YOUR_USERNAME'
# paste your auth token from spotify; it can time out, then you have to get a new one,
# so don't panic if you get a bunch of responses in the 400s after some time
auth = {"Authorization": "Bearer YOUR_AUTH_KEY_FROM_LOCALHOST"}


playlist = []
err_log = []
base_url = 'https://api.spotify.com/v1'
search_method = '/search'

with open('/Users/main/Desktop/output_songs.txt', 'r') as file:
    songs = file.readlines()


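# queries spotify's search endpoint for every song and collects each track's uri
def query_song_uris():
    url = base_url + search_method

    for n, entry in enumerate(songs):
        entry = entry.strip()   # readlines() keeps the trailing newline

        # lines are expected to look like: Artist, 'Title'
        x = re.findall(r"'(.*)'", entry)   # greedy, so apostrophes inside a title survive
        if x:
            title = x[0]
            artist = entry[:len(entry) - len(title) - 2].rstrip(', ')
            query = f'{artist} {title}'
        else:
            query = entry   # unexpected format: search for the raw line

        # 'track:' and 'artist:' are not parameters of the search endpoint,
        # so title and artist go into q as plain search terms
        payload = {
            'q': query,
            'type': 'track',
            'limit': 1
        }

        try:
            r = requests.get(url, params=payload, headers=auth)
            print('\nquerying spotify;  ', r)

            dic = json.loads(r.content.decode('UTF-8'))
            track_uri = dic["tracks"]["items"][0]["uri"]

            playlist.append(track_uri)
            print(track_uri)

        except (requests.RequestException, KeyError, IndexError, ValueError):
            # no match, an expired token or a network problem: log it for a manual check
            err = f'\nNr. {(len(songs)-n)}: ' + f'{entry}'
            err_log.append(err)

    # the scraped list starts at number 500, so reverse it to begin the playlist with number 1
    playlist.reverse()
query_song_uris()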

# creates a playlist and returns the playlist id
def create_playlist():
    payload = {
        "name": "Rolling Stone: Top 500 (All Time)",
        "description": "music for old men xD with occasional hip hop appearances. just kidding"
    }

    url = base_url + f'/users/{user_id}/playlists'
    r = requests.post(url, headers=auth, json=payload)
    r.raise_for_status()   # fail early on a bad token or a wrong user_id

    c = r.content.decode('UTF-8')
    dic = json.loads(c)

    print(f'\n\ncreating playlist @{dic["id"]};  ', r)
    return dic["id"]


# adds the collected tracks to a new playlist
def add_to_playlist():

    playlist_id = create_playlist()
    url = base_url + f'/playlists/{playlist_id}/tracks'

    # the endpoint takes at most 100 uris per request, so send them in chunks;
    # an empty playlist sends no request at all
    for start in range(0, len(playlist), 100):
        p = playlist[start:start + 100]

        payload = {"uris": p}
        r = requests.post(url, headers=auth, json=payload)

        print(f'\nadding {len(p)} songs to playlist;  ', r)
add_to_playlist()


print('\n\ncheck your spotify :)')
print("\n\n\nthese tracks didn't make it, check manually:\n")
for line in err_log:
    print(line)
print('\n\n')
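
The title/artist split in query_song_uris relies on every line looking like `Artist, 'Title'`. As a sketch of a more defensive split (the names `ENTRY_PATTERN` and `split_entry` and the sample line are made up, and the line format itself is an assumption about the scraped file), a single anchored pattern can handle apostrophes in titles and commas in artist names, and report lines that do not fit:

```python
import re

# artist: everything up to the comma that precedes the opening quote
# title:  everything between the opening quote and the last quote on the line
ENTRY_PATTERN = re.compile(r"^(?P<artist>.+?),\s*'(?P<title>.*)'\s*$")


def split_entry(entry):
    """Split one line into (artist, title); return None if it does not fit the format."""
    match = ENTRY_PATTERN.match(entry.strip())
    if match is None:
        return None
    return match.group("artist"), match.group("title")


print(split_entry("Some Artist, 'Don't Stop'\n"))
```

Lines for which `split_entry` returns None could go straight into err_log instead of being sent to the search endpoint.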

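Done

If you don't want to run the code yourself, here is the playlist: https://open.spotify.com/playlist/5fdLKYNFlA4XSvhEl36KXS

If you run into problems, everything from step 2 onward is also described here in the Web API quick start, or more generally in the Web API docs.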

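About Apple Music

So Apple seems to be pretty closed off (surprise, haha). But I found that you can query the iTunes Store. The response also contains a direct link to the song on Apple Music. You might be able to go from there.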

Get ISRC code from iTunes Search API (Apple music)
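
As a rough sketch of what such a lookup could look like: the helper names below are made up, and the code assumes the iTunes Search API's `term`, `entity` and `limit` parameters and a JSON response with a `results` list whose items carry a `trackViewUrl`. It only builds the request URL and reads a response that has already been parsed into a dict, so fetching the URL is left to whatever HTTP client you prefer.

```python
import urllib.parse

ITUNES_SEARCH_URL = "https://itunes.apple.com/search"


def build_itunes_query(artist, title):
    """Build a search URL that asks for the single best-matching song."""
    params = {"term": f"{artist} {title}", "entity": "song", "limit": 1}
    return ITUNES_SEARCH_URL + "?" + urllib.parse.urlencode(params)


def first_track_link(response):
    """Return the link of the first result in a parsed response, or None if there is none."""
    results = response.get("results", [])
    return results[0].get("trackViewUrl") if results else None


print(build_itunes_query("Some Artist", "Some Song"))
```

Songs for which `first_track_link` returns None would need a manual search, the same way the Spotify script collects its misses in err_log.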

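PS: Admittedly, regex is witchcraft, but you all have my back here

【Discussion】:

Thank you so much for all the time and level of detail you put into your answer!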