【Question title】: Difficulties with web scraping
【Posted】: 2022-01-21 17:21:21
【Question】:

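I just came across an article called The 500 Greatest Songs of All Time and thought, "Oh, that's cool, I bet they also made a Spotify/Apple Music playlist I can follow." Well... they didn't.

Long story short, I would like to know whether it is possible to 1) scrape the website to extract the songs, and 2) then do some kind of bulk upload to Spotify to create the playlist.

The song titles and artists on the website are structured as follows: Website screenshot. I have already tried scraping the site with the importxml() formula in Google Sheets, but without success.

I understand that the scraping part is easier than the rest, and since I am new to programming, I would be happy to achieve even part of this. I believe the task can be done easily in Python.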

【Comments】:

    Tags: html web-scraping xpath


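    【Answer 1】:

    I feel that explaining everything is beyond the scope here, so I tried to comment the code well enough.

    1. Scraping the songs

    I used python3 and selenium; their website doesn't block it. If necessary, adjust your chromedriver path as well as the output path for the .txt file at the bottom. Once the script has finished and you have the .txt file, you can close the browser.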

    import time
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.chrome.service import Service
    
    s = Service(r'/Users/main/Desktop/chromedriver')
    driver = webdriver.Chrome(service=s)
    
    # just setting some vars; I used XPath because that's what I know
    top_500 = 'https://www.rollingstone.com/music/music-lists/best-songs-of-all-time-1224767/'
    cookie_button_xpath = "//button[@id='onetrust-accept-btn-handler']"
    div_containing_links_xpath = "//div[@id='pmc-gallery-list-nav-bar-render']//child::a"
    song_names_xpath = "//article[@class='c-gallery-vertical-album']/child::h2"
    
    links = []
    songs = []
    
    
    driver.get(top_500)
    
    
    # accept cookies, give time to load
    time.sleep(3)
    cookie_btn = driver.find_element(By.XPATH, cookie_button_xpath)
    cookie_btn.click()
    time.sleep(1)
    
    
    # extracting all the links since there are only 50 songs per page
    links_to_next_pages = driver.find_elements(By.XPATH, div_containing_links_xpath)
    
    for element in links_to_next_pages:
        l = element.get_attribute('href')
        links.append(l)
    
    
    # extracting the songs, then going to the next page and so on until we hit 500
    counter = 1         # we're starting with 1 here since links[0] is the current page we are already on
    
    while True:
        song_elements = driver.find_elements(By.XPATH, song_names_xpath)
    
        for element in song_elements:
            songs.append(element.text)
        
        # stop at 500 songs, or when there are no more pages left to visit
        if len(songs) >= 500 or counter >= len(links):
            break
    
        driver.get(links[counter])
        counter += 1
    
        time.sleep(2)
    
    
    # verify that there are no duplicates, if there were, something would be off
    if len(songs) != len( set(songs) ):
        print('you f***** up')
    else:
        print('seems fine')
    
    
    with open('/Users/main/Desktop/output_songs.txt', 'w') as file:
        file.writelines(line + '\n' for line in songs)
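
The duplicate check in the script above only reports that something is off, not which entries are affected. As a minimal sketch (the helper name `find_duplicates` and the sample entries are made up for illustration), the repeated entries could be listed like this before writing the file:

```python
from collections import Counter


def find_duplicates(titles):
    """Return the entries that occur more than once, with their counts."""
    counts = Counter(titles)
    return {title: n for title, n in counts.items() if n > 1}


# Made-up sample data, not taken from the real list
sample = ["Artist A, 'Song 1'", "Artist B, 'Song 2'", "Artist A, 'Song 1'"]
print(find_duplicates(sample))
```

An empty result means every scraped entry is unique, which matches the 'seems fine' branch of the original check.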
    

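    2. Preparing Spotify

    • Go to the Spotify Developer Dashboard and create an account (using your Spotify account). Then create an app and name it whatever you like.
    • In your app, click Settings and whitelist http://localhost:8888/callback
    • In your app, click "Users and Access" and add your Spotify account
    • Leave the tab open, we will come back to it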

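    3. Preparing your environment

    • You need Node.js, so make sure it is installed on your machine

    • Download this from Spotify's GitHub

    • Unzip it, cd into the folder and run npm install

    • Go into the authorization_code folder and open app.js in your editor

    • Find var scope and append 'playlist-modify-public' to the string, so that your app is allowed to access your Spotify playlists, see here

    • Now go back to your app in the Spotify Developer Dashboard: the Client ID and Client Secret need to be copied into var client_id and var client_secret respectively (in the app.js file). var redirect_uri will be http://localhost:8888/callback - don't forget to save your changes.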

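    4. Running the Spotify side

    • cd into the authorization_code folder and run app.js with node app.js (this is basically a server running on your PC)

    • If it runs fine, go to http://localhost:8888 and authorize your Spotify account there

    • Copy the complete token from that page, including the part that overflows; use inspect element to get it

    • In the Python script below, adjust the user_id and auth variables as well as the path to output_songs.txt (in the with open), then run it. Songs that could not be found are printed at the end; look those up via Google. They are usually on Spotify as well, but Google seems to have the better search algorithm (surprised Pikachu face).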

    import requests
    import re
    import json
    
    # this is NOT your display name, it's your user name!!
    user_id = 'YOUR_USERNAME'
    # paste your auth token from spotify; it can time out and then you have to get a new one, so don't panic if you get a bunch of responses in the 400s after some time
    auth = {"Authorization": "Bearer YOUR_AUTH_KEY_FROM_LOCALHOST"}
    
    
    playlist = []
    err_log = []
    base_url = 'https://api.spotify.com/v1'
    search_method = '/search'
    
    with open('/Users/main/Desktop/output_songs.txt', 'r') as file:
        songs = file.readlines()
    
    
    # queries Spotify for every entry and appends the track's Spotify URI to the playlist list
    def query_song_uris():
        url = base_url + search_method
    
        for n, entry in enumerate(songs):
            try:
                # each entry is expected to contain the title in single quotes
                x = re.findall(r"'([^']*)'", entry)
                title_len = len(entry) - len(x[0]) - 4
    
                title = x[0]
                artist = entry[:title_len]
    
                payload = {
                    'q': (entry),
                    'track:': (title),
                    'artist:': (artist),
                    'type': 'track',
                    'limit': 1
                }
    
                r = requests.get(url, params=payload, headers=auth)
                print('\nquerying spotify;  ', r)
    
                dic = r.json()
                track_uri = dic["tracks"]["items"][0]["uri"]
    
                playlist.append(track_uri)
                print(track_uri)
    
            except Exception:
                # anything that fails (no quoted title, no search hit, expired token) is logged for a manual check
                err = f'\nNr. {(len(songs)-n)}: ' + f'{entry}'
                err_log.append(err)
    
        playlist.reverse()
    query_song_uris()
    
    # creates a playlist and returns playlist id
    def create_playlist():
        payload = {
                    "name": "Rolling Stone: Top 500 (All Time)",
                    "description": "music for old men xD with occasional hip hop appearances. just kidding"
                }
    
        url = base_url + f'/users/{user_id}/playlists'
        r = requests.post(url, headers=auth, json=payload)
        
        c = r.content.decode('UTF-8')
        dic = json.loads(c)
    
        print(f'\n\ncreating playlist @{dic["id"]};  ', r)
        return dic["id"]
    
    
    def add_to_playlist():
    
        playlist_id = create_playlist()
    
        while True:
    
            if len(playlist) > 100:
                p = playlist[:100]
            else:
                p = playlist
    
            payload = {"uris": (p)}
    
            url = base_url + f'/playlists/{playlist_id}/tracks'
            r = requests.post(url, headers=auth, json=payload)
    
            print(f'\nadding {len(p)} songs to playlist;  ', r)
    
            del playlist[ : len(p) ]
    
            if len(playlist) == 0:
                break
    add_to_playlist()
    
    
    print('\n\ncheck your spotify :)')
    print("\n\n\nthese tracks didn't make it, check manually:\n")
    for line in err_log:
        print(line)
    print('\n\n')
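
The search step in the script above depends on splitting each line into artist and title. The sketch below is one possible way to do that, under the assumption that every entry looks like `Artist, 'Title'` with straight single quotes (the scraped text may use different quote characters). The helper names `split_entry` and `build_search_params` are hypothetical and not part of the original script. The second helper puts Spotify's `track:` and `artist:` field filters inside the `q` parameter, which is stricter than searching for the raw line and may find fewer matches.

```python
import re


def split_entry(entry):
    """Split "Artist, 'Title'" into (artist, title); return None if no quoted title is found."""
    text = entry.strip()
    # look for a comma followed by a quoted title that runs to the end of the line
    match = re.search(r",\s*'(.*)'\s*$", text)
    if match is None:
        return None
    artist = text[:match.start()].strip()
    title = match.group(1)
    return artist, title


def build_search_params(artist, title):
    """Build query parameters for a track search, with field filters inside q."""
    return {
        'q': f'track:{title} artist:{artist}',
        'type': 'track',
        'limit': 1,
    }


parsed = split_entry("Guns N' Roses, 'Sweet Child O' Mine'\n")
if parsed is not None:
    print(build_search_params(*parsed))
```

Entries for which `split_entry` returns None could go straight into the error log instead of being sent to the API.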
    

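    Done

    If you don't want to run the code yourself, here is the playlist: https://open.spotify.com/playlist/5fdLKYNFlA4XSvhEl36KXS

    If you run into problems, everything from step 2 onward is also described here in the Web API quick start, or more generally in the web API docs.

    About Apple Music

    Apple seems to be pretty closed off (surprise, haha). But I found out that you can query the iTunes Store. The response also contains a direct link to the song on Apple Music. You might be able to go from there.

    Get ISRC code from iTunes Search API (Apple music)

    PS: Admittedly, regex is witchcraft, but you all have my back here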
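
As a rough sketch of the iTunes Store query mentioned above, the following builds a request URL for the public iTunes Search API. The function name `build_itunes_search_url` and the choice of combining artist and title into a single search term are assumptions for illustration; the answer itself does not show any Apple-related code.

```python
from urllib.parse import urlencode

ITUNES_SEARCH_URL = 'https://itunes.apple.com/search'


def build_itunes_search_url(artist, title, limit=1):
    """Build a search URL that looks for a song by artist and title."""
    params = {
        'term': f'{artist} {title}',  # free-text search term
        'entity': 'song',             # restrict results to songs
        'limit': limit,
    }
    return f'{ITUNES_SEARCH_URL}?{urlencode(params)}'


print(build_itunes_search_url('Aretha Franklin', 'Respect'))
```

Fetching that URL (for example with requests.get) should return JSON with a results list; each result is expected to carry a link to the track, which is probably the direct link the answer refers to.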

    【Comments】:

    • Thank you so much for all the time and the level of detail you put into your answer!