Web Scraping Rap Lyrics on Rap Genius w/ Python: Answers

【Question title】: Web Scraping Rap lyrics on Rap Genius w/ Python
【Posted】: 2014-09-12 11:17:30
【Question】:

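I'm fairly new to coding, and I've been trying to scrape Andre 3000's lyrics from Rap Genius, http://genius.com/artists/Andre-3000, using Beautiful Soup (a Python library for pulling data out of HTML and XML files). My end goal is to have the data in string format. Here is what I have so far: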

from bs4 import BeautifulSoup
from urllib2 import urlopen

artist_url = "http://rapgenius.com/artists/Andre-3000"

def get_song_links(url):
    html = urlopen(url).read()
    # print html 
    soup = BeautifulSoup(html, "lxml")
    container = soup.find("div", "container")
    song_links = [BASE_URL + dd.a["href"] for dd in container.findAll("dd")]

    print song_links

get_song_links(artist_url)
for link in soup.find_all('a'):
    print(link.get('href'))

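So I need help with the rest of the code. How do I get his lyrics into string format? And then how do I use the Natural Language Toolkit (NLTK) to tokenize the sentences and words?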

【Comments】:

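This is a great idea. You're going to generate new lyrics, right? I'm thinking of doing the same thing for Tupac. There has to be a tool that can generate an artist's voice from their existing songs. I mean, if it works at the word level, all of the newly generated lyrics consist of words the artist has already sung, so the sound waves would need to be sampled and warped to make the generated voice sound the way you want.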

Tags: python web-scraping beautifulsoup html-parsing nltk

【Solution 1】:

Just because you can scrape a site doesn't mean you should. Instead, you can use the Genius API: just create an access token on the Genius API site.

import lyricsgenius as genius  # client for the Genius API

api = genius.Genius('youraccesstokenhere12345678901234567890isreallylongiknow')
artist = api.search_artist('The artist name here')
# You can change the parameters according to your needs. I don't recommend
# using the saved file directly: it contains a lot of data you might not need
# and takes more time to clean.
aux = artist.save_lyrics(format='json', filename='artist.txt', overwrite=True, skip_duplicates=True, verbose=True)

# In this case, for example, I just want the title and lyrics.
titles = [song['title'] for song in aux['songs']]
lyrics = [song['lyrics'] for song in aux['songs']]
thingstosave = []
for title, lyric in zip(titles, lyrics):  # pair each title with its lyrics
    thingstosave.append(title)
    thingstosave.append(lyric)
with open("C:/whateverfolder/alllyrics.txt", "w") as output:
    output.write(str(thingstosave))
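Since the question asks for the lyrics as a string, here is a minimal sketch of that last step. It assumes the song data is available as a dict with a 'songs' list whose entries carry 'title' and 'lyrics' keys, which is how the code above accesses it. The sample data and the helper name songs_to_text are made up for illustration.

```python
# Hypothetical stand-in for the song data; real data would come from the API.
data = {
    "songs": [
        {"title": "Song A", "lyrics": "line one\nline two"},
        {"title": "Song B", "lyrics": "line three"},
    ]
}


def songs_to_text(data):
    """Join every song into one string: title, then lyrics, blank line between songs."""
    blocks = []
    for song in data.get("songs", []):
        title = song.get("title", "")
        lyrics = song.get("lyrics") or ""  # tolerate missing or None lyrics
        blocks.append(f"{title}\n{lyrics}")
    return "\n\n".join(blocks)


all_lyrics = songs_to_text(data)
print(all_lyrics)
```

Writing all_lyrics to a file directly, rather than str() of a list, keeps the line breaks readable and avoids list brackets and quote characters in the output.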

【Comments】:

【Solution 2】:

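GitHub / jashanj0tsingh / LyricsScraper.py provides basic scraping of lyrics from genius.com into a text file, where each line represents a song. It takes the artist's name as input. The resulting text file can then easily be fed to your custom nltk or general-purpose parser to do whatever you want.

The code is as follows: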

# A simple script to scrape lyrics from genius.com based on artist name.

import re
import requests
import time
import codecs

from bs4 import BeautifulSoup
from selenium import webdriver

mybrowser = webdriver.Chrome(r"path\to\chromedriver\binary") # Path to the web driver binary for the browser you want to automate (raw string so the backslashes are not read as escapes).

user_input = input("Enter Artist Name = ").replace(" ","+") # User_Input = Artist Name
base_url = "https://genius.com/search?q="+user_input # Append User_Input to search query
mybrowser.get(base_url) # Open in browser

t_sec = time.time() + 60*20 # seconds*minutes
while(time.time()<t_sec): # Reach the bottom of the page as per time for now TODO: Better condition to check end of page.
    mybrowser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    html = mybrowser.page_source
    soup = BeautifulSoup(html, "html.parser")
    time.sleep(5)

pattern = re.compile(r"[\S]+-lyrics$") # Filter http links that end with "lyrics".
pattern2 = re.compile(r"\[(.*?)\]") # Remove unnecessary text from the lyrics such as [Intro], [Chorus] etc.

with codecs.open('lyrics.txt','a','utf-8-sig') as myfile:
    for link in soup.find_all('a',href=True):
        if pattern.match(link['href']):
            f = requests.get(link['href'])
            lyricsoup = BeautifulSoup(f.content,"html.parser")
            #lyrics = lyricsoup.find("lyrics").get_text().replace("\n","") # Each song in one line.
            lyrics = lyricsoup.find("lyrics").get_text() # Line by Line
            lyrics = re.sub(pattern2, "", lyrics)
            myfile.write(lyrics+"\n")
mybrowser.close()
myfile.close()
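The two regular expressions in the script can be tried out without a browser or network access. The sketch below reuses the same two patterns on made-up sample data; the URLs, the sample lyrics, and the helper name strip_section_tags are assumptions for illustration only.

```python
import re

LYRICS_LINK = re.compile(r"\S+-lyrics$")  # links that end with "-lyrics"
SECTION_TAG = re.compile(r"\[.*?\]")      # bracketed markers such as [Intro]


def strip_section_tags(lyrics):
    """Remove bracketed markers, then drop the blank lines they leave behind."""
    without_tags = SECTION_TAG.sub("", lyrics)
    lines = [line.strip() for line in without_tags.splitlines()]
    return "\n".join(line for line in lines if line)


# Hypothetical links, shaped like the ones the script filters.
links = [
    "https://genius.com/Artist-song-lyrics",
    "https://genius.com/artists/Artist",
]
song_links = [link for link in links if LYRICS_LINK.match(link)]
print(song_links)

sample = "[Intro]\nHello there\n\n[Chorus: Someone]\nSing along [x2]\n"
cleaned = strip_section_tags(sample)
print(cleaned)
```

Note that the bracket pattern removes every bracketed span, including inline ones such as [x2], which may or may not be what you want for later analysis.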

【Comments】:

Try adding some context to your answer
This just hangs on the artist's page.

【Solution 3】:

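Hope this is still relevant! I'm doing the same thing with Eminem's lyrics, but from Lyrics.com. Does it have to be from Rap Genius? I found Lyrics.com easier to scrape.

To get Andre 3000, just change the code accordingly.

Here is my code; it gets the song links, then scrapes those pages for the lyrics and appends the lyrics to a list: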

import re
import requests
import nltk
from bs4 import BeautifulSoup

url = 'http://www.lyrics.com/eminem'
r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')  # name the parser explicitly
gdata = soup.find_all('div',{'class':'row'})

eminemLyrics = []

for item in gdata:
    title = item.find_all('a',{'itemprop':'name'})[0].text
    lyricsdotcom = 'http://www.lyrics.com'
    for link in item('a'):
        try:
            lyriclink = lyricsdotcom+link.get('href')
            req = requests.get(lyriclink)
            lyricsoup = BeautifulSoup(req.content, 'html.parser')
            lyricdata = lyricsoup.find_all('div',{'id':re.compile('lyric_space|lyrics')})[0].text
            eminemLyrics.append([title,lyricdata])
            print(title)
            print(lyricdata)
            print()
        except Exception:
            # Skip links that do not lead to a lyrics page.
            pass

This gives you the lyrics in a list. To print all the titles:

titles = [i[0] for i in eminemLyrics]
print(titles)

To get the index of a specific song:

titles.index('Cleaning out My Closet')
120

To tokenize and tag the song, plug that index (120) into:

song = nltk.word_tokenize(eminemLyrics[120][1])
nltk.pos_tag(song)
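The index 120 above is specific to one scrape and will shift whenever the song list changes. As a small sketch, a lookup by title avoids hard-coding it. The helper name find_lyrics and the sample data are hypothetical; the [title, lyrics] pair layout matches what the code above appends to its list.

```python
def find_lyrics(songs, title):
    """Return the lyrics of the first song whose title matches, ignoring case."""
    wanted = title.casefold()
    for song_title, lyrics in songs:
        if song_title.casefold() == wanted:
            return lyrics
    return None  # no song with that title


# Hypothetical data in the same [title, lyrics] layout as the scraped list.
songs = [["Song A", "first lyrics"], ["Song B", "second lyrics"]]

print(find_lyrics(songs, "song b"))
print(find_lyrics(songs, "Missing Song"))
```

The returned string can then be passed to nltk.word_tokenize in place of the indexed list entry.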

【Comments】:

【Solution 4】:

Here is an example of how to grab all of the song links on the page, follow them, and get the lyrics:

try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin  # Python 2
from bs4 import BeautifulSoup
import requests


BASE_URL = "http://genius.com"
artist_url = "http://genius.com/artists/Andre-3000/"

response = requests.get(artist_url, headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36'})

soup = BeautifulSoup(response.text, "lxml")
for song_link in soup.select('ul.song_list > li > a'):
    link = urljoin(BASE_URL, song_link['href'])
    response = requests.get(link)
    soup = BeautifulSoup(response.text, "lxml")
    lyrics = soup.find('div', class_='lyrics').text.strip()

    # tokenize `lyrics` with nltk

Note that the requests module is used here. Also note that the User-Agent header is required, since the site returns 403 - Forbidden without it.
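To see what urljoin and the User-Agent header contribute, here is a small offline sketch. It uses the standard library's urllib.request.Request instead of requests, purely so it runs without third-party packages, and it sends nothing over the network. The shortened User-Agent value, the helper name build_song_request, and the example path are assumptions for illustration.

```python
from urllib.parse import urljoin
from urllib.request import Request

BASE_URL = "http://genius.com"
HEADERS = {"User-Agent": "Mozilla/5.0"}  # assumed value; any browser-like string


def build_song_request(href):
    """Resolve a possibly relative href against BASE_URL and attach the headers."""
    return Request(urljoin(BASE_URL, href), headers=HEADERS)


# Hypothetical relative link, as it might appear in a song list.
req = build_song_request("/Some-artist-some-song-lyrics")
print(req.full_url)
print(req.get_header("User-agent"))  # urllib stores header names capitalized this way
```

Keeping the headers in one shared dict makes it easy to send the same User-Agent on every request, including the per-song requests inside the loop.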

【Comments】:

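This is great, but when I try to run it I get this error: "ImportError: No module named bs4"
@Ibrewster You need to install beautifulsoup4: run pip install beautifulsoup4.
Yes, I have installed bs4, but it's not working. So I tried reinstalling it, and it's still not working.
That's because you're using Python 3. In that case, use pip3 install beautifulsoup4.
I copy-pasted it into a Jupyter cell, and when I ran it nothing seemed to happen.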

【Solution 5】:

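First, for each link, you need to download the page and parse it with BeautifulSoup. Then look for a distinguishing attribute on that page that separates the lyrics from the rest of the page content. I found this helpful. Then run .find_all on the lyrics page content to get all of the lyric lines. For each line, you can call .get_text() to get the text of that lyric line.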
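The idea of keying on a distinguishing attribute can be sketched without any third-party package. The example below is not BeautifulSoup: it swaps in the standard library's html.parser so it runs anywhere, and it assumes the lyrics sit in a div with class "lyrics" (the class name used in Solution 4). The sample HTML and the class name LyricsExtractor are made up.

```python
from html.parser import HTMLParser


class LyricsExtractor(HTMLParser):
    """Collect the text found inside any <div class="lyrics"> element."""

    def __init__(self):
        super().__init__()
        self.depth = 0   # greater than 0 while inside a lyrics div
        self.parts = []  # text fragments collected so far

    def handle_starttag(self, tag, attrs):
        if self.depth:
            if tag == "div":
                self.depth += 1  # nested div inside the lyrics div
        elif tag == "div" and "lyrics" in (dict(attrs).get("class") or "").split():
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag == "div":
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)


# Hypothetical page: the lyrics div is told apart from the rest by its class.
sample_html = """
<html><body>
<div class="header">Site menu</div>
<div class="lyrics"><p>First line<br>
Second line</p></div>
</body></html>
"""

parser = LyricsExtractor()
parser.feed(sample_html)
parser.close()
lyric_text = "".join(parser.parts).strip()
print(lyric_text)
```

With BeautifulSoup the same selection is a single find call on the class, so this version is mainly useful for seeing what that call does under the hood.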

As for NLTK, once it is installed you can import it and parse the sentences like so:

from nltk.tokenize import word_tokenize, sent_tokenize
words = [word_tokenize(t) for t in sent_tokenize(lyric_text)]

This will give you a list of all the words in each sentence.
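Lyrics rarely contain sentence-ending punctuation, so a sentence tokenizer may return a whole verse as one "sentence". As a rough alternative sketch, and not NLTK itself, the snippet below treats each lyric line as a sentence and splits words with a simple regular expression. The pattern, the helper name tokenize_lines, and the sample text are assumptions; NLTK's word_tokenize handles punctuation and contractions differently.

```python
import re

# A word is a run of letters or digits, optionally followed by an apostrophe part.
WORD = re.compile(r"[A-Za-z0-9]+(?:'[A-Za-z]+)?")


def tokenize_lines(lyric_text):
    """Return one list of words per non-empty line of the lyrics."""
    return [WORD.findall(line) for line in lyric_text.splitlines() if line.strip()]


lyric_text = "Don't stop now, it's fine\n\nKeep going"  # made-up sample text
words = tokenize_lines(lyric_text)
print(words)
```

The result has the same nested shape as the NLTK version above, a list of word lists, so later code can consume either one.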

【Comments】: