Scraping table values in Python: answers

【Question title】: Scraping table values in python
【Posted】: 2018-07-02 18:57:31
【Question】:

New to SO, and I'm having some trouble scraping a table from a website with BeautifulSoup.

The table's source HTML looks like this (repeated ad nauseam for every artist/song/album):

<td class="subject">
    <p title="song">song</p>
    <p class="singer" title="artist | album">artist<span class="bar">|</span>album</p>
</td>

I'm trying to create an output file with all of this information. The code I'm using is:

with open('output.txt', 'w', encoding='utf-8') as f:
    for tr in soup.find_all('tr')[1:]:
        tds = tr.find_all('td')
        f.write("Information: %s" % tds[3].text)

This gives me output like this:

Information: 
song
singer | album

How can I change this so that each value ends up on its own line, properly separated and labeled? Ideally, my output would look like this:

Song Title: song
Artist: singer
Album Name: album

【Comments】:

Tags: python html web-scraping beautifulsoup

【Solution 1】:

You can combine regular expressions with BeautifulSoup:

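from bs4 import BeautifulSoup as soup
import re

s = """
<td class="subject">
<p title="song">song</p>
<p class="singer" title="artist | album">artist<span class="bar">|</span>album</p>
</td>
"""
s = soup(s, 'lxml')

# The bare | inside <span class="bar">|</span> also acts as alternation, so the
# pattern has three branches, one per field: song, artist, album.
pattern = 'title="song">(.*?)</p>|album">(.*?)<span class="bar">|</span>(.*?)</p>'
cells = [re.findall(pattern, str(i)) for i in s.find_all('td', {'class': 'subject'})]
# Only the first cell is used; each match is a 3-tuple with one non-empty group.
data = [list(filter(None, c))[0] for c in cells[0]]
for i in zip(['Song', 'Artist', 'Album'], data):
    print('{}: {}'.format(*i))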

Output:

Song: song
Artist: artist
Album: album
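
The regex approach above reads only the first matching cell and depends on the exact attribute text in the markup. As a rough alternative, here is a minimal sketch that walks the tags instead. It assumes every cell has the same two-`<p>` layout as the sample, with plain text on both sides of the `span.bar` separator. `parse_cell` is a made-up helper name, the `<table>`/`<tr>` wrapper is added only for illustration, and `html.parser` is swapped in for `lxml` just to avoid the extra dependency.

```python
from bs4 import BeautifulSoup

html = """
<table>
<tr><td class="subject">
    <p title="song">song</p>
    <p class="singer" title="artist | album">artist<span class="bar">|</span>album</p>
</td></tr>
</table>
"""


def parse_cell(td):
    """Return (song, artist, album) for one <td class="subject"> cell.

    Hypothetical helper: assumes the first <p> holds the song and the
    <p class="singer"> holds 'artist', the span.bar separator, then 'album'.
    """
    song_p = td.find('p')
    bar = td.find('p', class_='singer').find('span', class_='bar')
    # The text nodes directly before and after the separator span
    artist = bar.previous_sibling.strip()
    album = bar.next_sibling.strip()
    return song_p.get_text(strip=True), artist, album


soup = BeautifulSoup(html, 'html.parser')
rows = [parse_cell(td) for td in soup.find_all('td', class_='subject')]
for song, artist, album in rows:
    print('Song: {}\nArtist: {}\nAlbum: {}'.format(song, artist, album))
```

Because this loops over every `td.subject`, it should pick up all rows in the table rather than just the first one.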

【Comments】:

【Solution 2】:

I think you're close; you just need to process the contents of tds. I would do the following:

from bs4 import BeautifulSoup

html = """<td class="subject">
    <p title="song">song</p>
    <p class="singer" title="artist | album">artist<span class="bar">|</span>album</p>
</td>"""

# Parse only after html has been defined
b = BeautifulSoup(html, 'lxml')

tds = b.find_all('td')
data = tds[0]

# The cell text has one line per <p>; strip the indentation carried over from the markup
t = data.text.split('\n')
song = t[1].strip()
artist_album = t[2].split('|')
artist = artist_album[0].strip()
album = artist_album[1].strip()
print("Song:", song)
print("Artist:", artist)
print("Album:", album)

This should give you:

Song: song
Artist: artist
Album: album
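
To get from here to the output file the question asks for, something along these lines might work. It is only a sketch: it assumes each cell's `title` attributes carry the same values as the visible text (as in the sample markup), the two sample rows and the header row are made up for illustration, and `write_tracks` is a hypothetical helper name. It uses `html.parser` instead of `lxml` so that nothing beyond bs4 is needed.

```python
import io
from bs4 import BeautifulSoup

# Made-up sample rows that follow the markup shown in the question
html = """
<table>
<tr><th>Info</th></tr>
<tr><td class="subject">
    <p title="song A">song A</p>
    <p class="singer" title="artist A | album A">artist A<span class="bar">|</span>album A</p>
</td></tr>
<tr><td class="subject">
    <p title="song B">song B</p>
    <p class="singer" title="artist B | album B">artist B<span class="bar">|</span>album B</p>
</td></tr>
</table>
"""


def write_tracks(soup, out):
    """Write one three-line record per td.subject cell to the text stream `out`."""
    for td in soup.find_all('td', class_='subject'):
        song = td.find('p')['title']
        singer = td.find('p', class_='singer')['title']
        # partition splits on the first '|' only, giving artist and album
        artist, _, album = singer.partition('|')
        out.write('Song Title: %s\n' % song.strip())
        out.write('Artist: %s\n' % artist.strip())
        out.write('Album Name: %s\n\n' % album.strip())


soup = BeautifulSoup(html, 'html.parser')
buffer = io.StringIO()
write_tracks(soup, buffer)
print(buffer.getvalue())
```

To write to disk instead of an in-memory buffer, pass an open file: `with open('output.txt', 'w', encoding='utf-8') as f: write_tracks(soup, f)`.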

【Comments】: