【Question Title】: Scraping table values in Python
【Posted】: 2018-07-02 18:57:31
【Question】:

I'm new to SO and am having some difficulty scraping a table from a website with BeautifulSoup.

The table's source HTML looks like this (repeated ad nauseam for every artist/song/album):

<td class="subject">
    <p title="song">song</p>
    <p class="singer" title="artist | album">artist<span class="bar">|</span>album</p>
</td>

I'm trying to create an output file containing all of this information. The code I'm using is:

with open('output.txt', 'w', encoding='utf-8') as f:
    for tr in soup.find_all('tr')[1:]:
        tds = tr.find_all('td')
        f.write("Information: %s" % tds[3].text)

This gives me output like this:

Information: 
song
singer | album

How can I change this so that each field ends up on a single line and is separated properly? Ideally, my output would look like this:

Song Title: song
Artist: singer
Album Name: album

【Comments】:

    Tags: python html web-scraping beautifulsoup


    【Solution 1】:

    You can use a regular expression together with BeautifulSoup:

    from bs4 import BeautifulSoup as soup
    import re
    s = """
    <td class="subject">
    <p title="song">song</p>
    <p class="singer" title="artist | album">artist<span class="bar">|</span>album</p>
     </td>
    """
    s = soup(s, 'lxml')
    
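    # The unescaped "|" inside the pattern acts as regex alternation, so every
    # match fills exactly one of the three groups; filter(None, ...) picks it out.
    # The literals title="song" and album"> match this sample markup only.
    pattern = r'title="song">(.*?)</p>|album">(.*?)<span class="bar">|</span>(.*?)</p>'
    cell = s.find_all('td', {'class': 'subject'})[0]
    data = [list(filter(None, c))[0] for c in re.findall(pattern, str(cell))]
    for i in zip(['Song', 'Artist', 'Album'], data):
        print('{}: {}'.format(*i))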
    

    Output:

    Song: song
    Artist: artist
    Album: album
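
The same three values can probably be read without a regular expression by navigating the tags directly. The following is a minimal sketch, not the answerer's method. It assumes that the `title` attribute of the `singer` paragraph always has the form "artist | album", as in the sample. The helper name `parse_subject_cell` is hypothetical, and the standard-library `html.parser` backend is used here instead of `lxml`.

```python
from bs4 import BeautifulSoup

html = """
<td class="subject">
    <p title="song">song</p>
    <p class="singer" title="artist | album">artist<span class="bar">|</span>album</p>
</td>
"""


def parse_subject_cell(td):
    """Return (song, artist, album) for one <td class="subject"> cell.

    Hypothetical helper: assumes the first <p> holds the song and the
    <p class="singer"> title attribute looks like "artist | album".
    """
    song = td.find("p").get_text(strip=True)
    singer = td.find("p", class_="singer")
    # Split the title attribute on the first "|" and trim the padding.
    artist, _, album = singer["title"].partition("|")
    return song, artist.strip(), album.strip()


soup = BeautifulSoup(html, "html.parser")
cell = soup.find("td", class_="subject")
song, artist, album = parse_subject_cell(cell)
print("Song:", song)
print("Artist:", artist)
print("Album:", album)
```

Because the split uses only the first "|", an album name that itself contains "|" would stay intact, while an artist name containing "|" would be cut short.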
    

    【Comments】:

      【Solution 2】:

      I think you were close; you just need to process the contents of tds. I would do the following:

      from bs4 import BeautifulSoup

      html = """<td class="subject">
          <p title="song">song</p>
          <p class="singer" title="artist | album">artist<span class="bar">|</span>album</p>
      </td>"""

      # Parse only after html has been defined.
      b = BeautifulSoup(html, 'lxml')

      tds = b.find_all('td')
      data = tds[0]

      # The cell text has one line per <p>, so split on newlines;
      # strip() removes the indentation carried over from the markup.
      t = data.text.split('\n')
      song = t[1].strip()
      artist_album = t[2].split('|')
      artist = artist_album[0].strip()
      album = artist_album[1].strip()
      print("Song:", song)
      print("Artist:", artist)
      print("Album:", album)
      

      This should give you:

      Song: song
      Artist: artist
      Album: album
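
To connect this back to the question, the per-cell parsing can be applied to every row and written out in the requested "Song Title / Artist / Album Name" layout. The sketch below makes several assumptions: the two-row table is invented sample data, `write_records` is a hypothetical helper, each record is followed by a blank line, and cells are located by `class="subject"` rather than by the `tds[3]` index from the question, since the real column layout is not shown.

```python
import io

from bs4 import BeautifulSoup

# Invented sample data; the real page's rows and columns may differ.
html = """
<table>
  <tr><th>Info</th></tr>
  <tr><td class="subject">
    <p title="song A">song A</p>
    <p class="singer" title="artist A | album A">artist A<span class="bar">|</span>album A</p>
  </td></tr>
  <tr><td class="subject">
    <p title="song B">song B</p>
    <p class="singer" title="artist B | album B">artist B<span class="bar">|</span>album B</p>
  </td></tr>
</table>
"""


def write_records(soup, out):
    """Write one three-line record per subject cell to a file-like object."""
    for td in soup.find_all("td", class_="subject"):
        paragraphs = td.find_all("p")
        if len(paragraphs) < 2:
            continue  # skip cells that do not match the expected shape
        song = paragraphs[0].get_text(strip=True)
        # The second <p> reads "artist|album" once the tags are removed.
        artist, _, album = paragraphs[1].get_text().partition("|")
        out.write("Song Title: %s\n" % song)
        out.write("Artist: %s\n" % artist.strip())
        out.write("Album Name: %s\n\n" % album.strip())


soup = BeautifulSoup(html, "html.parser")
buffer = io.StringIO()
write_records(soup, buffer)
print(buffer.getvalue())
```

An in-memory buffer keeps the sketch free of side effects. To produce the file from the question, the same function could be called as `write_records(soup, f)` inside `with open('output.txt', 'w', encoding='utf-8') as f:`.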
      

      【Comments】:
