【Question title】: How to scrape icons from an HTML table using Beautiful Soup
【Posted】: 2022-01-03 13:54:19
【Question description】:

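I am trying to scrape a table from the markets.ft website that, unfortunately, contains many icons (table: 'Lipper Leader Scorecard' - https://markets.ft.com/data/funds/tearsheet/ratings?s=LU0526609390:EUR).

When I use BeautifulSoup I can scrape the table, but all the values are NaN.

Is there a way to scrape the icons in the table and convert them to numbers?

My code is: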

import requests
import pandas as pd
from bs4 import BeautifulSoup

id_list = ['LU0526609390:EUR','IE00BHBX0Z19:EUR', 'LU1076093779:EUR', 'LU1116896363:EUR', 'LU1116896876:EUR']
urls = ['https://markets.ft.com/data/funds/tearsheet/ratings?s='+ x for x in id_list]

dfs = []
for url in urls:
    r = requests.get(url).content
    soup = BeautifulSoup(r, 'html.parser')
    # Some funds in the list do not have any data.
    try:
        table = soup.find_all('table')[0]
        print(table)
    except Exception:
        continue
    df = pd.read_html(str(table), index_col=0)[0]
    dfs.append(df)

print(dfs)

Desired output for the fund (LU0526609390):

                Total return  Consistent return  Preservation  Expense
Overall rating           3                3           5            5
3 year rating            3                3           5            5
5 year rating            2                3           5            5

【Question discussion】:

    Tags: python pandas web-scraping beautifulsoup icons


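    【Solution 1】:

    Each rating is rendered as an icon whose CSS class encodes the score, so you can use a dictionary to map the class values to the corresponding integers: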

    import requests
    import bs4
    import pandas as pd
    from io import StringIO

    # Map the sprite class on each rating icon to its numeric score.
    options = {
        'mod-sprite-lipper-1': 1,
        'mod-sprite-lipper-2': 2,
        'mod-sprite-lipper-3': 3,
        'mod-sprite-lipper-4': 4,
        'mod-sprite-lipper-5': 5,
    }

    # Fetch the page and select the scorecard table.
    soup = bs4.BeautifulSoup(
        requests.get(
            url='https://markets.ft.com/data/funds/tearsheet/ratings?s=LU0526609390:EUR'
        ).content,
        'html.parser',
    ).find('table', {'class': 'mod-ui-table'})

    # Column names come from the header row.
    header = [x.text.strip() for x in soup.find('thead').find_all('th')]

    # Each body row: the row label, then the score for each icon cell
    # (the last class on the <i> element identifies the sprite).
    data = [header] + [
        [x.find('td').text.strip()] + [
            options[e.find('i').get('class')[-1]]
            for e in x.find_all('td')[1:]
        ]
        for x in soup.find('tbody').find_all('tr')
    ]

    # Round-trip through CSV so pandas infers the header, index and integer dtypes.
    df = pd.read_csv(
        StringIO('\n'.join(','.join(str(x) for x in xs) for xs in data)),
        index_col=0,
    )

    print(df)
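The same class-to-integer idea can also be written as a function that tolerates missing tables and empty cells, which may help with funds that have no data. This is only a sketch, not a tested scraper for the live site: the sample markup, the `mod-icon` class, and the names `SAMPLE_HTML`, `icon_rating` and `parse_scorecard` are assumptions made for illustration, and the real page may be structured differently.

```python
import re

import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical, trimmed-down markup modelled on the class names used above;
# the real page may differ.
SAMPLE_HTML = """
<table class="mod-ui-table">
  <thead>
    <tr><th></th><th>Total return</th><th>Expense</th></tr>
  </thead>
  <tbody>
    <tr>
      <td>Overall rating</td>
      <td><i class="mod-icon mod-sprite-lipper-3"></i></td>
      <td><i class="mod-icon mod-sprite-lipper-5"></i></td>
    </tr>
    <tr>
      <td>3 year rating</td>
      <td><i class="mod-icon mod-sprite-lipper-2"></i></td>
      <td></td>
    </tr>
  </tbody>
</table>
"""

# Matches a sprite class such as 'mod-sprite-lipper-3' and captures the digit.
RATING_CLASS = re.compile(r"mod-sprite-lipper-(\d)$")


def icon_rating(cell):
    """Return the rating encoded in the cell's <i> class, or None if absent."""
    icon = cell.find("i")
    if icon is None:
        return None
    for css_class in icon.get("class", []):
        match = RATING_CLASS.match(css_class)
        if match:
            return int(match.group(1))
    return None


def parse_scorecard(html):
    """Parse the first 'mod-ui-table' in html into a DataFrame, or None."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_="mod-ui-table")
    if table is None:
        return None
    header_cells = table.find("thead").find_all("th")
    # Skip the first header cell, which sits above the row labels.
    columns = [th.get_text(strip=True) for th in header_cells][1:]
    rows = {}
    for tr in table.find("tbody").find_all("tr"):
        cells = tr.find_all("td")
        label = cells[0].get_text(strip=True)
        rows[label] = [icon_rating(td) for td in cells[1:]]
    return pd.DataFrame.from_dict(rows, orient="index", columns=columns)


df = parse_scorecard(SAMPLE_HTML)
print(df)
```

To apply this to the list of fund IDs in the question, one option is to call `parse_scorecard(requests.get(url).content)` for each URL and skip any result that is `None`. Cells without an icon come back as missing values instead of raising a `KeyError`.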
    

    【Discussion】:
