【Question title】: Web scraping with Beautiful Soup
【Posted】: 2020-11-03 17:59:28
【Question description】:

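I am trying to scrape a table from a website that lists current interest rates. I am using Python and Beautiful Soup, but I can't locate the right part of the HTML. Please send help! Thanks.

I only need to scrape the "Current interest rates" table, not everything else, and convert it to a CSV file. Here is the link to the website: https://www.global-rates.com/en/interest-rates/libor/american-dollar/usd-libor-interest-rate-12-months.aspx Here is a picture of the current interest rates table:

I tried something like this: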

import bs4 
import requests
from bs4 import BeautifulSoup
import pandas as pd 

URL = 'https://www.global-rates.com/en/interest-rates/libor/american-dollar/usd-libor-interest-rate-12-months.aspx' 

response = requests.get(URL)
soup=bs4.BeautifulSoup(response.content, 'html.parser')

print(soup.title)
print(soup.title.string)
print(len(response.text))

table = soup.find('table', attrs = {'class':'tableheader'}).tbody
print(table)

columns = ['Current interest rates']
df = pd.DataFrame(columns = columns)

trs = table.find_all('tr')
for tr in trs:
    tds = tr.find_all('td')
    row = [td.text.replace('\n', '') for td in tds]
    df = df.append(pd.Series(row, index = columns), ignore_index = True)
df.to_csv('libor.csv', index = False)


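But this gives me an AttributeError: 'NoneType' object has no attribute 'tbody'

Oh, and if possible, I would also like to automatically scrape only the Monday rates. Thanks for your help.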
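
For context on the error: `soup.find(...)` returns `None` when nothing matches, and `.tbody` is then looked up on `None`. The sketch below reproduces that and guards against it. The HTML string is made up for illustration, and the idea that the `tableheader` class sits on a `<td>` rather than on the `<table>` is an assumption about the page, not something confirmed here. Note also that `html.parser` does not insert a `<tbody>` that is absent from the source, so `table.tbody` can be `None` even when the table itself is found.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only; the real page may differ.
html = """
<table>
  <tr><td class="tableheader" colspan="2">Current interest rates</td></tr>
  <tr><td>november 02 2020</td><td>0.33238 %</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# No <table> carries the class in this markup, so find() returns None.
table = soup.find("table", attrs={"class": "tableheader"})
print(table)

# Guard before dereferencing, and fall back to the header cell's parent table.
if table is None:
    header = soup.find("td", class_="tableheader")
    table = header.find_parent("table") if header is not None else None

rows = []
if table is not None:
    # Iterate over <tr> directly instead of relying on a <tbody> element.
    for tr in table.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if len(cells) == 2:  # skip the single-cell header row
            rows.append(cells)

print(rows)
```

Checking for `None` before chaining attributes turns a confusing traceback into an explicit branch where the selector can be adjusted.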

【Question comments】:

    Tags: python-3.x web-scraping beautifulsoup


    【Solution 1】:

    Here is my attempt with pandas:

    import pandas as pd
    
    # Get all tables on page
    dfs = pd.read_html('https://www.global-rates.com/en/interest-rates/libor/american-dollar/usd-libor-interest-rate-12-months.aspx')
    
    # Find the Current interest rates table
    df = [t for t in dfs if t.iloc[0, 0] == 'Current interest rates'][0]
    
    # Remove first row that contains column names
    df = df.iloc[1:].copy()
    
    # Set column names
    df.columns = ['DATE','INTEREST_RATE']
    
    # Convert date from november 02 2020 to 2020-11-02
    df['DATE'] = pd.to_datetime(df['DATE'])
    
    # Remove percentage sign from interest rate
    df['INTEREST_RATE'] = df['INTEREST_RATE'].str.replace('%','').str.strip()
    
    # Convert percentage to float type
    df['INTEREST_RATE'] = df['INTEREST_RATE'].astype(float)
    
    # Add day of the week column
    df['DAY'] = df['DATE'].dt.day_name()
    
    # Output all to CSV
    df.to_csv('all_data.csv', index=False)
    
    # Only Mondays
    df_monday = df[df['DAY'] == 'Monday']
    
    # Output only Mondays
    df_monday.to_csv('monday_data.csv', index=False)
    
    # Add day number of week (Monday = 0)
    df['DAY_OF_WEEK_NUMBER'] = df['DATE'].dt.dayofweek
    
    # Add week number of year
    # (Series.dt.weekofyear was removed in pandas 2.0; isocalendar().week replaces it)
    df['WEEK_OF_YEAR_NUMBER'] = df['DATE'].dt.isocalendar().week
    
    # 1. Sort by week of year then day of week
    # 2. Group by week of year
    # 3. Select first record in group, which will be the earliest day available of that week
    df_first_day_of_week = df.sort_values(['WEEK_OF_YEAR_NUMBER','DAY_OF_WEEK_NUMBER']).groupby('WEEK_OF_YEAR_NUMBER').first()
    
    # Output earliest day of the week data
    df_first_day_of_week.to_csv('first_day_of_week.csv', index=False)
    
    

    【Comments】:

      【Solution 2】:

      You can use this example to scrape the "Current interest rates" table:

      import requests
      import pandas as pd
      from bs4 import BeautifulSoup
      
      
      url = 'https://www.global-rates.com/en/interest-rates/libor/american-dollar/usd-libor-interest-rate-12-months.aspx'
      soup = BeautifulSoup(requests.get(url).content, 'html.parser')
      
      all_data = []
      # ':-soup-contains' replaces the deprecated ':contains' (requires soupsieve >= 2.1)
      for row in soup.select('table:has(td:-soup-contains("Current interest rates"))[style="width:208px;border:1px solid #CCCCCC;"] tr:not(:has([colspan]))'):
          tds = [td.get_text(strip=True) for td in row.select('td')]
          all_data.append(tds)
      
      df = pd.DataFrame(all_data, columns=['Date', 'Rate'])
      print(df)
      df.to_csv('data.csv', index=False)
      

      Prints:

                      Date       Rate
      0   november 02 2020  0.33238 %
      1    october 30 2020  0.33013 %
      2    october 29 2020  0.33100 %
      3    october 28 2020  0.32763 %
      4    october 27 2020  0.33175 %
      5    october 26 2020  0.33200 %
      6    october 23 2020  0.33663 %
      7    october 22 2020  0.33513 %
      8    october 21 2020  0.33488 %
      9    october 20 2020  0.33713 %
      10   october 19 2020  0.33975 %
      11   october 16 2020  0.33500 %
      

      and saves data.csv.


      EDIT: To get only Mondays, you can filter the DataFrame like this:

      df['Date'] = pd.to_datetime(df['Date'])
      print(df[df['Date'].dt.weekday==0])
      

      Prints:

               Date       Rate
      0  2020-11-02  0.33238 %
      5  2020-10-26  0.33200 %
      10 2020-10-19  0.33975 %
      

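      【Comments】:

      • Thank you so much, Andrej! This looks great! Is it possible to have it fetch only the Monday data?
      • It still seems to save the whole table, although I can see it prints only the Mondays. Is there a way to save only the Mondays to the CSV file?
      • @Anna You can do df[df['Date'].dt.weekday==0].to_csv('mondays.csv', index=False)
      • One last quick question: is it possible to get all the Mondays, but if a Monday is a holiday (so it is missing), get that week's Tuesday instead?
      • @Anna I have added first-day-of-the-week logic to my answer. In short, you add the week-of-year number and the day-of-week number, sort by both, group by the week number, and select the first record. That gives you the earliest available day of that week.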
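
A minimal sketch of the "earliest available day per week" idea from the last comment. The rows come from the table printed in Solution 2, with october 26 (a Monday) removed on purpose to simulate a holiday. Grouping by ISO year as well as ISO week is a variation introduced here so that weeks from different years are not merged. The column names `ISO_YEAR`, `ISO_WEEK`, and `DAY` are illustrative.

```python
import pandas as pd

# Sample rows from the printed table; october 26 2020 (a Monday) is
# deliberately left out to simulate a missing holiday.
df = pd.DataFrame(
    {
        "Date": ["november 02 2020", "october 30 2020", "october 29 2020",
                 "october 28 2020", "october 27 2020", "october 23 2020",
                 "october 20 2020", "october 19 2020"],
        "Rate": ["0.33238 %", "0.33013 %", "0.33100 %", "0.32763 %",
                 "0.33175 %", "0.33663 %", "0.33713 %", "0.33975 %"],
    }
)
df["Date"] = pd.to_datetime(df["Date"])

# ISO year plus ISO week identifies a calendar week across year boundaries.
iso = df["Date"].dt.isocalendar()
df["ISO_YEAR"] = iso["year"]
df["ISO_WEEK"] = iso["week"]

# Sort chronologically, then keep the first row of each (year, week) group.
first_days = (
    df.sort_values("Date")
      .groupby(["ISO_YEAR", "ISO_WEEK"], as_index=False)
      .first()
)
first_days["DAY"] = first_days["Date"].dt.day_name()
print(first_days[["Date", "DAY", "Rate"]])
```

With this sample, the week whose Monday is missing falls back to Tuesday, october 27, while the other two weeks keep their Mondays. Calling `first_days.to_csv(...)` afterwards would save only those rows.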