【Question title】: How to use R to scrape financials from Yahoo Finance
【Posted】: 2018-04-15 18:05:43
【Question】:

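I am interested in using R to analyze the balance sheet, income statement, and cash flow statement from Yahoo Finance for multiple tickers.

I have seen that there are R packages that pull information from Yahoo Finance, but every example I have found deals with historical stock prices. Is there a way to use R to pull historical information from these statements?

For example, for Apple (AAPL), the links are as follows:

Essentially, the goal is to create three data frames (AAPL_cashflow, AAPL_income, AAPL_balance) with the same layout as on the website. Each row is identified by the financial line item, and the columns are dates.

Does anyone have experience parsing and scraping tables? I think rvest can help with this, right?

Thanks in advance!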

【Comments】:

  • What have you tried so far? Please add the code you have written to your question.

Tags: r web-scraping finance


【Solution 1】:

Using a few packages from the tidyverse, this should get you started:

library(tidyverse)
library(rvest)

"https://finance.yahoo.com/quote/AAPL/financials?p=AAPL" %>% 
  read_html() %>% 
  html_table() %>% 
  map_df(bind_cols) %>% 
  as_tibble()
# A tibble: 28 x 5
   X1                                 X2                 X3                 X4                 X5      
   <chr>                              <chr>              <chr>              <chr>              <chr>   
 1 Revenue                            9/30/2017          9/24/2016          9/26/2015          9/27/20…
 2 Total Revenue                      229,234,000        215,639,000        233,715,000        182,795…
 3 Cost of Revenue                    141,048,000        131,376,000        140,089,000        112,258…
 4 Gross Profit                       88,186,000         84,263,000         93,626,000         70,537,…
 5 Operating Expenses                 Operating Expenses Operating Expenses Operating Expenses Operati…
 6 Research Development               11,581,000         10,045,000         8,067,000          6,041,0…
 7 Selling General and Administrative 15,261,000         14,194,000         14,329,000         11,993,…
 8 Non Recurring                      -                  -                  -                  -       
 9 Others                             -                  -                  -                  -       
10 Total Operating Expenses           167,890,000        155,615,000        162,485,000        130,292…
# ... with 18 more rows

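Note that if you want the first row to be treated as the column names, add header = TRUE to the html_table call. For example, this gives you the dates as column names in the finances data frame.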
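The effect of header = TRUE can be tried offline on a small inline table. This is only a sketch: it assumes rvest 1.0.0 or later (for minimal_html() and html_element()), and the two-row table is a hypothetical stand-in for the real page, with values borrowed from the output above.

```r
library(rvest)

# Hypothetical stand-in for the scraped page: one table whose first row holds the dates
page <- minimal_html("
  <table>
    <tr><td>Revenue</td><td>9/30/2017</td><td>9/24/2016</td></tr>
    <tr><td>Total Revenue</td><td>229,234,000</td><td>215,639,000</td></tr>
  </table>")

# Default call: the first row stays in the data and columns get generic names
no_header <- page %>% html_element("table") %>% html_table()

# header = TRUE: the first row is promoted to column names
with_header <- page %>% html_element("table") %>% html_table(header = TRUE)

names(no_header)
names(with_header)
```

The figures are still character strings in both cases, because the thousands separators prevent automatic conversion to numbers.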

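In addition, this data frame holds several tables stacked together, so you will need to reshape it before you can work with the data. For example, the variables X2 through X5 are currently character when they should be numeric.

One example might be: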

finances <- "https://finance.yahoo.com/quote/AAPL/financials?p=AAPL" %>% 
  read_html() %>% 
  html_table(header = TRUE) %>% 
  map_df(bind_cols) %>% 
  as_tibble()

finances %>% 
  mutate_all(funs(str_replace_all(., ",", ""))) %>% 
  mutate_all(funs(str_replace(., "-", NA_character_))) %>%
  mutate_at(vars(-Revenue), funs(str_remove_all(., "[a-zA-Z]"))) %>% 
  mutate_at(vars(-Revenue), funs(as.numeric)) %>% 
  drop_na()
# A tibble: 14 x 5
   Revenue                                `9/30/2017` `9/24/2016` `9/26/2015` `9/27/2014`
   <chr>                                        <dbl>       <dbl>       <dbl>       <dbl>
 1 Total Revenue                           229234000.  215639000.  233715000.  182795000.
 2 Cost of Revenue                         141048000.  131376000.  140089000.  112258000.
 3 Gross Profit                             88186000.   84263000.   93626000.   70537000.
 4 Research Development                     11581000.   10045000.    8067000.    6041000.
 5 Selling General and Administrative       15261000.   14194000.   14329000.   11993000.
 6 Total Operating Expenses                167890000.  155615000.  162485000.  130292000.
 7 Operating Income or Loss                 61344000.   60024000.   71230000.   52503000.
 8 Total Other Income/Expenses Net           2745000.    1348000.    1285000.     980000.
 9 Earnings Before Interest and Taxes       61344000.   60024000.   71230000.   52503000.
10 Income Before Tax                        64089000.   61372000.   72515000.   53483000.
11 Income Tax Expense                       15738000.   15685000.   19121000.   13973000.
12 Net Income From Continuing Ops           48351000.   45687000.   53394000.   39510000.
13 Net Income                               48351000.   45687000.   53394000.   39510000.
14 Net Income Applicable To Common Shares   48351000.   45687000.   53394000.   39510000.
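The cleaning steps above can also be tried without hitting the website. The sketch below is an alternative, not the original answer's code: it assumes dplyr 1.0.0 or later, where across() is available and funs() is deprecated, and the toy tibble is hypothetical, with values copied from the output above.

```r
library(tibble)
library(dplyr)
library(stringr)
library(tidyr)

# Hypothetical toy data mimicking the scraped table after header = TRUE
toy <- tibble(
  Revenue     = c("Total Revenue", "Operating Expenses", "Non Recurring", "Gross Profit"),
  `9/30/2017` = c("229,234,000", "Operating Expenses", "-", "88,186,000"),
  `9/24/2016` = c("215,639,000", "Operating Expenses", "-", "84,263,000")
)

cleaned <- toy %>%
  # Strip thousands separators from every value column
  mutate(across(-Revenue, ~ str_remove_all(.x, ","))) %>%
  # Turn cells that are exactly "-" into NA
  mutate(across(-Revenue, ~ na_if(.x, "-"))) %>%
  # Convert to numeric; section-header text becomes NA (warning suppressed)
  mutate(across(-Revenue, ~ suppressWarnings(as.numeric(.x)))) %>%
  # Drop the section-header and empty rows
  drop_na()

cleaned
```

One design difference is worth noting: na_if() only blanks cells that are exactly "-", whereas str_replace(., "-", NA_character_) sets any value containing a hyphen to NA, which as far as I can tell would also discard negative figures.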

We can go a step further and use gather to make the data frame more "tidy":

finances %>% 
  mutate_all(funs(str_replace_all(., ",", ""))) %>% 
  mutate_all(funs(str_replace(., "-", NA_character_))) %>%
  mutate_at(vars(-Revenue), funs(str_remove_all(., "[a-zA-Z]"))) %>% 
  mutate_at(vars(-Revenue), funs(as.numeric)) %>% 
  drop_na() %>% 
  gather(key = "date", value, -Revenue) %>% 
  mutate(date = lubridate::mdy(date)) %>% 
  rename("var" = Revenue) %>% 
  as_tibble()
# A tibble: 56 x 3
   var                                date            value
   <chr>                              <date>          <dbl>
 1 Total Revenue                      2017-09-30 229234000.
 2 Cost of Revenue                    2017-09-30 141048000.
 3 Gross Profit                       2017-09-30  88186000.
 4 Research Development               2017-09-30  11581000.
 5 Selling General and Administrative 2017-09-30  15261000.
 6 Total Operating Expenses           2017-09-30 167890000.
 7 Operating Income or Loss           2017-09-30  61344000.
 8 Total Other Income/Expenses Net    2017-09-30   2745000.
 9 Earnings Before Interest and Taxes 2017-09-30  61344000.
10 Income Before Tax                  2017-09-30  64089000.
# ... with 46 more rows
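In tidyr 1.0.0 and later, gather() is superseded by pivot_longer(). As a sketch of the same reshaping step under that assumption, using a hypothetical two-row tibble with values taken from the output above and base as.Date() in place of lubridate:

```r
library(tibble)
library(dplyr)
library(tidyr)

# Hypothetical wide data: one row per line item, one column per statement date
wide <- tibble(
  Revenue     = c("Total Revenue", "Gross Profit"),
  `9/30/2017` = c(229234000, 88186000),
  `9/24/2016` = c(215639000, 84263000)
)

long <- wide %>%
  # Stack the date columns into date/value pairs
  pivot_longer(cols = -Revenue, names_to = "date", values_to = "value") %>%
  # Parse the month/day/year column names into Date objects
  mutate(date = as.Date(date, format = "%m/%d/%Y")) %>%
  rename(var = Revenue)

long
```

Unlike gather(), pivot_longer() keeps the rows for each line item together, so the row order differs from the output shown above even though the content is the same.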

【Comments】:

    【Solution 2】:

    The following code no longer seems to work, or I am using it incorrectly.

    finances <- "https://finance.yahoo.com/quote/AAPL/financials?p=AAPL" %>% 
      read_html() %>% 
      html_table() %>% 
      map_df(bind_cols) %>% 
      as_tibble()
    

    I would have posted this as a comment, but I do not know how to format code blocks in comments.

    【Comments】:
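One plausible explanation, which has not been verified against the live site here, is that the page no longer delivers the statements as HTML <table> elements; in that case html_table() has nothing to parse. A minimal way to check for this before going further, assuming rvest 1.0.0 or later and using a hypothetical div-based snippet in place of the real page:

```r
library(rvest)

# Hypothetical stand-in for a page that lays out its figures with <div>/<span> instead of <table>
page <- minimal_html(
  "<div class='row'><span>Total Revenue</span><span>229,234,000</span></div>"
)

# Select every <table> on the page and parse each one; this is an empty list here
tables <- page %>% html_elements("table") %>% html_table()

if (length(tables) == 0) {
  message("No <table> elements found; the figures are probably not served as HTML tables.")
}
```

Running the same check on the result of read_html() for the real URL would show whether the pipeline fails at the table-extraction step or somewhere later.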
