【Question title】: Regular rolling sum and mean
【Posted】: 2017-10-18 19:06:14
【Question】:

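I am trying to write an R script, using lubridate, data.table, and dplyr, that I have to run every quarter. I want to automate as much of it as possible so that I might only need to change the directory to run it. Basically, my problem is that I need to create one dataset from another dataset (dataset A), which looks like this: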

      ID   fromdate     todate Quarters        Cost Location
  1:  29 2015-03-08 2015-03-25   2015Q1    13747.12  Orlando
  2:  29 2015-04-08 2015-04-08   2015Q2     1555.08    Miami
  3:  29 2015-07-08 2015-07-08   2015Q3      961.51    Miami
  4:  29 2015-09-23 2015-09-24   2015Q3     3492.00  Orlando
  5:  29 2015-09-24 2015-10-03   2015Q4     9948.56  Orlando
 ---                                                        
593: 174 2017-03-01 2017-03-31   2017Q1     2794.26  Orlando
594: 174 2017-04-05 2017-04-05   2017Q2      425.86    Miami
595: 174 2017-04-03 2017-04-28   2017Q2     2400.24  Orlando
596: 174 2017-05-01 2017-05-31   2017Q2     2805.46  Orlando
597: 174 2017-06-02 2017-06-30   2017Q2     2603.51  Orlando

The full set of rows for one of the IDs is

    ID   fromdate     todate Quarters CLM_PMT_AMT Location
 1: 29 2015-03-08 2015-03-25   2015Q1    13747.12  Orlando
 2: 29 2015-04-08 2015-04-08   2015Q2     1555.08    Miami
 3: 29 2015-07-08 2015-07-08   2015Q3      961.51    Miami
 4: 29 2015-09-23 2015-09-24   2015Q3     3492.00  Orlando
 5: 29 2015-09-24 2015-10-03   2015Q4     9948.56  Orlando
 6: 29 2015-10-03 2015-10-03   2015Q4       39.33  Orlando
 7: 29 2015-10-05 2015-10-05   2015Q4      192.26    Miami
 8: 29 2015-10-11 2015-10-14   2015Q4     9478.80  Orlando
 9: 29 2015-10-15 2015-10-27   2015Q4    20655.46  Orlando
10: 29 2015-10-06 2015-10-31   2015Q4     1061.70  Orlando
11: 29 2015-11-03 2015-11-03   2015Q4      319.29  Orlando
12: 29 2015-11-05 2015-11-05   2015Q4      894.58    Miami
13: 29 2015-11-05 2015-11-28   2015Q4    21678.48  Orlando
14: 29 2015-12-06 2015-12-06   2015Q4      248.98    Miami
15: 29 2015-12-16 2015-12-25   2015Q4     9948.56  Orlando
16: 29 2015-12-01 2015-12-29   2015Q4     1417.91  Orlando
17: 29 2015-12-30 2016-01-01   2016Q1     9514.55  Orlando
18: 29 2016-01-05 2016-01-10   2016Q1     9682.28  Orlando
19: 29 2016-01-25 2016-01-27   2016Q1     6764.50  Orlando
20: 29 2016-01-03 2016-01-30   2016Q1     1564.87  Orlando
21: 29 2016-02-15 2016-02-17   2016Q1     3908.10  Orlando
22: 29 2016-02-02 2016-02-27   2016Q1     1886.87  Orlando
23: 29 2016-03-03 2016-03-03   2016Q1       76.58    Miami
24: 29 2016-03-03 2016-03-06   2016Q1     3213.78  Orlando
25: 29 2016-03-14 2016-03-23   2016Q1     4871.14  Orlando

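What I am trying to do with this dataset is compute the sum and mean of Cost by quarter, over a rolling year. For example, for ID = 29 and Quarters = 2015Q4, it would be the sum and mean of Cost from Quarters = 2015Q1 through Quarters = 2015Q4; for Quarters = 2016Q2, the sum and mean should cover Quarters = 2015Q3 through Quarters = 2016Q2. This should be done for every ID, every Location, and every Quarter. I know I will probably have to use something like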

A %>% 
group_by(ID, Quarters, Location) %>%
...

but the problem I am running into is that not every quarter is present for every ID. Any suggestions on how to do this? I am at my wits' end!
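
To make the rolling-year window in the question concrete, here is a minimal base R sketch that lists the four quarters feeding into a given quarter. The helper names (`quarter_index`, `index_to_quarter`, `rolling_window`) are hypothetical and not part of any package; the sketch assumes labels of the form "2016Q2".

```r
# Map a label such as "2016Q2" to a running integer index (4 per year)
quarter_index <- function(q) {
  year <- as.integer(substr(q, 1, 4))
  qtr  <- as.integer(substr(q, 6, 6))
  year * 4L + (qtr - 1L)
}

# Map a running index back to a label such as "2016Q2"
index_to_quarter <- function(i) paste0(i %/% 4L, "Q", i %% 4L + 1L)

# The four quarters (oldest first) that make up the rolling year ending at q
rolling_window <- function(q) index_to_quarter(quarter_index(q) - 3:0)

rolling_window("2015Q4")
# [1] "2015Q1" "2015Q2" "2015Q3" "2015Q4"
rolling_window("2016Q2")
# [1] "2015Q3" "2015Q4" "2016Q1" "2016Q2"
```

Any quarter in that window that is absent for an ID has to be treated as zero (or NA) before rolling, which is the gap the answers address.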

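【Comments】:

  • Use zoo::rollmean and zoo::rollsum. If you search the R tag for "[r] rolling mean", there are plenty of questions on this: stackoverflow.com/search?q=%5Br%5D+rolling+mean
  • And you don't want to group by quarter; you want to group by ID and Location, and roll over the quarters.

Tags: r dplyr data.table lubridate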
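
A minimal sketch of what the two comments suggest, assuming dplyr (1.0 or later, for `.groups`) and zoo are installed. The data here is a made-up toy set that reuses the question's column names, and it assumes every quarter is already present for each ID/Location pair; missing quarters would have to be filled in first.

```r
library(dplyr)
library(zoo)

# Toy data (hypothetical values), one ID and Location, five consecutive quarters
A <- tibble(
  ID       = 29,
  Location = "Miami",
  Quarters = c("2015Q1", "2015Q2", "2015Q3", "2015Q4", "2016Q1"),
  Cost     = c(10, 20, 30, 40, 50)
)

rolled <- A %>%
  group_by(ID, Location, Quarters) %>%
  summarise(Cost = sum(Cost), .groups = "drop") %>%  # one row per quarter
  arrange(ID, Location, Quarters) %>%
  group_by(ID, Location) %>%                         # roll within ID and Location
  mutate(
    roll_sum  = zoo::rollsumr(Cost, k = 4, fill = NA),
    roll_mean = zoo::rollmeanr(Cost, k = 4, fill = NA)
  ) %>%
  ungroup()

rolled$roll_sum
# [1]  NA  NA  NA 100 140
```

Note that `roll_mean` here is the mean of the four quarterly totals, which may or may not be the mean the question has in mind.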


【Solution 1】:

You can use tidyr::complete to add the missing quarters. For example:

library(tidyverse)
dt %>% 
  mutate(Quarters = as.factor(Quarters)) %>% 
  group_by(ID, Location, Quarters) %>% 
  summarise_if(is.numeric, funs(mean(., na.rm = TRUE))) %>% 
  complete(ID, Location, Quarters, fill=list(CLM_PMT_AMT=0)) %>% 
  mutate_if(is.numeric, funs(roll = zoo::rollmeanr(., k=4, na.pad = TRUE)))
# # A tibble: 10 x 5
# # Groups: ID, Location [2]
#       ID Location Quarters CLM_PMT_AMT  roll
#    <int> <chr>    <fctr>         <dbl> <dbl>
#  1    29 Miami    2015Q1           0      NA
#  2    29 Miami    2015Q2        1555      NA
#  3    29 Miami    2015Q3         962      NA
#  4    29 Miami    2015Q4         445     740
#  5    29 Miami    2016Q1          76.6   760
#  6    29 Orlando  2015Q1       13747      NA
#  7    29 Orlando  2015Q2           0      NA
#  8    29 Orlando  2015Q3        3492      NA
#  9    29 Orlando  2015Q4        8283    6381
# 10    29 Orlando  2016Q1        5176    4238

【Comments】:

  • Does this work if 2015Q3 is missing from the dataset entirely (not just missing for one ID)?
  • @Ben No. However, you can supply the desired factor levels, e.g. factor(c("2015Q1", "2015Q3"), levels = paste0(rep(2015:2016, each=4), "Q", 1:4))
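
The factor-level idea from the last comment can be sketched as follows, assuming dplyr and tidyr are installed. The toy data and the 2015Q1 to 2016Q4 range are assumptions for illustration; 2015Q2 is deliberately absent for every ID, yet complete() still produces it because it expands over all factor levels, including unused ones.

```r
library(dplyr)
library(tidyr)

# Every quarter the report should cover (assumed range for this example)
all_quarters <- paste0(rep(2015:2016, each = 4), "Q", 1:4)

# Toy data (hypothetical values) in which 2015Q2 never occurs
dt <- tibble(
  ID          = c(29, 29, 174),
  Location    = c("Miami", "Miami", "Miami"),
  Quarters    = c("2015Q1", "2015Q3", "2015Q3"),
  CLM_PMT_AMT = c(100, 200, 300)
)

filled <- dt %>%
  mutate(Quarters = factor(Quarters, levels = all_quarters)) %>%
  complete(ID, Location, Quarters, fill = list(CLM_PMT_AMT = 0))

nrow(filled)
# [1] 16   (2 IDs x 1 Location x 8 quarters)
```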

【Solution 2】:

How about this?

library(data.table)
library(mltools)

dt <- data.table(
  id = c(1, 1, 1, 1, 1,
         2, 2, 2, 2),
  somedate = as.Date(c("2014-2-1", "2014-2-28", "2014-9-30", "2014-12-11", "2015-5-15", 
                       "2014-8-11", "2015-6-30", "2015-6-30", "2015-12-1")),
  value = c(1, 2, 3, 4, 5,
            10, 20, 30, 40)
)

# Insert YearQuarter (a factor whose levels span every quarter between the min and max date)
dt[, YearQuarter := mltools::date_factor(somedate, type = "yearquarter")]

dt
   id   somedate value YearQuarter
1:  1 2014-02-01     1     2014 Q1
2:  1 2014-02-28     2     2014 Q1
3:  1 2014-09-30     3     2014 Q3
4:  1 2014-12-11     4     2014 Q4
5:  1 2015-05-15     5     2015 Q2
6:  2 2014-08-11    10     2014 Q3
7:  2 2015-06-30    20     2015 Q2
8:  2 2015-06-30    30     2015 Q2
9:  2 2015-12-01    40     2015 Q4

# Build table of all possible (id, YearQuarter) pairs based on the levels of dt$YearQuarter
temp <- CJ(id = unique(dt$id), YearQuarter = levels(dt$YearQuarter))

# Aggregate dt to unique (id, YearQuarter) pairs
dt_aggregated <- dt[, list(value_sum = sum(value)), keyby=list(id, YearQuarter)]

# Determine the value_sum in each quarter for each id, via join to temp
result <- dt_aggregated[temp, on=c("id", "YearQuarter")]
result[is.na(value_sum), value_sum := 0]

# Rolling sums by id
result[, RollingAnnualSum := Reduce(`+`, shift(x = value_sum, n = 0:3, fill = 0, type = "lag")), by="id"]

result
    id YearQuarter value_sum RollingAnnualSum
 1:  1     2014 Q1         3                3
 2:  1     2014 Q2         0                3
 3:  1     2014 Q3         3                6
 4:  1     2014 Q4         4               10
 5:  1     2015 Q1         0                7
 6:  1     2015 Q2         5               12
 7:  1     2015 Q3         0                9
 8:  1     2015 Q4         0                5
 9:  2     2014 Q1         0                0
10:  2     2014 Q2         0                0
11:  2     2014 Q3        10               10
12:  2     2014 Q4         0               10
13:  2     2015 Q1         0               10
14:  2     2015 Q2        50               60
15:  2     2015 Q3         0               50
16:  2     2015 Q4        40               90
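
The question also asks for a rolling mean. One possible extension, not part of the answer above, is data.table's own `frollsum` and `frollmean` (available from data.table 1.12.0). The sketch below rebuilds the zero-filled quarterly totals for id 1 from the table above so that it runs on its own. Unlike the `shift`/`Reduce` version with `fill = 0`, these functions return NA until a full four-quarter window is available.

```r
library(data.table)

# Zero-filled quarterly totals for id 1, copied from the result above
result <- data.table(
  id          = 1,
  YearQuarter = c("2014 Q1", "2014 Q2", "2014 Q3", "2014 Q4",
                  "2015 Q1", "2015 Q2", "2015 Q3", "2015 Q4"),
  value_sum   = c(3, 0, 3, 4, 0, 5, 0, 0)
)

# Right-aligned four-quarter windows, computed separately for each id
result[, RollingAnnualSum  := frollsum(value_sum, n = 4), by = "id"]
result[, RollingAnnualMean := frollmean(value_sum, n = 4), by = "id"]

result$RollingAnnualSum
# [1] NA NA NA 10  7 12  9  5
```

Here RollingAnnualMean is the average quarterly total over the window (quarters with no rows count as zero), which is an assumption about what "mean" should represent.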
