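Debugging: function to create multiple lags for multiple columns (dplyr): answers

[Question title]: debugging: function to create multiple lags for multiple columns (dplyr)
[Posted]: 2016-06-30 09:36:23
[Question description]:

I want to create multiple lags of multiple variables, so I thought writing a function would help. My code throws a warning ("Truncating vector to length 1") and produces wrong results: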

library(dplyr)
time <- c(2000:2009, 2000:2009)
x <- c(1:10, 10:19)
id <- c(1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2)
df <- data.frame(id, time, x)



three_lags <- function (data, column, group, ordervar) {
  data <- data %>% 
    group_by_(group) %>%
    mutate(a = lag(column, 1L, NA, order_by = ordervar),
            b = lag(column, 2L, NA, order_by = ordervar),
            c = lag(column, 3L, NA, order_by = ordervar)) 
  }

df_lags <- three_lags(data=df, column=x, group=id, ordervar=time) %>%
  arrange(id, time)

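I also wondered whether there is a more elegant solution using mutate_each, but I could not get that to work either. I could of course write out a long block of code for each new lag variable, but I would like to avoid that.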
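For reference, here is a minimal sketch of how the function above could be written so that the bare column names reach `lag()` intact. It assumes dplyr 1.0 or later, where the embrace operator `{{ }}` is available and `group_by_()` is deprecated. It is one possible rewrite, not the approach taken in the answers below.

```r
library(dplyr)

# Same toy data as in the question
df <- data.frame(
  id   = rep(1:2, each = 10),
  time = rep(2000:2009, times = 2),
  x    = c(1:10, 10:19)
)

# Hypothetical rewrite of three_lags(): {{ }} forwards the unquoted
# column names into group_by() and mutate()
three_lags <- function(data, column, group, ordervar) {
  data %>%
    group_by({{ group }}) %>%
    mutate(
      a = lag({{ column }}, 1L, order_by = {{ ordervar }}),
      b = lag({{ column }}, 2L, order_by = {{ ordervar }}),
      c = lag({{ column }}, 3L, order_by = {{ ordervar }})
    ) %>%
    ungroup()
}

df_lags <- three_lags(df, column = x, group = id, ordervar = time) %>%
  arrange(id, time)
```

The first row of each `id` has `NA` in `a`, `b` and `c`, because the lags are computed within each group.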

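Edit:

akrun's dplyr answer works, but it takes a long time to compute on large data frames. The data.table solution seems to be more efficient. So a dplyr or other solution that also works for multiple columns and multiple lags is still welcome.

Edit 2:

For multiple columns and no groups (e.g. "ID"), the following solution seems well suited to me because of its simplicity. The code could of course be shortened, but step by step: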

library(data.table)  # provides shift()

df <- arrange(df, time)

## Column indexes of the columns to be lagged go in "[, startcol:endcol]";
## "n = 1:3" specifies the lags (lag1, lag2 and lag3 in this case)
df.lag <- shift(df[, 1:24], n = 1:3, give.names = TRUE)

df.result <- bind_cols(df, df.lag)
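The sample data in the question does not have 24 columns, so here is a small self-contained sketch of the same steps. The two-column data frame and the `as.data.frame()` conversion are assumptions added for illustration. The conversion just makes it explicit that the list returned by `shift()` is turned into columns before binding.

```r
library(data.table)
library(dplyr)

# Hypothetical ungrouped data with two columns to lag
df <- data.frame(time = 2000:2009, x = 1:10, y = 11:20)
df <- arrange(df, time)

# shift() returns a list: all lags of x first, then all lags of y
df.lag <- shift(df[, c("x", "y")], n = 1:3, give.names = TRUE)

# Convert the list to a data frame before binding it to the original columns
df.result <- bind_cols(df, as.data.frame(df.lag))

names(df.result)  # original columns followed by the generated lag columns
```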

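[Question comments]:

Works perfectly! I just need to read up on data.table to be able to use it properly, and I think a dplyr solution would be easier to understand for others who, like me, are not very proficient programmers
I updated the data.table solution, in case there are many columns you want to shift

Tags: r dplyr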

[Solution 1]:

We can use shift from data.table, which can take multiple values for 'n'

library(data.table)
setDT(df)[order(time), c("a", "b", "c") := shift(x, 1:3) , id][order(id, time)]

Suppose we need to do this on multiple columns

df$y <- df$x
setDT(df)[order(time), paste0(rep(c("x", "y"), each =3), 
                c("a", "b", "c")) :=shift(.SD, 1:3), id, .SDcols = x:y]
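When there are many columns, the new names can be generated rather than typed out. The sketch below is one way to do that. The `cols`, `lags` and `new_names` variables and the `_lag` suffix are illustrative choices, not part of the answer above.

```r
library(data.table)

dt <- data.table(
  id   = rep(1:2, each = 10),
  time = rep(2000:2009, times = 2),
  x    = c(1:10, 10:19)
)
dt[, y := x * 2L]

cols <- c("x", "y")  # columns to lag (illustrative)
lags <- 1:3          # lags to compute

# Build names in the order shift() returns results:
# all lags of the first column, then all lags of the second, and so on
new_names <- paste0(rep(cols, each = length(lags)), "_lag", lags)

# Sort by reference so that lags follow time within each id
setorder(dt, id, time)
dt[, (new_names) := shift(.SD, lags), by = id, .SDcols = cols]
```

Wrapping `new_names` in parentheses on the left of `:=` tells data.table to use the contents of the variable as the column names.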

shift can also be used within dplyr

library(dplyr)
df %>% 
  group_by(id) %>% 
  arrange(id, time) %>% 
  do(data.frame(., setNames(shift(.$x, 1:3), c("a", "b", "c"))))
#    id  time     x     a     b     c
#   <dbl> <int> <int> <int> <int> <int>
#1      1  2000     1    NA    NA    NA
#2      1  2001     2     1    NA    NA
#3      1  2002     3     2     1    NA
#4      1  2003     4     3     2     1
#5      1  2004     5     4     3     2
#6      1  2005     6     5     4     3
#7      1  2006     7     6     5     4
#8      1  2007     8     7     6     5
#9      1  2008     9     8     7     6
#10     1  2009    10     9     8     7
#11     2  2000    10    NA    NA    NA
#12     2  2001    11    10    NA    NA
#13     2  2002    12    11    10    NA
#14     2  2003    13    12    11    10
#15     2  2004    14    13    12    11
#16     2  2005    15    14    13    12
#17     2  2006    16    15    14    13
#18     2  2007    17    16    15    14
#19     2  2008    18    17    16    15
#20     2  2009    19    18    17    16

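[Comments]:

Thank you, this works well and is clearly more efficient! I will leave the question open for now
The dplyr code generates 6 columns instead of 3, although this has the advantage of assigning sensible names to the new columns
@yoland It gives only 3 columns. Please check whether you are using the original dataset or the one already converted by data.table.
@yoland := creates the new columns in the original dataset, but %>% does not change the original dataset unless we assign the result back to df, i.e. df <- df %>% ...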
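The difference described in the last comment can be seen in a small sketch. The six-row data frame is made up for illustration. If the data.table code is run first, `df` already carries the new columns when the dplyr code runs afterwards, which may be how six extra columns came about.

```r
library(data.table)
library(dplyr)

df <- data.frame(id = rep(1:2, each = 3), time = rep(2000:2002, 2), x = 1:6)

# A dplyr pipeline returns a new object; df itself is left unchanged
res <- df %>%
  group_by(id) %>%
  mutate(a = dplyr::lag(x, 1L)) %>%
  ungroup()
ncol(df)   # 3
ncol(res)  # 4

# setDT() converts df in place, and := then adds the column by reference
setDT(df)
df[, a := shift(x, 1L), by = id]
ncol(df)   # 4
```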

[Solution 2]:

One can also create a function that outputs a tibble:

library(tidyverse)

lag_multiple <- function(x, n_vec){
  map(n_vec, lag, x = x) %>% 
    set_names(paste0("lag", n_vec)) %>% 
    as_tibble()
}

tibble(x = 1:30) %>% 
  mutate(lag_multiple(x, 1:5))
#> # A tibble: 30 x 6
#>        x  lag1  lag2  lag3  lag4  lag5
#>    <int> <int> <int> <int> <int> <int>
#>  1     1    NA    NA    NA    NA    NA
#>  2     2     1    NA    NA    NA    NA
#>  3     3     2     1    NA    NA    NA
#>  4     4     3     2     1    NA    NA
#>  5     5     4     3     2     1    NA
#>  6     6     5     4     3     2     1
#>  7     7     6     5     4     3     2
#>  8     8     7     6     5     4     3
#>  9     9     8     7     6     5     4
#> 10    10     9     8     7     6     5
#> # ... with 20 more rows
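To apply the same idea per group and to more than one column, the output names need to differ between columns. The sketch below adds a `prefix` argument for that purpose. The argument and the second column `y` are assumptions for illustration, and dplyr 1.0 or later is assumed so that `mutate()` splices an unnamed data frame result into columns.

```r
library(dplyr)
library(purrr)
library(tibble)

# Variant of lag_multiple() with a hypothetical `prefix` argument
lag_multiple <- function(x, n_vec, prefix) {
  map(n_vec, function(n) dplyr::lag(x, n = n)) %>%
    set_names(paste0(prefix, "_lag", n_vec)) %>%
    as_tibble()
}

df <- tibble(
  id   = rep(1:2, each = 10),
  time = rep(2000:2009, times = 2),
  x    = c(1:10, 10:19),
  y    = x * 2L
)

res <- df %>%
  arrange(id, time) %>%
  group_by(id) %>%
  mutate(lag_multiple(x, 1:3, "x"), lag_multiple(y, 1:3, "y")) %>%
  ungroup()
```

Because `mutate()` runs within each group, the lags restart at `NA` at the first row of every `id`.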

[Comments]: