Lags in R within specific subsets? (Answers)

【Question title】: Lags in R within specific subsets?
【Posted】: 2015-07-17 15:44:04
【Question】:

Suppose I have the following data frame:

df <- data.frame("yearmonth"=c("2005-01","2005-02","2005-03","2005-01","2005-02","2005-03"),"state"=c(1,1,1,2,2,2),"county"=c(3,3,3,3,3,3),"unemp"=c(4.0,3.6,1.4,3.7,6.5,5.4))

I am trying to create a lag of unemployment within each unique state-county combination. I would like to end up with this:

df2 <- data.frame("yearmonth"=c("2005-01","2005-02","2005-03","2005-01","2005-02","2005-03"),"state"=c(1,1,1,2,2,2),"county"=c(3,3,3,3,3,3),"unemp"=c(4.0,3.6,1.4,3.7,6.5,5.4),"unemp_lag"=c(NA,4.0,3.6,NA,3.7,6.5))

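Now imagine this situation, except with thousands of different county-state combinations and several years. I tried the lag function and the zoo.lag function, but I couldn't get either to take the state and county codes into account. One possibility is a giant for loop, but I think that is too much data (R does not handle for loops well), and I'm looking for a cleaner way to do this. Any ideas? Thanks!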

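【Comments on the question】:

df$unemp_lag <- lag(df$unemp) but your data sample contains only county 3, so it is hard to picture grouping by county. The code above should be added to something like group_by(county)

Tags: r time-series lag zoo

【Solution 1】:

With data.table: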

library(data.table)
setDT(df)[,`:=`(unemp_lag1=shift(unemp,n=1L,fill=NA, type="lag")),by=.(state, county)][]

   yearmonth state county unemp unemp_lag1
1:   2005-01     1      3   4.0         NA
2:   2005-02     1      3   3.6        4.0
3:   2005-03     1      3   1.4        3.6
4:   2005-01     2      3   3.7         NA
5:   2005-02     2      3   6.5        3.7
6:   2005-03     2      3   5.4        6.5

【Discussion】:

【Solution 2】:

Just an old-fashioned base R approach:

dsp <- split(df, list(df$state, df$county) )
dsp <- lapply(dsp, function(x) transform(x, unemp_lag =lag(unemp)))
dsp <- unsplit(dsp, list(df$state, df$county))
dsp
  yearmonth state county unemp unemp_lag
1   2005-01     1      3   4.0        NA
2   2005-02     1      3   3.6       4.0
3   2005-03     1      3   1.4       3.6
4   2005-01     2      3   3.7        NA
5   2005-02     2      3   6.5       3.7
6   2005-03     2      3   5.4       6.5

Edit

The lag function I used in my solution is dplyr's lag (although I didn't realize it until BondedDust commented). Here is a truly pure base R solution (I hope):

dsp <- split(df, list(df$state, df$county) )
dsp <- lapply(dsp, function(x) transform(x, unemp_lag = c(NA, unemp[-length(unemp)]) ) )
dsp <- unsplit(dsp, list(df$state, df$county))
dsp
  yearmonth state county unemp unemp_lag
1   2005-01     1      3   4.0        NA
2   2005-02     1      3   3.6       4.0
3   2005-03     1      3   1.4       3.6
4   2005-01     2      3   3.7        NA
5   2005-02     2      3   6.5       3.7
6   2005-03     2      3   5.4       6.5
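
For comparison, a shorter base R sketch that avoids the split/unsplit round trip uses ave(), which applies a function within each group and returns the results in the original row order. This is not part of the answer above: the helper name lag1 is made up for illustration, and the sketch assumes the rows are already sorted by yearmonth within each state-county group.

```r
# Example data from the question
df <- data.frame(
  yearmonth = c("2005-01", "2005-02", "2005-03", "2005-01", "2005-02", "2005-03"),
  state     = c(1, 1, 1, 2, 2, 2),
  county    = c(3, 3, 3, 3, 3, 3),
  unemp     = c(4.0, 3.6, 1.4, 3.7, 6.5, 5.4)
)

# Hypothetical helper: shift a vector down by one position, padding with NA
lag1 <- function(x) c(NA, head(x, -1))

# ave() applies lag1 within each state-county group and keeps the row order
df$unemp_lag <- ave(df$unemp, df$state, df$county, FUN = lag1)
df
```

Because ave() writes the grouped results back into their original positions, no separate recombination step is needed.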

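【Discussion】:

When I do this with stats::lag I get a different result. Are you sure there isn't some other lag function loaded that you haven't recognized? (I do get this result after loading dplyr.) And I have always thought the base lag function gives strange results.
Oh, yes! I've been a fool, you're right. It looks like I was using dplyr's lag. Is that OK? Is lag(unemp, -2) the right code?
@BondedDust Thank you very much. I fixed it with a true base R solution.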
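
The mix-up in this thread comes from two different functions both named lag. As a small sketch of the difference: stats::lag shifts the time base (the tsp attribute) of a series and leaves the values where they are, so on a plain vector the values come back in the same order. The variable names below are illustrative only.

```r
x <- c(4.0, 3.6, 1.4)

# stats::lag attaches and shifts time-series attributes; the values stay put
shifted <- stats::lag(x, 1)
values_after <- as.numeric(shifted)   # same values, same order as x

# The time base, not the data, is what moved
tsp(shifted)

# A positional lag in base R therefore has to be built by hand
manual <- c(NA, head(x, -1))
manual
```

This is why the first version of Solution 2 only produced the expected column when dplyr was loaded and its lag masked the one from stats.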

【Solution 3】:

With dplyr:

> library(dplyr)
> df %>% group_by(state, county) %>% mutate(unemp_lag=lag(unemp))
Source: local data frame [6 x 5]
Groups: state, county

   yearmonth state county unemp unemp_lag
1   2005-01     1      3   4.0        NA
2   2005-02     1      3   3.6       4.0
3   2005-03     1      3   1.4       3.6
4   2005-01     2      3   3.7        NA
5   2005-02     2      3   6.5       3.7
6   2005-03     2      3   5.4       6.5

And with data.table:

> df2 <- as.data.table(df)
> df2[, unemp_lag := c(NA , unemp[-.N]), by=list(state, county)]

   yearmonth state county unemp unemp_lag
1:   2005-01     1      3   4.0        NA
2:   2005-02     1      3   3.6       4.0
3:   2005-03     1      3   1.4       3.6
4:   2005-01     2      3   3.7        NA
5:   2005-02     2      3   6.5       3.7
6:   2005-03     2      3   5.4       6.5

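【Discussion】:

Awesome! Thanks! One more question: I'm trying to lag by six months. I tried adding the argument k=-6 or lags=6 but it doesn't work. Ideas?
If the data is sorted by month and contains consecutive months: mutate(unemp_lag=lag(unemp, n=6))
@zero323 data.table has a shift function that does this. See my solution.
It's sorted correctly, but it doesn't look complete. Is there a way to do this by turning yearmonth into a Date object?
The only way I can think of right now is to add the missing rows, but that is a rather brute-force approach. Maybe someone with more time-series experience can help you.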
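
One possible way to handle missing months without adding rows is to compute, for each row, the month it should be lagged from and then look that month up within the same state and county. The sketch below is only an illustration in base R, not something proposed in the thread. Instead of converting to Date it treats yearmonth as a count of months. It assumes yearmonth is always formatted "YYYY-MM" and that each state-county-month appears at most once. The sample data (with a deliberate gap) and the names shift_month, key, and target are invented here.

```r
# Made-up sample with a gap: state 1 has no row for 2005-02
df <- data.frame(
  yearmonth = c("2005-01", "2005-03", "2005-04", "2005-01", "2005-02", "2005-03"),
  state     = c(1, 1, 1, 2, 2, 2),
  county    = c(3, 3, 3, 3, 3, 3),
  unemp     = c(4.0, 1.4, 2.2, 3.7, 6.5, 5.4),
  stringsAsFactors = FALSE
)

# Hypothetical helper: move a "YYYY-MM" string back by k calendar months
shift_month <- function(ym, k) {
  year  <- as.integer(substr(ym, 1, 4))
  month <- as.integer(substr(ym, 6, 7))
  idx   <- year * 12L + (month - 1L) - as.integer(k)  # months since year 0
  sprintf("%04d-%02d", idx %/% 12L, idx %% 12L + 1L)
}

k <- 1  # use 6 for a six-month lag

# Key of each observation, and key of the observation to take the lag from
key    <- paste(df$state, df$county, df$yearmonth)
target <- paste(df$state, df$county, shift_month(df$yearmonth, k))

# match() gives NA where the target month is absent, so the lag is NA there
df$unemp_lag <- df$unemp[match(target, key)]
df
```

With this approach a month missing from the data produces NA in unemp_lag, rather than silently picking up the value from the previous available row as a purely positional lag would.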