Replace NA values with adjacent values in the time series or in the same column - data.table method

【Question title】: Replace NA values with adjacent value in the time series or in the same column - data.table method
【Posted】: 2016-02-07 07:45:38
【Question】:

Sample data

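library(data.table)
df <- data.frame(id = c("A","A","A","A","B","B","B","B"),
                 year = c(2014,2014,2015,2015),   # recycled to 8 rows
                 month = c(1,2),                  # recycled to 8 rows
                 marketcap = c(4,6,2,6,23,2,5,34),
                 return = c(NA,0.23,0.2,0.1,0.4,0.9,NA,0.6))
setDT(df)  # convert to a data.table by reference

df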
   id year month marketcap return
1:  A 2014     1         4     NA
2:  A 2014     2         6   0.23
3:  A 2015     1         2   0.20
4:  A 2015     2         6   0.10
5:  B 2014     1        23   0.40
6:  B 2014     2         2   0.90
7:  B 2015     1         5     NA
8:  B 2015     2        34   0.60

Desired data

desired_df <- data.frame(id=c("A","A","A","A","B","B","B","B"),year=c(2014,2014,2015,2015),month=c(1,2),marketcap=c(4,6,2,6,23,2,5,34),return=c(0.23,0.23,0.2,0.1,0.4,0.9,0.75,0.6))

desired_df
  id year month marketcap return
1  A 2014     1         4   0.23
2  A 2014     2         6   0.23
3  A 2015     1         2   0.20
4  A 2015     2         6   0.10
5  B 2014     1        23   0.40
6  B 2014     2         2   0.90
7  B 2015     1         5   0.75
8  B 2015     2        34   0.60

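I want to interpolate the return column within each id, replacing every NA with a value derived from its neighbours in the time series. Assume each year has only two months (1 and 2). The second NA, at (B, 2015, 1), should become 0.75 = (0.9 + 0.6) / 2. The first NA, at (A, 2014, 1), should become 0.23, because there is no earlier data to interpolate from.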
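A minimal base-R sketch of the arithmetic described above, assuming the rows of each id are already in time order and equally spaced. `approx()` with `rule = 2` interpolates linearly between neighbours and carries the nearest observed value out to the ends. The names `ret_a`, `ret_b` and `fill_linear` are made up for illustration.

```r
# Returns for ids "A" and "B" from the sample data, in time order
ret_a <- c(NA, 0.23, 0.20, 0.10)
ret_b <- c(0.40, 0.90, NA, 0.60)

# Hypothetical helper: linear interpolation over the row position.
# rule = 2 extends the closest observed value beyond the first/last observation.
fill_linear <- function(x) {
  ok <- !is.na(x)
  approx(x = which(ok), y = x[ok], xout = seq_along(x), rule = 2)$y
}

filled_a <- fill_linear(ret_a)  # leading NA takes the next observed value, 0.23
filled_b <- fill_linear(ret_b)  # interior NA becomes (0.9 + 0.6) / 2 = 0.75
```

Like any linear interpolation, this needs at least two observed values in the series.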

A data.table solution is preferred, if possible.

Update: when I use the following code structure (which works for the sample data)

df[,returnInterpolate:=na.approx(return,rule=2), by=id]

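I get the error: Error in approx(x[!na], y[!na], xout, ...) : need at least two non-NA values to interpolate

I guess some ids may have no non-NA values to interpolate from. Any suggestions?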

【Question comments】:

library(zoo); help("na.approx")
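Dear Roland, how do I use na.approx with by? I want to interpolate within each id. By the way, I just edited the question; I am also looking for a data.table solution so I can learn more of the syntax.
Error in approx(x[!na], y[!na], xout, ...) : need at least two non-NA values to interpolate --- this means the series you are applying the method to contains fewer than two non-NA values, so interpolation cannot work.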
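To see which ids trigger that error, one option is to count the non-NA values per group first. A small sketch, assuming a data.table with the same `id` and `return` columns as the question; the third id "C" and the column name `n_non_na` are invented for illustration.

```r
library(data.table)

dt <- data.table(
  id     = rep(c("A", "B", "C"), each = 4),
  return = c(NA, 0.23, 0.2, 0.1, 0.4, 0.9, NA, 0.6, NA, NA, 0.3, NA)
)

# Number of observed (non-NA) returns per id
counts <- dt[, .(n_non_na = sum(!is.na(return))), by = id]

# ids with fewer than two observations cannot be linearly interpolated
too_few <- counts[n_non_na < 2, id]
too_few
```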

Tags: r data.table interpolation na missing-data

【Solution 1】:

library(data.table)
df <- data.frame(id=c("A","A","A","A","B","B","B","B"),
                 year=c(2014,2014,2015,2015),
                 month=c(1,2),
                 marketcap=c(4,6,2,6,23,2,5,34),
                 return=c(NA,0.23,0.2,0.1,0.4,0.9,NA,0.6))
setDT(df)
library(zoo)
df[, returnInterpol := na.approx(return, rule = 2), by = id]
#   id year month marketcap return returnInterpol
#1:  A 2014     1         4     NA           0.23
#2:  A 2014     2         6   0.23           0.23
#3:  A 2015     1         2   0.20           0.20
#4:  A 2015     2         6   0.10           0.10
#5:  B 2014     1        23   0.40           0.40
#6:  B 2014     2         2   0.90           0.90
#7:  B 2015     1         5     NA           0.75
#8:  B 2015     2        34   0.60           0.60

Edit:

If some of your groups contain only NA values, or only a single non-NA value, you can do this:

df <- data.frame(id=c("A","A","A","A","B","B","B","B","C","C","C","C"),
                 year=c(2014,2014,2015,2015),
                 month=c(1,2),
                 marketcap=c(4,6,2,6,23,2,5,34, 1:4),
                 return=c(NA,0.23,0.2,0.1,0.4,0.9,NA,0.6,NA,NA,0.3,NA))
setDT(df)
df[, returnInterpol := switch(as.character(sum(!is.na(return))),
                              "0" = return,
                              "1" = {na.omit(return)},  
                              na.approx(return, rule = 2)), by = id]

#     id year month marketcap return returnInterpol
#  1:  A 2014     1         4     NA           0.23
#  2:  A 2014     2         6   0.23           0.23
#  3:  A 2015     1         2   0.20           0.20
#  4:  A 2015     2         6   0.10           0.10
#  5:  B 2014     1        23   0.40           0.40
#  6:  B 2014     2         2   0.90           0.90
#  7:  B 2015     1         5     NA           0.75
#  8:  B 2015     2        34   0.60           0.60
#  9:  C 2014     1         1     NA           0.30
# 10:  C 2014     2         2     NA           0.30
# 11:  C 2015     1         3   0.30           0.30
# 12:  C 2015     2         4     NA           0.30

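【Comments】:

Pascal, did you see that I marked it? Dear Roland, I have updated the question with another problem I ran into when using your suggestion above; please take a look.
@Pascal, what is the difference between voting +1 and marking? Thanks.
@PhamCongMinh The green check mark tells future readers with a similar problem that they can use this answer to solve it, just as it solved yours.
@Roland, I have now added a condition to filter out NA values, to get around the "need at least two non-NA values to interpolate" error: df[is.na(return) =TRUE,returnInterpol := na.approx(return, rule = 2), by = id]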
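Filtering rows on `is.na(return)` would keep only the missing rows, which leaves nothing to interpolate from. A possible alternative, sketched here under the assumption that ids with fewer than two observed values should simply stay NA, is to filter on a per-id count instead. The column name `n_ok` and the extra id "C" are illustrative.

```r
library(data.table)
library(zoo)

dt <- data.table(
  id     = rep(c("A", "B", "C"), each = 4),
  return = c(NA, 0.23, 0.2, 0.1, 0.4, 0.9, NA, 0.6, NA, NA, 0.3, NA)
)

# Count observed values per id, then interpolate only where approx() can work.
# Rows of skipped ids get NA in the new column.
dt[, n_ok := sum(!is.na(return)), by = id]
dt[n_ok >= 2, returnInterpol := na.approx(return, rule = 2), by = id]
dt
```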

【Solution 2】:

A simple imputeTS solution that ignores the id would be:

library("imputeTS")
na.interpolation(df)

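Since the imputation should be done per id, it gets a bit more complicated, because after filtering by id there often do not seem to be enough values left. I would take the solution Roland posted and use imputeTS::na.interpolation() where possible, and in the remaining cases perhaps fall back to the overall mean with imputeTS::na.mean() or to a random guess within the overall bounds with imputeTS::na.random().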
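The fallback idea can be sketched roughly as follows. To keep the example to the packages used in Solution 1, it uses zoo::na.approx() and base mean() as stand-ins for the imputeTS functions named above; that substitution, the helper name `fill_group`, and the extra id "C" are assumptions for illustration.

```r
library(data.table)
library(zoo)

dt <- data.table(
  id     = rep(c("A", "B", "C"), each = 4),
  return = c(NA, 0.23, 0.2, 0.1, 0.4, 0.9, NA, 0.6, NA, NA, 0.3, NA)
)

# Mean over all ids, used when a single id has too few observations
overall_mean <- dt[, mean(return, na.rm = TRUE)]

# Hypothetical helper: interpolate when possible, otherwise use the fallback value
fill_group <- function(x, fallback) {
  if (sum(!is.na(x)) >= 2L) {
    na.approx(x, rule = 2)
  } else {
    replace(x, is.na(x), fallback)
  }
}

dt[, returnFilled := fill_group(return, overall_mean), by = id]
dt
```

Whether an overall mean is a sensible substitute depends on the data; for returns that differ strongly between ids it may not be.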

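In this case it may also be a good idea to go beyond univariate time series interpolation/imputation. There are several other variables that could help estimate the missing values (if they are correlated with the returns). Packages like Amelia can help here.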
