Extend data frame column with inflation in R: answers

【Question title】: Extend data frame column with inflation in R
【Posted】: 2018-07-12 20:49:02
【Question】:

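I'm trying to extend some code so that it can: 1) read in a vector of prices, 2) left-join the price vector to a data frame of years (or years and months), and 3) append/fill prices for the missing years with interpolated data based on the last year for which a price is available, plus a specified inflation rate. Consider an example like this: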

prices <- data.frame(year=2018:2022,
                wti=c(75,80,90,NA,NA),
                brent=c(80,85,94,93,NA))

What I need is to fill the missing rows in each column with the last price plus inflation (say 2%). I can do this in a very brute-force way:

i_rate<-0.02
for(i in c(1:nrow(prices))){
   if(is.na(prices$wti[i]))
     prices$wti[i]<-prices$wti[i-1]*(1+i_rate)
   if(is.na(prices$brent[i]))
     prices$brent[i]<-prices$brent[i-1]*(1+i_rate)
}

It seems to me there should be a way to do this with some combination of apply() and/or fill(), but I can't seem to make it work.

Any help would be much appreciated.

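【Comments】:

A bit of an aside, but is there a reason you're using NA in quotes? Is that something you need to deal with in your real data?
No reason. Will fix in the post.

Tags: r dataframe dplyr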

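【Solution 1】:

As @camille pointed out, the problem with dplyr::lag is that it doesn't work on consecutive NAs, because it uses the "original" ith element of the vector rather than the "revised" ith element. So we first have to create a version of lag that does use the revised values, by writing a new function: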

impute_inflation <- function(x, rate) {
  output <- x
  # Preallocate the vector of lagged values; this is faster than growing
  # a vector inside the loop when x has a large number of elements.
  y <- rep(NA_real_, length(x))

  for (i in seq_along(output)) {
    if (i == 1) {
      y[i] <- output[i]      # Avoids an error from indexing the 0th element.
    } else {
      y[i] <- output[i - 1]  # Previous value, which may already be imputed.
    }

    # Fill a missing value with the previous value plus inflation.
    if (is.na(output[i])) {
      output[i] <- y[i] * (1 + rate)
    }
  }
  output
}

After that, applying it to a bunch of variables with dplyr::mutate_at() is straightforward:

library(dplyr)
mutate_at(prices, vars(wti, brent), impute_inflation, 0.02)

  year    wti brent
1 2018 75.000 80.00
2 2019 80.000 85.00
3 2020 90.000 94.00
4 2021 91.800 93.00
5 2022 93.636 94.86
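
Because each filled value is just the last observed price compounded once per elapsed period, the same result can also be written without an explicit loop. The following is only a sketch in base R, not part of the original answer; the function name impute_inflation_vec is hypothetical, and it assumes a numeric vector with one element per period:

```r
# Hypothetical loop-free variant: last observed value * (1 + rate)^periods elapsed.
impute_inflation_vec <- function(x, rate) {
  observed <- !is.na(x)
  # Index of the most recent observed value at or before each position
  # (0 where nothing has been observed yet).
  last_idx <- cummax(seq_along(x) * observed)
  # Number of periods since that observation (0 for observed values).
  gap <- seq_along(x) - last_idx

  out <- rep(NA_real_, length(x))
  has_base <- last_idx > 0  # Leading NAs have no base price and stay NA.
  out[has_base] <- x[last_idx[has_base]] * (1 + rate)^gap[has_base]
  out
}

impute_inflation_vec(c(75, 80, 90, NA, NA), 0.02)
# values: 75, 80, 90, 91.8, 93.636
```

Since it takes and returns a plain vector, it should drop into the mutate_at() call in place of impute_inflation.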

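【Comments】:

Thanks @Phil. Glad to see my Twitter addiction pay off on SO. I'll try this and post back, but one quick question: since in my case the NAs are absorbing (once a series goes NA, it stays that way), I think I could use just the second half of your loop, but I'll play around with it.
Works great. Thanks!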

【Solution 2】:

You can use dplyr::lag to get the previous value in a given column. Your lagged values look like this:

library(dplyr)

inflation_factor <- 1.02

prices <- tibble(year=2018:2022,
                 wti=c(75,80,90,NA,NA),
                 brent=c(80,85,94,93,NA)) %>%
  mutate_at(vars(wti, brent), as.numeric)

prices %>%
  mutate(prev_wti = lag(wti))
#> # A tibble: 5 x 4
#>    year   wti brent prev_wti
#>   <int> <dbl> <dbl>    <dbl>
#> 1  2018    75    80       NA
#> 2  2019    80    85       75
#> 3  2020    90    94       80
#> 4  2021    NA    93       90
#> 5  2022    NA    NA       NA

Where a value is NA, multiply the lagged value by the inflation factor. As you can see, this doesn't handle consecutive NAs.

prices %>%
  mutate(wti = ifelse(is.na(wti), lag(wti) * inflation_factor, wti),
         brent = ifelse(is.na(brent), lag(brent) * inflation_factor, brent))
#> # A tibble: 5 x 3
#>    year   wti brent
#>   <int> <dbl> <dbl>
#> 1  2018  75    80  
#> 2  2019  80    85  
#> 3  2020  90    94  
#> 4  2021  91.8  93  
#> 5  2022  NA    94.9

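Or, to scale this up and avoid writing the same multiplication over and over, gather the data into long format, take the lag within each group (wti, brent, or any others you may have), and adjust the values as needed. Then you can spread it back to its original shape: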

prices %>%
  tidyr::gather(key = key, value = value, wti, brent) %>%
  group_by(key) %>%
  mutate(value = ifelse(is.na(value), lag(value) * inflation_factor, value)) %>%
  tidyr::spread(key = key, value = value)
#> # A tibble: 5 x 3
#>    year brent   wti
#>   <int> <dbl> <dbl>
#> 1  2018  80    75  
#> 2  2019  85    80  
#> 3  2020  94    90  
#> 4  2021  93    91.8
#> 5  2022  94.9  NA

Created on 2018-07-12 by the reprex package (v0.2.0).
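
The lag-based versions above miss consecutive NAs because lag() reads the original column, never the values that have just been filled in. One possible way around that, sketched here in base R and not part of the original reprex, is a running fold with Reduce(accumulate = TRUE), which passes each newly computed value on to the next step. The helper name fill_with_inflation is hypothetical:

```r
# Hypothetical helper: carry the previous (possibly filled) value forward
# and multiply it by the inflation factor whenever the current value is NA.
fill_with_inflation <- function(x, inflation_factor) {
  Reduce(
    function(prev, cur) if (is.na(cur)) prev * inflation_factor else cur,
    x,
    accumulate = TRUE
  )
}

prices <- data.frame(year  = 2018:2022,
                     wti   = c(75, 80, 90, NA, NA),
                     brent = c(80, 85, 94, 93, NA))

# Apply the helper to every price column in one step.
price_cols <- c("wti", "brent")
prices[price_cols] <- lapply(prices[price_cols], fill_with_inflation,
                             inflation_factor = 1.02)
prices
# wti:   75, 80, 90, 91.8, 93.636
# brent: 80, 85, 94, 93, 94.86
```

A leading NA stays NA under this sketch, since there is no earlier price to inflate.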

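【Comments】:

Thanks! Consecutive NAs really are the heart of the problem, though: in my actual model I'll (potentially) be filling decades of missing values with "last plus inflation". I probably should have made that clearer in my reproducible example.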