Imputation model for time series missing data in R (answers)

【Question title】: Imputation model for time series missing data in R
【Posted】: 2019-04-22 02:35:51
【Question description】:

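The time series data consists of:

Product (categorical); ProductGroup (categorical); Country (categorical); YearSinceProductLaunch (numeric); SalesAtLaunchYear (numeric)

Only "SalesAtLaunchYear" has missing values that need to be imputed.

For some products the data is complete, i.e. sales figures exist for every year since launch (year 1, 2, ...) up to the present.

Other products, however, are missing sales data only for the first few years after launch. The products have different ages, so depending on the product the gap is sometimes 2 years after launch and sometimes 10.

I am interested in finding a model in R that can impute these gaps in the time series. I tried MICE with the imputation method for "SalesAtLaunchYear" set to random forest, but I still get some very high sales values, especially in the early years after launch. I made sure that sales are 0 in year 0 for every product, to avoid negative values. The data frame has 20,000 rows covering 300 unique products.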

testdf = tibble::tribble(
  ~Country,   ~ProductGroup,   ~Product, ~YearSinceProductLaunch, ~SalesAtLaunchYear,
      "CA", "ProductGroup1", "Product1",                      0L,                  0,
      "CA", "ProductGroup1", "Product1",                      1L,                 NA,
      "CA", "ProductGroup1", "Product1",                      2L,                 NA,
      "CA", "ProductGroup1", "Product1",                      3L,                 NA,
      "CA", "ProductGroup1", "Product1",                      4L,                 NA,
      "CA", "ProductGroup1", "Product1",                      5L,        206034.9814,
      "CA", "ProductGroup1", "Product1",                      6L,        170143.2623,
      "CA", "ProductGroup1", "Product1",                      7L,        212541.9306,
      "CA", "ProductGroup1", "Product1",                      8L,         270663.199,
      "CA", "ProductGroup1", "Product1",                      9L,        736738.3755,
      "CA", "ProductGroup1", "Product1",                     10L,        2579723.981,
      "CA", "ProductGroup1", "Product1",                     11L,        4964319.496,
      "CA", "ProductGroup1", "Product1",                     12L,         6864985.16,
      "CA", "ProductGroup1", "Product1",                     13L,        8793292.386,
      "CA", "ProductGroup1", "Product1",                     14L,        11416033.38,
      "IT", "ProductGroup2", "Product2",                      0L,                  0,
      "IT", "ProductGroup2", "Product2",                      1L,                 NA,
      "IT", "ProductGroup2", "Product2",                      2L,                 NA,
      "IT", "ProductGroup2", "Product2",                      3L,                 NA,
      "IT", "ProductGroup2", "Product2",                      4L,                 NA,
      "IT", "ProductGroup2", "Product2",                      5L,                 NA,
      "IT", "ProductGroup2", "Product2",                      6L,                 NA,
      "IT", "ProductGroup2", "Product2",                      7L,                 NA,
      "IT", "ProductGroup2", "Product2",                      8L,                 NA,
      "IT", "ProductGroup2", "Product2",                      9L,                 NA,
      "IT", "ProductGroup2", "Product2",                     10L,                 NA,
      "IT", "ProductGroup2", "Product2",                     11L,                 NA,
      "IT", "ProductGroup2", "Product2",                     12L,                 NA,
      "IT", "ProductGroup2", "Product2",                     13L,        30806222.96,
      "IT", "ProductGroup2", "Product2",                     14L,           31456272,
      "IT", "ProductGroup2", "Product2",                     15L,        31853476.78,
      "IT", "ProductGroup2", "Product2",                     16L,           30379818,
      "IT", "ProductGroup2", "Product2",                     17L,        29765448.87,
      "IT", "ProductGroup2", "Product2",                     18L,           31376234,
      "IT", "ProductGroup2", "Product2",                     19L,        32628514.81,
      "IT", "ProductGroup2", "Product2",                     20L,           32732196,
      "IT", "ProductGroup2", "Product2",                     21L,        33503784.25,
      "IT", "ProductGroup2", "Product2",                     22L,           35163372,
      "DE", "ProductGroup3", "Product3",                      0L,                  0,
      "DE", "ProductGroup3", "Product3",                      1L,         161884.081,
      "DE", "ProductGroup3", "Product3",                      2L,        7876925.474,
      "DE", "ProductGroup3", "Product3",                      3L,        12948209.55,
      "DE", "ProductGroup3", "Product3",                      4L,        13304401.76
  )


testdf$Country = as.factor(testdf$Country)
testdf$ProductGroup   = as.factor(testdf$ProductGroup)
testdf$Product  = as.factor(testdf$Product)

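【Comments】:

This question would be better with some data.
Could you make your problem reproducible by sharing a sample of your data so others can help (please do not use str(), head() or a screenshot)? You can use the reprex and datapasta packages to help you. See also Help me Help you & How to make a great R reproducible example?
I have added test data.

Tags: r time-series missing-data imputation r-mice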

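【Solution 1】:

Using mice will probably not give you the results you want, because it relies mainly on inter-variable correlations. What you are looking for is correlation in time.

My suggestion for this specific example is to split the dataset by Country, ProductGroup and Product, and to impute each resulting series with a time series imputation package.

Looking at your data, I think something like the function na.interpolation from the imputeTS package would already do a good job.

This is how you call it: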

library("imputeTS")
# In imputeTS 3.0 and later this function is named na_interpolation()
na.interpolation(yourTimeSeries)

You have to call it multiple times, once for each time series formed by a Country / ProductGroup / Product combination.
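A minimal sketch of that per-series loop is below. It uses base R's approx() as a stand-in for the linear interpolation in imputeTS so that it runs without extra packages. The helper names interpolate_series and impute_by_group, the SalesImputed column and the demo values are hypothetical and not part of the original post.

```r
# Linear interpolation of one series; approx() is a base R stand-in
# for the linear option of imputeTS (an assumption of this sketch).
interpolate_series <- function(year, sales) {
  observed <- !is.na(sales)
  if (sum(observed) < 2) return(sales)  # too few points to interpolate
  # rule = 2 carries the nearest observed value into leading/trailing gaps
  approx(x = year[observed], y = sales[observed], xout = year, rule = 2)$y
}

# Apply the interpolation separately to each Country/ProductGroup/Product.
impute_by_group <- function(df) {
  groups <- interaction(df$Country, df$ProductGroup, df$Product, drop = TRUE)
  df$SalesImputed <- df$SalesAtLaunchYear
  for (g in levels(groups)) {
    idx <- which(groups == g)
    idx <- idx[order(df$YearSinceProductLaunch[idx])]  # sort by year in group
    df$SalesImputed[idx] <- interpolate_series(
      df$YearSinceProductLaunch[idx], df$SalesAtLaunchYear[idx]
    )
  }
  df
}

# Made-up data in the same shape as testdf, for illustration only
demo <- data.frame(
  Country = c("CA", "CA", "CA", "CA", "DE", "DE", "DE"),
  ProductGroup = c("G1", "G1", "G1", "G1", "G2", "G2", "G2"),
  Product = c("P1", "P1", "P1", "P1", "P2", "P2", "P2"),
  YearSinceProductLaunch = c(0L, 1L, 2L, 3L, 0L, 1L, 2L),
  SalesAtLaunchYear = c(0, NA, NA, 300, 0, NA, 50)
)
result <- impute_by_group(demo)
print(result[, c("Product", "YearSinceProductLaunch", "SalesImputed")])
```

Because every product in the question has sales fixed at 0 in year 0, each gap is bracketed by observed values and the interpolation draws a straight line from launch to the first recorded year. Whether a straight line is a realistic launch curve is a separate modelling question.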

You could also just run

 na.interpolation(testdf$SalesAtLaunchYear)

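on your whole dataset, which is easier. In the example you showed, this would work as well. (It could cause problems if the rest of the data is structured differently, or if you use a different algorithm from the imputeTS package.)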
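To illustrate the kind of structure where the whole-column shortcut can go wrong, here is a small sketch with made-up values, again using base R's approx() as a stand-in for linear interpolation. It assumes a product whose last year is missing, which does not occur in the sample data; the names fill_linear, whole_column and per_product are hypothetical.

```r
# Two products stacked in one column; product A ends with a missing value.
sales   <- c(0, 100, NA, 0, NA, 50)
product <- c("A", "A", "A", "B", "B", "B")

# Linear interpolation over row position (stand-in for imputeTS).
fill_linear <- function(y) {
  pos <- seq_along(y)
  ok <- !is.na(y)
  # rule = 2 repeats the nearest observed value at the ends of the series
  approx(pos[ok], y[ok], xout = pos, rule = 2)$y
}

# Whole column at once: the gap at the end of A is filled using
# the launch value of B, mixing two unrelated series.
whole_column <- fill_linear(sales)

# Per product: each series is filled only from its own observations.
per_product <- ave(sales, product, FUN = fill_linear)

print(rbind(whole_column, per_product))
```

In the third position the whole-column version averages product A's last sale with product B's launch value of 0, while the per-product version only uses values from product A. Grouping first avoids this kind of leakage across series boundaries.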

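【Comments】:

Thanks, I have already been looking into this package as well. I will try it and report the results. I would like to know how to control for country effects and product group effects in this library, or with some other model. To clarify, there are also some products for which sales are available for all years since launch, from which the growth pattern in a country can be learned.
I see, so you can also make use of some inter-variable correlation. In that case you could try the Amelia II package. Section 4.6 / page 20 of the manual gives some hints on how to take the time aspect into account as well: cran.r-project.org/web/packages/Amelia/vignettes/amelia.pdf. It is just a matter of setting a few parameters. I would still compare the results against e.g. imputeTS; usually, when the correlation in time is much stronger than the inter-variable correlation, a univariate time series imputation method works better.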