【发布时间】:2019-04-22 02:35:51
【问题描述】:
时间序列数据包括:
产品(分类);产品组(分类);国家(分类); YearSinceProductLaunch(数字); SalesAtLaunchYear(数字)
只有“SalesAtLaunchYear”数据有一些需要估算的缺失值。
对于某些产品,有完整的数据,即存在从发布年份 1,2 到现在的销售数据。
但是,其他一些产品仅包含自推出以来最初几年的销售数据缺失。产品有不同的年龄,因此有时会缺少 2 年的发布时间,有时会缺少 10 年,这取决于产品。
我有兴趣在 R 中找到一个可以估算缺失的时间序列数据缺口的模型。我通过将“SalesAtLaunchYear”的模型设置为随机森林来尝试 MICE,但我仍然获得了一些非常高的销售额值,尤其是在产品发布之初。我确保在第 0 年,所有销售额均为 0,以避免出现负值。数据框有 20000 行,包含 300 个独特的产品。
testdf = tibble::tribble(
~Country, ~ProductGroup, ~Product, ~YearSinceProductLaunch, ~SalesAtLaunchYear,
"CA", "ProductGroup1", "Product1", 0L, 0,
"CA", "ProductGroup1", "Product1", 1L, NA,
"CA", "ProductGroup1", "Product1", 2L, NA,
"CA", "ProductGroup1", "Product1", 3L, NA,
"CA", "ProductGroup1", "Product1", 4L, NA,
"CA", "ProductGroup1", "Product1", 5L, 206034.9814,
"CA", "ProductGroup1", "Product1", 6L, 170143.2623,
"CA", "ProductGroup1", "Product1", 7L, 212541.9306,
"CA", "ProductGroup1", "Product1", 8L, 270663.199,
"CA", "ProductGroup1", "Product1", 9L, 736738.3755,
"CA", "ProductGroup1", "Product1", 10L, 2579723.981,
"CA", "ProductGroup1", "Product1", 11L, 4964319.496,
"CA", "ProductGroup1", "Product1", 12L, 6864985.16,
"CA", "ProductGroup1", "Product1", 13L, 8793292.386,
"CA", "ProductGroup1", "Product1", 14L, 11416033.38,
"IT", "ProductGroup2", "Product2", 0L, 0,
"IT", "ProductGroup2", "Product2", 1L, NA,
"IT", "ProductGroup2", "Product2", 2L, NA,
"IT", "ProductGroup2", "Product2", 3L, NA,
"IT", "ProductGroup2", "Product2", 4L, NA,
"IT", "ProductGroup2", "Product2", 5L, NA,
"IT", "ProductGroup2", "Product2", 6L, NA,
"IT", "ProductGroup2", "Product2", 7L, NA,
"IT", "ProductGroup2", "Product2", 8L, NA,
"IT", "ProductGroup2", "Product2", 9L, NA,
"IT", "ProductGroup2", "Product2", 10L, NA,
"IT", "ProductGroup2", "Product2", 11L, NA,
"IT", "ProductGroup2", "Product2", 12L, NA,
"IT", "ProductGroup2", "Product2", 13L, 30806222.96,
"IT", "ProductGroup2", "Product2", 14L, 31456272,
"IT", "ProductGroup2", "Product2", 15L, 31853476.78,
"IT", "ProductGroup2", "Product2", 16L, 30379818,
"IT", "ProductGroup2", "Product2", 17L, 29765448.87,
"IT", "ProductGroup2", "Product2", 18L, 31376234,
"IT", "ProductGroup2", "Product2", 19L, 32628514.81,
"IT", "ProductGroup2", "Product2", 20L, 32732196,
"IT", "ProductGroup2", "Product2", 21L, 33503784.25,
"IT", "ProductGroup2", "Product2", 22L, 35163372,
"DE", "ProductGroup3", "Product3", 0L, 0,
"DE", "ProductGroup3", "Product3", 1L, 161884.081,
"DE", "ProductGroup3", "Product3", 2L, 7876925.474,
"DE", "ProductGroup3", "Product3", 3L, 12948209.55,
"DE", "ProductGroup3", "Product3", 4L, 13304401.76
)
testdf$Country = as.factor(testdf$Country)
testdf$ProductGroup = as.factor(testdf$ProductGroup)
testdf$Product = as.factor(testdf$Product)
【问题讨论】:
-
这个问题有一些数据会更好
-
您能否通过分享您的数据样本来重现您的问题,以便其他人可以提供帮助(请不要使用
str()、head()或屏幕截图)?您可以使用reprex和datapasta包来帮助您。另见Help me Help you & How to make a great R reproducible example? -
我已经放了测试数据。
标签: r time-series missing-data imputation r-mice