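【Posted】: 2015-06-15 12:25:21
【Question】:
I am working with a trade data set of several thousand rows. Each record has a unique key based on symbol and date. Trade records for a given symbol are irregularly spaced, so zoo is a natural choice. I need to use lag and merge to create a new data set. However, I don't know how to set up a multi-column index in zoo so that I can use the lag function. A sample data set and the expected output are given below.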
df = data.frame(
dt = as.Date(c("2015-01-01", "2015-01-05", "2015-01-06",
"2015-01-01", "2015-01-02")),
id = c("i1", "i1", "i1", "i2", "i2"),
v1 = c(110, 115, 119, 212, 213),
v2 = c(100, 170, 180, 202, 210),
v3 = c(11, 13, 16, 22, 24)
)
df$id = as.character(df$id)
The output should be
2015-01-01, i1, 110, 100, 11, 2015-01-05, i1, 115, 170, 13
2015-01-05, i1, 115, 170, 13, 2015-01-06, i1, 119, 180, 16
2015-01-06, i1, 119, 180, 16, NA, NA, NA, NA, NA
2015-01-01, i2, 212, 202, 22, 2015-01-02, i2, 213, 210, 24
2015-01-02, i2, 213, 210, 24, NA, NA, NA, NA, NA
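On SO there are many posts that perform a "grouped" lag operation, but only for a single column. I am looking to merge complete rows, regardless of the number of columns.
Update to the question...
The following is one possible way to do the "grouped" lag operation with zoo.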
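library(zoo)  # provides zoo() and the zoo methods for merge() and lag()

# Process the rows of a single id: pair each row with the next one by date.
doProcessing = function(df){
  icolnames = colnames(df)
  tt = zoo(df, df$dt)          # index this group's rows by date
  tt1 = merge(tt, lag(tt, 1))  # lag(tt, 1) puts the next observation on each date
  colnames(tt1) = c(icolnames, paste0("lag_", icolnames))
  data.frame(tt1, stringsAsFactors=FALSE)
}
# Apply per id and stack the per-group results.
fin_df = do.call(rbind, with(df, by(df, list(id), doProcessing, simplify=FALSE)))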
In this final output data frame every field is a factor. How can I get the correct output structure, matching the input data frame?
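A likely cause, sketched here in base R only: a zoo object keeps its data in a single vector or matrix, and a matrix can hold only one type. Converting a data frame with Date, character and numeric columns to a matrix therefore yields strings, and older R versions turned strings into factors when building a data frame. The snippet below shows the coercion with `as.matrix()` and one way to re-guess the types with `type.convert()`; the name `restored` is only illustrative, and whether zoo goes through exactly this path internally is an assumption.

```r
# Sample data from the question.
df <- data.frame(
  dt = as.Date(c("2015-01-01", "2015-01-05", "2015-01-06",
                 "2015-01-01", "2015-01-02")),
  id = c("i1", "i1", "i1", "i2", "i2"),
  v1 = c(110, 115, 119, 212, 213),
  v2 = c(100, 170, 180, 202, 210),
  v3 = c(11, 13, 16, 22, 24),
  stringsAsFactors = FALSE
)

# A matrix has a single type, so mixed columns are all coerced to character.
m <- as.matrix(df)
typeof(m)

# Rebuild a data frame and let type.convert() re-guess each column's type.
restored <- as.data.frame(m, stringsAsFactors = FALSE)
restored[] <- lapply(restored, type.convert, as.is = TRUE)
restored$dt <- as.Date(restored$dt)  # dates come back as strings, convert by hand
str(restored)
```

Note that `type.convert()` guesses from the strings alone, so whole-number columns such as `v1` come back as integer rather than double, and date columns have to be converted explicitly.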
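Based on @Grothendieck's lapply idea, a possible solution to the problem above is given below.
library(zoo)  # provides zoo() and the zoo methods for merge() and lag()

# Process the rows of a single id: pair each row with the next one by date.
doProcessing = function(df){
  icolnames = colnames(df)
  tt = zoo(df, df$dt)          # index this group's rows by date
  tt1 = merge(tt, lag(tt, 1))  # lag(tt, 1) puts the next observation on each date
  colnames(tt1) = c(icolnames, paste0("lag_", icolnames))
  data.frame(tt1, stringsAsFactors=FALSE)
}
# Apply per id and stack the per-group results.
fin_df = do.call(rbind, with(df, by(df, list(id), doProcessing, simplify=FALSE)))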
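I still need some help: the resulting data frame somehow has every column as a factor. How can I restore the original structure?
Structure of the original data frame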
> str(df)
'data.frame': 5 obs. of 5 variables:
$ dt: Date, format: "2015-01-05" "2015-01-01" ...
$ id: chr "i1" "i1" "i1" "i2" ...
$ v1: num 115 110 119 212 213
$ v2: num 170 100 180 202 210
$ v3: num 13 11 16 22 24
The resulting data frame looks like
> str(fin_df)
'data.frame': 5 obs. of 10 variables:
$ dt : Factor w/ 4 levels "2015-01-01","2015-01-05",..: 1 2 3 1 4
$ id : Factor w/ 2 levels "i1","i2": 1 1 1 2 2
$ v1 : Factor w/ 5 levels "110","115","119",..: 1 2 3 4 5
$ v2 : Factor w/ 5 levels "100","170","180",..: 1 2 3 4 5
$ v3 : Factor w/ 5 levels "11","13","16",..: 1 2 3 4 5
$ lag_dt: Factor w/ 3 levels "2015-01-05","2015-01-06",..: 1 2 NA 3 NA
$ lag_id: Factor w/ 2 levels "i1","i2": 1 1 NA 2 NA
$ lag_v1: Factor w/ 3 levels "115","119","213": 1 2 NA 3 NA
$ lag_v2: Factor w/ 3 levels "170","180","210": 1 2 NA 3 NA
$ lag_v3: Factor w/ 3 levels "13","16","24": 1 2 NA 3 NA
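One way to keep the original column types might be to skip the matrix step altogether and shift whole rows inside each id group with plain data frame indexing. This is a minimal sketch, not a zoo solution: `lead_rows()` is a hypothetical helper name, and it assumes each (id, dt) pair is unique, as stated in the question.

```r
# Sample data from the question.
df <- data.frame(
  dt = as.Date(c("2015-01-01", "2015-01-05", "2015-01-06",
                 "2015-01-01", "2015-01-02")),
  id = c("i1", "i1", "i1", "i2", "i2"),
  v1 = c(110, 115, 119, 212, 213),
  v2 = c(100, 170, 180, 202, 210),
  v3 = c(11, 13, 16, 22, 24),
  stringsAsFactors = FALSE
)

# Hypothetical helper: for one id group, attach the next row (by date) to each row.
lead_rows <- function(g) {
  g <- g[order(g$dt), , drop = FALSE]
  # Indexing one row past the end yields an all-NA row for the last record.
  nxt <- g[seq_len(nrow(g)) + 1L, , drop = FALSE]
  names(nxt) <- paste0("lag_", names(g))
  rownames(g) <- NULL
  rownames(nxt) <- NULL
  cbind(g, nxt)
}

# Split by id, process each group, and stack the results.
fin_df <- do.call(rbind, lapply(split(df, df$id), lead_rows))
rownames(fin_df) <- NULL
str(fin_df)
```

Because every column stays a data frame column throughout, `dt` and `lag_dt` remain Date, `id` stays character, and the value columns stay numeric, regardless of how many columns the input has.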
【Discussion】: