Group rows into rolling sets of 3 and combine each set into a single row

【Question title】: Group rows into rolling sets of 3 and combine each set into a single row
【Posted】: 2017-07-23 07:39:30
【Question】:

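I have a data.frame that currently holds one record per row, but I want to convert it so that each row holds three records (to give a machine learning algorithm more trend data).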

As an example, my data.frame currently looks like this (although the real data has more variables than just Rank and Speed):

Date  | Participant | Ctry | Rank | Speed
----- |-------------|------|------|-------
17/01 | 1           | AU   | 1    | 0.9   
18/01 | 1           | AU   | 4    | 0.6   
19/01 | 1           | AU   | 2    | 0.7   
20/01 | 1           | AU   | 1    | 0.4   
17/01 | 2           | ZA   | 5    | 0.3   
18/01 | 2           | ZA   | 3    | 0.5   
19/01 | 2           | ZA   | 4    | 0.6

I want to convert it to look like this (rolling windows of 3 records per participant):

StartDate  | Participant | Ctry | Rank_1 | Rank_2 | Rank_3 | Speed_1 | Speed_2 | Speed_3
---------- | ----------- | ---- | ------ | ------ | ------ | ------- | ------- | -------
17/01      | 1           | AU   | 1      | 4      | 2      | 0.9     | 0.6     | 0.7
18/01      | 1           | AU   | 4      | 2      | 1      | 0.6     | 0.7     | 0.4
17/01      | 2           | ZA   | 5      | 3      | 4      | 0.3     | 0.5     | 0.6

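I could build this data structure with nested for loops, but I am sure there is a more efficient way to do it. I have looked at reshape(2) and the dplyr functions, but could not find anything that works for rolling windows over multiple variables.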

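【Comments】:

This is not clear to me.
I agree with @Sotos. Your example df shows only one value per Date/Participant/Ctry, so it is not clear where the values for Rank_1, Rank_2, etc. should come from. Although I now see that they are probably the values for the three most recent dates. Is that right? If so, what is the expected behavior when dates are missing for a participant?

Tags: r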

【Solution 1】:

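The OP has asked to reshape the data from long format to a special wide format in which each row ends up containing three records. For example, participant 1 will have one row with the values for 17/01, 18/01, and 19/01, and a second row with the values for 18/01, 19/01, and 20/01.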

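Note that this operation adds redundant data, since some values may appear up to 3 times after reshaping. Also note that the OP has asked to reshape multiple value variables at the same time. This is a feature that was added in recent versions of the data.table package.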
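As a minimal sketch of the multi-variable reshaping mentioned above, the toy table below (hypothetical data, not the OP's) is cast with two value columns at once. The names `long` and `wide` are assumptions made for this illustration only.

```r
library(data.table)

# Hypothetical toy data: one participant whose three records are
# already numbered 1..3 in the column 'variable'
long <- data.table(
  Participant = 1L,
  variable    = c("1", "2", "3"),
  Rank        = c(1L, 4L, 2L),
  Speed       = c(0.9, 0.6, 0.7)
)

# Casting several value columns at once creates one output column
# per combination, named <value column>_<variable>
wide <- dcast(long, Participant ~ variable, value.var = c("Rank", "Speed"))
print(names(wide))
```

This naming scheme is why the solution renames the window positions to plain numbers before calling dcast().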

The following solution uses shift(), melt(), and dcast() from the data.table package, together with a join:

library(data.table)
# define number of records per row
n_recs <- 3L
# create sequences of dates to be included per row using shift() with multiple offsets,
# keep only complete sequences, add StartDate column for later dcast()
windows <- na.omit(DT[, shift(Date, seq_len(n_recs) - 1L, type = "lead"), by = Participant])[
  , StartDate := V1]
# reshape to long form for later join, 
# rename variables for automatic creation of column names in dcast()
lwin <- melt(windows, id.vars = c("Participant", "StartDate"), value.name = "Date")[
    , variable := stringi::stri_replace(variable, fixed = "V", "")]
# right join with original data to create additional rows,
# reshape from long to wide form using multiple value vars,
# reorder for convenience 
dcast(
  DT[lwin, on = .(Participant, Date)], 
  StartDate + Participant + Ctry ~ variable, value.var = c("Rank", "Speed"))[
    order(Participant, StartDate)]

   StartDate Participant Ctry Rank_1 Rank_2 Rank_3 Speed_1 Speed_2 Speed_3
1:     17/01           1   AU      1      4      2     0.9     0.6     0.7
2:     18/01           1   AU      4      2      1     0.6     0.7     0.4
3:     17/01           2   ZA      5      3      4     0.3     0.5     0.6
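To make the first step of the code above more concrete, here is a small sketch of what shift() returns when it is given several offsets. It works on a plain vector for a single hypothetical participant; the names `dates`, `shifted`, and `windows` are chosen for this illustration.

```r
library(data.table)

# Dates of one hypothetical participant with four records
dates  <- c("17/01", "18/01", "19/01", "20/01")
n_recs <- 3L

# shift() with a vector of offsets returns a list with one
# lead-shifted copy of the input per offset (here 0, 1, 2)
shifted <- shift(dates, seq_len(n_recs) - 1L, type = "lead")

# Binding the copies column-wise gives one candidate window per row;
# rows that contain NA are incomplete windows and are dropped
windows <- na.omit(as.data.table(shifted))
print(windows)
```

Each remaining row lists the dates that belong to one window, with the first column serving as the start date.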

Data

library(data.table)
DT <- fread(
  "Date  | Participant | Ctry | Rank | Speed
  17/01 | 1           | AU   | 1    | 0.9   
  18/01 | 1           | AU   | 4    | 0.6   
  19/01 | 1           | AU   | 2    | 0.7   
  20/01 | 1           | AU   | 1    | 0.4   
  17/01 | 2           | ZA   | 5    | 0.3   
  18/01 | 2           | ZA   | 3    | 0.5   
  19/01 | 2           | ZA   | 4    | 0.6   ",
  sep = "|"
)

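Edit

I have realized that the code above relies on the implicit assumption that each participant has at least as many records as are to be combined into one row. The OP's sample data contain 4 rows for participant 1 and 3 rows for participant 2, so this condition is met.

However, if a participant has only one or two rows, na.omit() will remove that participant from the final result entirely. This may well be what the OP wants. If not, the code needs to be modified as follows: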

# create new sample data including cases with less than 3 records per participant
DT <- fread(
  "Date  | Participant | Ctry | Rank | Speed
  17/01 | 1           | AU   | 1    | 0.9   
  18/01 | 1           | AU   | 4    | 0.6   
  19/01 | 1           | AU   | 2    | 0.7   
  20/01 | 1           | AU   | 1    | 0.4   
  17/01 | 2           | ZA   | 5    | 0.3   
  18/01 | 2           | ZA   | 3    | 0.5   
  19/01 | 2           | ZA   | 4    | 0.6   
  17/01 | 3           | DE   | 2    | 0.8
  17/01 | 4           | DK   | 3    | 0.8
  18/01 | 4           | DK   | 4    | 0.8",
  sep = "|"
)

# modified code
n_recs <- 3L
min_rows <- 1L
windows <- DT[, lapply(shift(Date, seq_len(n_recs) - 1L, type = "lead"), 
                       head, n = pmax(.N - n_recs + 1L, min_rows)), 
              by = Participant][, StartDate := V1]
lwin <- melt(windows, id.vars = c("Participant", "StartDate"), value.name = "Date", 
             na.rm = TRUE)[
  , variable := stringi::stri_replace(variable, fixed = "V", "")]
dcast(
  DT[lwin, on = .(Participant, Date)], 
  StartDate + Participant + Ctry ~ variable, value.var = c("Rank", "Speed"))[
    order(Participant, StartDate)]

   StartDate Participant Ctry Rank_1 Rank_2 Rank_3 Speed_1 Speed_2 Speed_3
1:     17/01           1   AU      1      4      2     0.9     0.6     0.7
2:     18/01           1   AU      4      2      1     0.6     0.7     0.4
3:     17/01           2   ZA      5      3      4     0.3     0.5     0.6
4:     17/01           3   DE      2     NA     NA     0.8      NA      NA
5:     17/01           4   DK      3      4     NA     0.8     0.8      NA

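Note the "incomplete" rows 4 and 5, which result from the missing input data for participants 3 and 4. In return, every participant is guaranteed to appear in the final result.

This is achieved by using head() to explicitly limit the number of rows created for each participant when computing windows. In addition, melt() must now be called with the argument na.rm = TRUE.

If min_rows is set to 0L, the incomplete rows 4 and 5 disappear from the final result.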
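The effect of min_rows can be checked in isolation. The sketch below evaluates the expression passed to head() for a range of hypothetical group sizes; `n_records`, `windows_min1`, and `windows_min0` are names introduced for this illustration.

```r
n_recs <- 3L

# Hypothetical numbers of records per participant
n_records <- 1:5

# Number of windows kept per participant: one window per complete
# run of n_recs records, but never fewer than min_rows
windows_min1 <- pmax(n_records - n_recs + 1L, 1L)  # min_rows = 1L
windows_min0 <- pmax(n_records - n_recs + 1L, 0L)  # min_rows = 0L

print(data.frame(n_records, windows_min1, windows_min0))
```

With min_rows = 1L, participants with one or two records still get a single (incomplete) window, whereas with min_rows = 0L they get none.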

【Comments】: