【Question title】: Add rows in a data.frame so that each group is of equal length
【Posted】: 2021-07-17 17:43:18
【Question】:

I have the following data.frame, which mimics a time-series analysis:

df <- data.frame(country = rep(c("US", "GB", "DK"), each = 18),
             y = runif(54),
             time = c(-8:9, 0:17, -17:0))

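That is, I have 18 years of data per country, and somewhere within those 18 years an event of interest occurred. The time column sets that particular year to zero and counts forward/backward from it accordingly.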

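I need every group (US, GB, DK) to be the same size, set to the largest possible range, with all missing data filled with NA. In other words, I need the final data.frame to look like this: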

df2 <- data.frame(country = rep(c("US", "GB", "DK"), each = 18+17),
             y = c(rep(NA, 9), df[df$country == "US",]$y, rep(NA, 8),   # US observed at -8..9
                   rep(NA, 17), df[df$country == "GB",]$y,              # GB observed at 0..17
                   df[df$country == "DK",]$y, rep(NA, 17)),             # DK observed at -17..0
             time = rep(-17:17, times = 3))
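
The target size of 18+17 = 35 rows per country does not have to be worked out by hand. As a minimal sketch in base R (the `set.seed` call and the names `time_range`, `target_len` and `rows_now` are assumptions added for illustration, not part of the original question), it can be derived from the overall range of `time`:

```r
set.seed(1)  # assumption: only here to make the random y values reproducible
df <- data.frame(country = rep(c("US", "GB", "DK"), each = 18),
                 y = runif(54),
                 time = c(-8:9, 0:17, -17:0))

time_range <- range(df$time)        # earliest and latest offset across all countries
target_len <- diff(time_range) + 1  # number of rows every country should end up with
rows_now   <- table(df$country)     # rows each country has today

time_range             # -17 17
target_len             # 35
target_len - rows_now  # NA rows to add per country: 17 each
```

With 176 countries the same three lines would give the common range and the padding needed for each country without any hard-coded numbers.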

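In the actual data there are 176 countries, and the intervention happens in a different year for each one, so I really don't want to hard-code it like I just did! Is there a way to do this, perhaps with dplyr?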

【Question comments】:

Tags: r dplyr time-series

【Solution 1】:

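You can do this easily with tidyr::complete. Because the data is grouped by country, use min(df$time) and max(df$time) rather than min(time) and max(time) as the from and to arguments of seq. Inside the grouped call, min(time) and max(time) would be the minimum and maximum within each group, whereas df$time gives the overall minimum and maximum.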

library(dplyr)
library(tidyr)

df %>% group_by(country) %>%
  complete(time = seq(min(df$time), max(df$time), 1))

# A tibble: 105 x 3
# Groups:   country [3]
   country  time       y
   <chr>   <dbl>   <dbl>
 1 DK        -17 0.0796 
 2 DK        -16 0.361  
 3 DK        -15 0.503  
 4 DK        -14 0.415  
 5 DK        -13 0.426  
 6 DK        -12 0.0370 
 7 DK        -11 0.00867
 8 DK        -10 0.0254 
 9 DK         -9 0.619  
10 DK         -8 0.862  
# ... with 95 more rows

Viewing the last 12 rows of the result above:

df %>% group_by(country) %>%
  complete(time = seq(min(df$time), max(df$time), 1)) %>%
  ungroup() %>% tail(12)

# A tibble: 12 x 3
   country  time      y
   <chr>   <dbl>  <dbl>
 1 US          6  0.957
 2 US          7  0.265
 3 US          8  0.216
 4 US          9  0.445
 5 US         10 NA    
 6 US         11 NA    
 7 US         12 NA    
 8 US         13 NA    
 9 US         14 NA    
10 US         15 NA    
11 US         16 NA    
12 US         17 NA
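
To see why the answer reaches for `df$time` rather than `time` inside the grouped call, the two variants can be compared side by side. This is a small sketch under the same setup as the question; the `set.seed` call and the names `per_group` and `overall` are assumptions added for illustration:

```r
library(dplyr)
library(tidyr)

set.seed(1)  # assumption: only here to make the random y values reproducible
df <- data.frame(country = rep(c("US", "GB", "DK"), each = 18),
                 y = runif(54),
                 time = c(-8:9, 0:17, -17:0))

# min(time)/max(time) are evaluated within each country, so each group
# is completed only over the 18 years it already has
per_group <- df %>% group_by(country) %>%
  complete(time = seq(min(time), max(time), 1)) %>%
  ungroup()

# df$time bypasses the grouping and spans the overall range -17..17
overall <- df %>% group_by(country) %>%
  complete(time = seq(min(df$time), max(df$time), 1)) %>%
  ungroup()

nrow(per_group)         # 54: nothing was added
nrow(overall)           # 105: 35 rows per country
table(overall$country)  # DK 35, GB 35, US 35
```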

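Once you know how complete works, the above can be done in a single line of code. This works here because every year from -17 to 17 appears for at least one country; complete only expands over the values of time that are actually present in the data.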

complete(df, time, nesting(country))
# A tibble: 105 x 3
    time country       y
   <int> <chr>     <dbl>
 1   -17 DK       0.0796
 2   -17 GB      NA     
 3   -17 US      NA     
 4   -16 DK       0.361 
 5   -16 GB      NA     
 6   -16 US      NA     
 7   -15 DK       0.503 
 8   -15 GB      NA     
 9   -15 US      NA     
10   -14 DK       0.415 
# ... with 95 more rows
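
If some year never appears for any country, the one-liner above would leave that year out, since it only combines observed values of `time`. One possible way to guard against this is `tidyr::full_seq`, which builds the full regular sequence between the smallest and largest value. The sketch below is an illustration only: `df_gap` is a hypothetical data set made by dropping every row at time 5, and `set.seed` plus the result names are likewise assumptions:

```r
library(tidyr)

set.seed(1)  # assumption: only here to make the random y values reproducible
df <- data.frame(country = rep(c("US", "GB", "DK"), each = 18),
                 y = runif(54),
                 time = c(-8:9, 0:17, -17:0))

# Hypothetical gap: remove time 5 everywhere so it is never observed
df_gap <- df[df$time != 5, ]

# Only observed values of time are combined with each country
observed_only <- complete(df_gap, time, nesting(country))

# full_seq() fills in the missing year before the combinations are built
full_range <- complete(df_gap, country, time = full_seq(time, 1))

nrow(observed_only)  # 102: 34 distinct years x 3 countries
nrow(full_range)     # 105: 35 years x 3 countries
```

No grouping is needed in this form, because `full_seq(time, 1)` is evaluated on the whole `time` column.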

【Comments】:

dplyr and StackOverflow never cease to amaze. Thanks!
Glad I could help. See also the edit.