【Question title】: Insert new rows in dataframe based on another dataframe
【Posted】: 2018-03-15 13:34:39
【Question】:

Sample data

library(data.table)

dat <- data.table(yr = c(2013,2013,2013,2013,2013,2013,2013,2013,2013,2013,2012,2012,2012,2012,2012,2012,2012,2012,2012,2012,2012),
                  location = c("Bh","Bh","Bh","Bh","Bh","Go","Go","Go","Go","Go","Bh","Bh","Bh","Bh","Bh","Bh","Go","Go","Go","Go","Go"),
                  time.period = c("t4","t5","t6","t7","t8","t3","t4","t5","t6","t7","t3","t4","t5","t6","t7","t8","t3","t4","t5","t6","t7"),
                  period = c(20,21,22,23,24,19,20,21,22,23,19,20,21,22,23,24,19,20,21,22,23),
                  value = c(runif(21)))

key <- data.table(time.period = c("t1","t2","t3","t4","t5","t6","t7","t8","t9","t10"),
                  period = c(17,18,19,20,21,22,23,24,25,26))

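key gives the period associated with each time.period.

In the data table dat, for each combination of location and yr, I want to insert additional rows for every time.period/period pair that is missing.

For example, for location Bh and yr 2013: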

        dat[location == "Bh" & yr == 2013,]

            yr    location    time.period  period      value
        1: 2013       Bh          t4       20      0.7167561
        2: 2013       Bh          t5       21      0.5659722
        3: 2013       Bh          t6       22      0.8549229
        4: 2013       Bh          t7       23      0.1046213
        5: 2013       Bh          t8       24      0.8144670

I want to get:

            yr    location    time.period  period      value
        1: 2013       Bh          t1        17       0
        2: 2013       Bh          t2        18       0
        3: 2013       Bh          t3        19       0
        4: 2013       Bh          t4        20       0.7167561
        5: 2013       Bh          t5        21       0.5659722
        6: 2013       Bh          t6        22       0.8549229
        7: 2013       Bh          t7        23       0.1046213
        8: 2013       Bh          t8        24       0.8144670
        9: 2013       Bh          t9        25       0
       10: 2013       Bh          t10       26       0

I tried this:

   dat %>% group_by(location,yr) %>% complete(period = seq(17, max(26), 1L))

   A tibble: 40 x 5
   Groups:   location, yr [4]
          location    yr period time.period      value
             <chr>   <dbl>  <dbl>       <chr>      <dbl>
     1       Bh      2012     17        <NA>         NA
     2       Bh      2012     18        <NA>         NA
     3       Bh      2012     19          t3 0.46757583
     4       Bh      2012     20          t4 0.07041745
     5       Bh      2012     21          t5 0.58707367
     6       Bh      2012     22          t6 0.83271673
     7       Bh      2012     23          t7 0.76918731
     8       Bh      2012     24          t8 0.25368225
     9       Bh      2012     25        <NA>         NA
    10       Bh      2012     26        <NA>         NA
    # ... with 30 more rows

As you can see, time.period is not filled in. How can I fill in that column?

【Question comments】:

Tags: r dplyr data.table


【Solution 1】:

tidyr::complete can be used to solve this.

library(dplyr)
library(tidyr)
dat %>% complete(yr, location, key, fill = list(value = 0))

# # A tibble: 40 x 5
#    yr  location time.period period  value
#   <dbl> <chr>    <chr>        <dbl> <dbl>
# 1  2012 Bh       t1            17.0 0    
# 2  2012 Bh       t2            18.0 0    
# 3  2012 Bh       t3            19.0 0.177
# 4  2012 Bh       t4            20.0 0.687
# 5  2012 Bh       t5            21.0 0.384
# 6  2012 Bh       t6            22.0 0.770
# 7  2012 Bh       t7            23.0 0.498
# 8  2012 Bh       t8            24.0 0.718
# 9  2012 Bh       t9            25.0 0    
# 10  2012 Bh       t10           26.0 0    
# # ... with 30 more rows

Data

dat <- data.table(yr = c(2013,2013,2013,2013,2013,2013,2013,2013,2013,2013,2012,2012,2012,2012,2012,2012,2012,2012,2012,2012,2012),                  
                         location = c("Bh","Bh","Bh","Bh","Bh","Go","Go","Go","Go","Go","Bh","Bh","Bh","Bh","Bh","Bh","Go","Go","Go","Go","Go"),
                          time.period = c("t4","t5","t6","t7","t8","t3","t4","t5","t6","t7","t3","t4","t5","t6","t7","t8","t3","t4","t5","t6","t7"),
                          period = c(20,21,22,23,24,19,20,21,22,23,19,20,21,22,23,24,19,20,21,22,23),
                          value = c(runif(21)))

key <- data.table(time.period = c("t1","t2","t3","t4","t5","t6","t7","t8","t9","t10"),
                           period = c(17,18,19,20,21,22,23,24,25,26))
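
A related sketch, in case you would rather keep the group_by() + complete() call from the question: expand period within each group, then look time.period up in key with a join. This is only one possible variant, not part of the original answer. The sample data is rebuilt compactly with rep() and paste0(); set.seed(1) and the name filled are my own assumptions for illustration.

```r
library(data.table)
library(dplyr)
library(tidyr)

set.seed(1)  # assumption: fixed seed so the sketch is reproducible

# Same sample data as above, written compactly
dat <- data.table(
  yr          = c(rep(2013, 10), rep(2012, 11)),
  location    = c(rep("Bh", 5), rep("Go", 5), rep("Bh", 6), rep("Go", 5)),
  time.period = paste0("t", c(4:8, 3:7, 3:8, 3:7)),
  period      = c(20:24, 19:23, 19:24, 19:23),
  value       = runif(21)
)
key <- data.table(time.period = paste0("t", 1:10),
                  period      = as.numeric(17:26))

filled <- dat %>%
  as_tibble() %>%
  select(-time.period) %>%                 # drop it; it is re-derived from key below
  group_by(location, yr) %>%
  complete(period = key$period,            # every period listed in key, per group
           fill = list(value = 0)) %>%     # new rows get value 0
  ungroup() %>%
  left_join(as_tibble(key), by = "period") # fill time.period from key

filled
```

The design choice here is to treat key as a lookup table: complete() only has to expand one column, and the join guarantees that time.period and period always agree with key.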

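【Comments】:

  • @Crop89 Glad I could help. Thank you for providing sample data along with your attempted solution. It helps in finding a solution :-)

【Solution 2】:

Since you are using data.table, you can do the following: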

dat_new <- dat[,.SD[key, on='time.period'],.(location, yr)]
dat_new[, period := i.period][, i.period := NULL]
dat_new[is.na(value), value := 0]

print(head(dat_new, 10))

    location   yr time.period period     value
 1:       Bh 2013          t1     17 0.0000000
 2:       Bh 2013          t2     18 0.0000000
 3:       Bh 2013          t3     19 0.0000000
 4:       Bh 2013          t4     20 0.9255600
 5:       Bh 2013          t5     21 0.3816035
 6:       Bh 2013          t6     22 0.5202268
 7:       Bh 2013          t7     23 0.5326466
 8:       Bh 2013          t8     24 0.5091590
 9:       Bh 2013          t9     25 0.0000000
10:       Bh 2013         t10     26 0.0000000

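Explanation:

1. First, we join the key data frame with each .(location, yr) group in dat.
2. This adds the period column from the key data frame as i.period.
3. Finally, we set period := i.period, delete the i.period column, and set NA values to 0.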
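
To see where i.period comes from, here is a minimal sketch of the same join on a single hand-made group. The table grp and its values are made up for illustration. When both tables have a non-join column with the same name (period here), data.table keeps the column from x and adds the one from i with an i. prefix.

```r
library(data.table)

# Hypothetical single group: only t4 and t5 are present
grp <- data.table(time.period = c("t4", "t5"),
                  period      = c(20, 21),
                  value       = c(0.7, 0.5))
key <- data.table(time.period = paste0("t", 1:10),
                  period      = as.numeric(17:26))

# One output row per row of key; unmatched rows get NA in grp's columns
joined <- grp[key, on = "time.period"]
names_after_join <- names(joined)  # includes both "period" and "i.period"

# Same clean-up as in the answer
joined[, period := i.period][, i.period := NULL]
joined[is.na(value), value := 0]
joined
```

Running this per group is what .SD[key, on = 'time.period'] does inside the by = .(location, yr) call.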

【Comments】:

    【Solution 3】:

    Do you need something like this?

    x <- merge(dat, key, by = "time.period", all.y = TRUE)
    x[is.na(x)] <- 0
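
A plain merge against key only adds time periods that are missing from dat as a whole, not per location and yr. One possible way to extend the merge idea, sketched under my own assumptions: first build a grid with every (location, yr) pair crossed with every row of key, then merge dat onto that grid. The names groups, idx and grid, and the set.seed(1) call, are hypothetical and not from the original answer.

```r
library(data.table)

set.seed(1)  # assumption: fixed seed so the sketch is reproducible

# Same sample data as in the question, written compactly
dat <- data.table(
  yr          = c(rep(2013, 10), rep(2012, 11)),
  location    = c(rep("Bh", 5), rep("Go", 5), rep("Bh", 6), rep("Go", 5)),
  time.period = paste0("t", c(4:8, 3:7, 3:8, 3:7)),
  period      = c(20:24, 19:23, 19:24, 19:23),
  value       = runif(21)
)
key <- data.table(time.period = paste0("t", 1:10),
                  period      = as.numeric(17:26))

# Every (location, yr) pair that occurs in dat
groups <- unique(dat[, .(location, yr)])

# Cross join row indices, then bind the matching rows side by side
idx  <- CJ(g = seq_len(nrow(groups)), k = seq_len(nrow(key)))
grid <- cbind(groups[idx$g], key[idx$k])

# Keep every grid row; rows absent from dat get NA in value
x <- merge(grid, dat,
           by = c("location", "yr", "time.period", "period"),
           all.x = TRUE)
x[is.na(value), value := 0]
x
```

Restricting the NA replacement to the value column also avoids writing 0 into character columns, which x[is.na(x)] <- 0 could do.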
    

    【Comments】:
