【Question title】: Combining two data frames by group with differing numbers of observations in R
【Posted】: 2021-02-15 19:04:01
【Question description】:

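I have two data frames. The first contains three variables, ID, Latitude and Longitude, and has 368 rows. The second contains three variables, ID, Date and value, and has 3,058,478 rows. Each ID has multiple observations per day, and the second dataset holds 10 years of daily measurements.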

   DT1:                                  DT2: 
           ID    Latitude  Longitude            ID       Date        value
           1     38.2     -121.1                1       2000-01-01    3.1
           1     38.0     -123.1                1       2000-01-01    3.1
           1     33.8     -118.1                1       2000-01-01    3.1 
           1     34.9     -117.1                1       2000-01-01    3.8
           1     32.6     -117.1                1       2000-01-01    4.3
           1     37.6     -119.1                10      2000-01-01    3.2
          10     38.3     -121.1                10      2000-01-01    3.6
          10     39.8     -122.1                10      2000-01-01    1.2
          10     37.9     -122.1                10      2000-01-01    3.6
          10     39.5     -122.1                10      2000-01-01    1.1
          10     38.3     -122.1
   

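I would like to take the first 5 observations for ID 1 from DT1 and combine them with the DT2 rows for ID 1, then repeat this for every ID in DT2. The number of observations per ID in DT1 is always equal to or greater than the number for that ID in DT2. Whenever an ID has more observations in DT1, I only want the first n, where n matches the number of observations in DT2. DT2 has to be grouped by Date and ID, and the Latitude and Longitude measurements can then be column-bound to each group to give this final result: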

End result:
     ID   Date        value   Latitude Longitude
     1    2000-01-01  3.1      38.2    -121.1
     1    2000-01-01  3.1      38.0    -123.1
     1    2000-01-01  3.1      33.8    -118.1
     1    2000-01-01  3.8      34.9    -117.1
     1    2000-01-01  4.3      32.6    -117.1
    10    2000-01-01  3.2      38.3    -121.1
    10    2000-01-01  3.6      39.8    -122.1
    10    2000-01-01  1.2      37.9    -122.1
    10    2000-01-01  3.6      39.5    -122.1
    10    2000-01-01  1.1      38.3    -122.1

Data:

  DT2<-structure(list(Date = structure(c(10957, 10957, 10957, 
  10957, 10957, 10957, 10957, 10957, 10957, 10957, 10957, 10957, 
  10957, 10957, 10957, 10957, 10957, 10957, 10957, 10957, 10957, 
  10957, 10957, 10957, 10957, 10957, 10957, 10957, 10957, 10957
   ), class = "Date"), value = c(3.1, 3.1, 3.1, 3.8, 4.3, 
   3.2, 3.6, 1.2, 3.6, 1.1, 2.6, 3.8, 1.7, 4.8, 2.5, 1.7, 2.2, 2.8, 
  2.8, 1.8, 2.8, 3, 2.9, 3.6, 2, 2.4, 2.3, 3.4, 5.3, 5),ID = c("1", 
  "1", "1", "1", "1", "10", "10", "10", "10", "10", "1001", "1001", 
 "1001", "1001", "1001", "1002", "1002", "1002", "1002", "1002", 
  "1003", "1003", "1003", "1003", "1003", "1004", "1004", "1004", 
  "1004", "1004")), row.names = c(NA, 
  -30L), class = c("grouped_df", "tbl_df", "tbl", "data.frame"), groups = structure(list(
   Date = structure(c(10957, 10957, 10957, 10957, 10957, 
   10957), class = "Date"), ID = c("1", "10", "1001", 
    "1002", "1003", "1004"), .rows = list(1:5, 6:10, 11:15, 16:20, 
    21:25, 26:30)), row.names = c(NA, -6L), class = c("tbl_df", 
    "tbl", "data.frame"), .drop = TRUE))

   DT1<-structure(list(ID = c(1, 1, 1, 1, 1, 1, 10, 10, 10, 10, 
    10, 10, 10, 10, 10, 1001, 1001, 1001, 1001, 1001, 1001, 1002, 
    1002, 1002, 1002, 1002, 1002, 1003, 1003, 1003, 1003, 1003, 1003, 
    1003, 1003, 1004, 1004, 1004, 1004, 1004, 1004, 1004, 1004), 
    Latitude = c(38.201852, 37.97231, 33.821353, 34.895007, 32.631231, 
    37.64571, 38.725282, 35.385574, 38.558228, 34.421389, 37.138333, 
    38.0313, 37.7603, 33.747236, NA, 37.535833, 32.952124, 37.482934, 
    39.338504, 37.226862, 35.1019, 39.202935, 38.006311, 34.17605, 
    33.127711, 37.950741, 37.7481, 37.9642, 36.69676, 33.67464, 
    38.654069, 38.66121, 32.79222, 37.8375, 37.07206, 36.314399, 
    34.10374, 34.448048, 37.9604, 40.776944, 37.7478, 33.9397, 
    39.166017), Longitude = c(-120.681567, -122.520004, -117.91427, 
   -117.024484, -117.059075, -118.96652, -120.821916, -119.015009, 
   -121.492981, -119.701111, -119.266667, -122.1318, -122.1925, 
   -115.820124, NA, -121.961823, -117.264088, -122.20337, -120.171291, 
   -121.979675, -115.7767, -122.017728, -121.641918, -118.31712, 
    -117.075325, -121.268523, -119.5917, -122.3403, -121.637182, 
    -117.92568, -122.901857, -121.73269, -115.56306, -119.45, 
    -122.00764, -119.64457, -117.62914, -119.231321, -122.356811, 
    -124.1775, -119.5917, -115.4108, -120.148833)), row.names = c(NA, 
     -43L), class = c("tbl_df", "tbl", "data.frame"))

【Question discussion】:

    Tags: r data-binding merge


    【Solution 1】:

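    We can count the rows per 'ID' in DT2, join that count onto DT1, slice the first n rows of each 'ID' group, and then attach the DT2 columns with bind_cols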

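    library(dplyr)
    # Count the rows per ID in DT2; the count lands in column `n`
    DT2 %>% 
         ungroup %>% 
         count(ID) %>%
        # Attach that count to DT1 (ID converted to character to match DT2)
        right_join(DT1 %>% 
                 mutate(ID = as.character(ID))) %>%
                 group_by(ID) %>%
                 # Keep only the first n rows of each ID
                 slice(seq_len(first(n))) %>% 
                 select(-n) %>%
        # Bind the DT2 columns by position; both sides must share the same row order
        bind_cols(DT2 %>%
                   ungroup %>% 
                   select(-ID))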
    #Joining, by = "ID"
    # A tibble: 30 x 5
    # Groups:   ID [6]
       ID    Latitude Longitude Date       value
       <chr>    <dbl>     <dbl> <date>     <dbl>
     1 1         38.2     -121. 2000-01-01   3.1
     2 1         38.0     -123. 2000-01-01   3.1
     3 1         33.8     -118. 2000-01-01   3.1
     4 1         34.9     -117. 2000-01-01   3.8
     5 1         32.6     -117. 2000-01-01   4.3
     6 10        38.7     -121. 2000-01-01   3.2
     7 10        35.4     -119. 2000-01-01   3.6
     8 10        38.6     -121. 2000-01-01   1.2
     9 10        34.4     -120. 2000-01-01   3.6
    10 10        37.1     -119. 2000-01-01   1.1
    # … with 20 more rows
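
The pipeline above binds columns by position, so it depends on both tables ending up in the same row order. As a rough sketch of an alternative (not the answerer's method), one could number the rows within each group and join on that index, which should also cope with group sizes that differ between IDs and dates. The toy tables `dt1_toy` and `dt2_toy` and the helper column `idx` are made up for illustration and are not part of the original post.

```r
library(dplyr)

# Hypothetical toy data: ID "1" has 3 coordinates, ID "10" has 2.
dt1_toy <- data.frame(
  ID        = c("1", "1", "1", "10", "10"),
  Latitude  = c(38.2, 38.0, 33.8, 38.3, 39.8),
  Longitude = c(-121.1, -123.1, -118.1, -121.1, -122.1),
  stringsAsFactors = FALSE
)

# Hypothetical observations: 2 per ID on a single date.
dt2_toy <- data.frame(
  ID    = c("1", "1", "10", "10"),
  Date  = as.Date("2000-01-01"),
  value = c(3.1, 3.8, 3.2, 3.6),
  stringsAsFactors = FALSE
)

# Position of each coordinate within its ID.
dt1_idx <- dt1_toy %>%
  group_by(ID) %>%
  mutate(idx = row_number()) %>%
  ungroup()

# Position of each observation within its ID and Date, then match on it.
result <- dt2_toy %>%
  group_by(ID, Date) %>%
  mutate(idx = row_number()) %>%
  ungroup() %>%
  left_join(dt1_idx, by = c("ID", "idx")) %>%
  select(-idx)

print(result)
```

Because the index restarts for every ID and Date combination, the first n coordinates of an ID are reused on each date. With the data from the question, `DT1$ID` would first need `as.character()`, as in the answer above.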
    

    【Discussion】:

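    • Thanks for the reply! The number of observations in DT2 varies from group to group. It is not always 5; sometimes it may be 4 or 6. How would I modify the code above to handle this?
    • @Sara Can you tell me how you determine n? You need a function
    • @Sara Please check the update if that is what you wanted
    • The data I provided are the first 30 observations, which happen to have 5 observations per ID, but in my full data there may be 4 to 6 per group.
    • @Sara slice_head will take 5 rows if a group has 5 or more rows. If it has fewer, it takes all the rows available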
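
The `slice_head` behaviour described in the last comment can be checked on a small example. This is only an illustrative sketch: the data frame `toy` and its columns are invented here, with one group larger than 5 rows and one smaller.

```r
library(dplyr)

# Hypothetical data: group "a" has 7 rows, group "b" has 3.
toy <- data.frame(
  ID = c(rep("a", 7), rep("b", 3)),
  x  = 1:10,
  stringsAsFactors = FALSE
)

# Ask for up to 5 rows per group.
kept <- toy %>%
  group_by(ID) %>%
  slice_head(n = 5) %>%
  ungroup()

# Rows kept per group.
print(count(kept, ID))
```

Group "a" is cut down to its first 5 rows, while group "b" keeps all 3 of its rows, so no error is raised when a group is shorter than `n`.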