【Question title】: Merging data frames by id while interweaving years and carrying values forward between years
【Posted】: 2022-01-13 12:28:44
【Question description】:

I have two data frames that I want to merge. Both contain information about people, with one row per id and year.

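One of them is the "main" table and the other adds information to it. However, I can't merge them the usual way (i.e. with merge() or dplyr::left_join()), because the year values in the two tables don't necessarily match for a given id. Instead, I want to carry what is known from the second table forward in time into every year row of the main table.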

In the example below, I have two tables about army officers. The "main" table has 3 columns: id, year, and another column, col_1:

library(tibble)  # provides tribble()

df_main_info <-
  tribble(~id, ~year, ~col_1, 
          1,   2008,  "foo",
          1,   2005,  "bar",
          1,   2010,  "blah", 
          1,   2020,  "bar",  
          2,   1999,  "foo", 
          2,   2020,  "foo",  
          3,   2002,  "bar",
          3,   2010,  "bar",
          4,   2003,  "foo",
          4,   2010,  "bar"
  )

I also have an additional table, with id and year columns, that records which rank each officer attained and when:

df_ranks_history <-
  tribble(~id, ~year, ~army_rank,
          1,   2005,  "second_lieutenant",
          1,   2010,  "first_lieutenant",
          1,   2018,  "major",
          1,   2021,  "colonel",
          2,   2002,  "major",
          2,   2018,  "colonel",
          3,   1995,  "second_lieutenant",
          3,   2000,  "captain",
          3,   2012,  "colonel"
  )

The years don't match exactly. But if, for example, the officer with id = 3 became "captain" in 2000, then we know this was still the case in 2002, so we can enter "captain" in row 7 of df_main_info.

Therefore, the desired output is:


desired_output <-
  tribble(~id, ~year, ~col_1, ~army_rank,
          1,   2008,  "foo",   "second_lieutenant",
          1,   2005,  "bar",   "second_lieutenant",
          1,   2010,  "blah",  "first_lieutenant",
          1,   2020,  "bar",   "major",
          2,   1999,  "foo",   NA,
          2,   2020,  "foo",   "colonel",
          3,   2002,  "bar",   "captain",
          3,   2010,  "bar",   "captain",
          4,   2003,  "foo",   NA,
          4,   2010,  "bar",   NA
          )

In case it is relevant, the ranks have a defined order:

us_army_officer_ranks <- c("second_lieutenant", 
                           "first_lieutenant", 
                           "captain", 
                           "major", 
                           "lieutenant_colonel", 
                           "colonel")
# colonel > lieutenant_colonel > major > captain > first_lieutenant > second_lieutenant
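
The merge itself does not need the rank order, but if the order ever matters (for instance, to compare ranks or pick the highest one), one option is to store army_rank as an ordered factor. The following is only a minimal sketch; the ranks vector is made-up illustration data, not part of the question's tables.

```r
us_army_officer_ranks <- c("second_lieutenant",
                           "first_lieutenant",
                           "captain",
                           "major",
                           "lieutenant_colonel",
                           "colonel")

# Hypothetical sample values, encoded as an ordered factor so that
# comparisons follow the rank order rather than alphabetical order
ranks <- factor(c("captain", "colonel", "first_lieutenant"),
                levels = us_army_officer_ranks,
                ordered = TRUE)

is_above_first_lt <- ranks > "first_lieutenant"  # element-wise comparison by rank
highest_rank      <- max(ranks)                  # highest rank present in the vector
```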

【Question comments】:

    Tags: r merge


    【Solution 1】:
    library(dplyr)
    library(tidyr)
    
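    # Full join so that rank-change years missing from df_main_info become rows too
    df_main_info %>%
      full_join(df_ranks_history, by = c("id", "year")) %>%
      group_by(id) %>%
      arrange(id, year) %>%                     # chronological order within each id
      fill(army_rank, .direction = "down") %>%  # carry the last known rank forward
      filter(!is.na(col_1))                     # drop rows that came only from df_ranks_history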
    # # A tibble: 10 × 4
    # # Groups:   id [4]
    #       id  year col_1 army_rank        
    #    <dbl> <dbl> <chr> <chr>            
    #  1     1  2005 bar   second_lieutenant
    #  2     1  2008 foo   second_lieutenant
    #  3     1  2010 blah  first_lieutenant 
    #  4     1  2020 bar   major            
    #  5     2  1999 foo   NA               
    #  6     2  2020 foo   colonel          
    #  7     3  2002 bar   captain          
    #  8     3  2010 bar   captain          
    #  9     4  2003 foo   NA               
    # 10     4  2010 bar   NA    
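
The pipeline above returns rows sorted by id and year, and its final filter() would also drop any main-table row whose col_1 is genuinely NA. As a possible alternative, here is a sketch of a rolling join with join_by() and closest(), which assumes dplyr 1.1.0 or later. The names rank_year and result are introduced here purely for illustration.

```r
library(dplyr)   # join_by() and closest() assume dplyr >= 1.1.0
library(tibble)

df_main_info <- tribble(
  ~id, ~year, ~col_1,
  1,   2008,  "foo",
  1,   2005,  "bar",
  1,   2010,  "blah",
  1,   2020,  "bar",
  2,   1999,  "foo",
  2,   2020,  "foo",
  3,   2002,  "bar",
  3,   2010,  "bar",
  4,   2003,  "foo",
  4,   2010,  "bar"
)

df_ranks_history <- tribble(
  ~id, ~year, ~army_rank,
  1,   2005,  "second_lieutenant",
  1,   2010,  "first_lieutenant",
  1,   2018,  "major",
  1,   2021,  "colonel",
  2,   2002,  "major",
  2,   2018,  "colonel",
  3,   1995,  "second_lieutenant",
  3,   2000,  "captain",
  3,   2012,  "colonel"
)

# For each main row, keep only the rank row of the same id whose year is
# the closest one at or before the main row's year
result <- df_main_info %>%
  left_join(
    rename(df_ranks_history, rank_year = year),  # avoid a year.x / year.y clash
    by = join_by(id, closest(year >= rank_year))
  ) %>%
  select(-rank_year)
```

Because this is a left join, the rows of df_main_info keep their original order and unmatched rows get NA in army_rank.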
    

    【Comments】:

      【Solution 2】:
      library(data.table)
      
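      # setDT() converts both data frames to data.tables by reference
      setDT(df_main_info)
      setDT(df_ranks_history)
      
      # Rolling join: for each row of df_main_info, take the latest rank at or before that year
      df_ranks_history[df_main_info, on = list(id, year), roll = +Inf]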
      
      #     id year         army_rank col_1
      #  1:  1 2008 second_lieutenant   foo
      #  2:  1 2005 second_lieutenant   bar
      #  3:  1 2010  first_lieutenant  blah
      #  4:  1 2020             major   bar
      #  5:  2 1999              <NA>   foo
      #  6:  2 2020           colonel   foo
      #  7:  3 2002           captain   bar
      #  8:  3 2010           captain   bar
      #  9:  4 2003              <NA>   foo
      # 10:  4 2010              <NA>   bar
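
Two details of the rolling join may be worth spelling out: setDT() modifies its argument in place, and the joined result lists army_rank before col_1. The sketch below builds separate data.table objects from a reduced subset of the question's data (main_dt, ranks_dt and rolled are names chosen here for illustration) and reorders the columns to match the desired output. It uses roll = TRUE, which data.table treats the same as roll = +Inf.

```r
library(data.table)

# Reduced subset of the question's data, built directly as data.tables
# so that no existing data frame is converted in place
main_dt <- data.table(
  id    = c(1, 1, 2, 3),
  year  = c(2008, 2005, 1999, 2002),
  col_1 = c("foo", "bar", "foo", "bar")
)

ranks_dt <- data.table(
  id        = c(1, 1, 2, 3, 3),
  year      = c(2005, 2010, 2002, 1995, 2000),
  army_rank = c("second_lieutenant", "first_lieutenant", "major",
                "second_lieutenant", "captain")
)

# Exact match on id, rolling match on year (the last column in `on`):
# the most recent rank at or before each main-table year is carried forward
rolled <- ranks_dt[main_dt, on = c("id", "year"), roll = TRUE]

# Put the columns in the same order as the desired output
setcolorder(rolled, c("id", "year", "col_1", "army_rank"))
```

The officer with id = 2 gets NA here because the only known rank dates from 2002, after the 1999 row in the main table.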
      

      【Comments】:
