【Question title】:Group observations chronologically and by group R / data.table [duplicate]
【Posted】:2023-03-31 20:54:01
【Question】:

I have a dataset that looks like this:

library(data.table)

dt <- 
  data.table(
  student = c(rep(1, 8), rep(2, 8)), 
  year = rep(2001:2008, 2),
  track = c(rep("Highschool", 3), rep("Vocational", 2), rep("Uni", 1), rep("Vocational", 2), 
            rep("Vocational", 2), rep("Highschool", 4), rep("Vocational", 2))
)

#  student year      track
# 1:       1 2001 Highschool
# 2:       1 2002 Highschool
# 3:       1 2003 Highschool
# 4:       1 2004 Vocational
# 5:       1 2005 Vocational
# 6:       1 2006        Uni
# 7:       1 2007 Vocational
# 8:       1 2008 Vocational
# 9:       2 2001 Vocational
#10:       2 2002 Vocational
#11:       2 2003 Highschool
#12:       2 2004 Highschool
#13:       2 2005 Highschool
#14:       2 2006 Highschool
#15:       2 2007 Vocational
#16:       2 2008 Vocational

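As you can see, the data tracks, in chronological order, the type of education each student attended in a given year. I want to assign an identifier to each consecutive spell in a track, numbered per student in chronological order, so that returning to an earlier track starts a new number. So I would like my data.table to look like this: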

dt[, tracker := c(rep(1, 3), rep(2, 2), rep(3, 1), rep(4, 2), 
                  rep(1, 2), rep(2, 4), rep(3, 2))]
#    student year      track tracker
# 1:       1 2001 Highschool       1
# 2:       1 2002 Highschool       1
# 3:       1 2003 Highschool       1
# 4:       1 2004 Vocational       2
# 5:       1 2005 Vocational       2
# 6:       1 2006        Uni       3
# 7:       1 2007 Vocational       4
# 8:       1 2008 Vocational       4
# 9:       2 2001 Vocational       1
#10:       2 2002 Vocational       1
#11:       2 2003 Highschool       2
#12:       2 2004 Highschool       2
#13:       2 2005 Highschool       2
#14:       2 2006 Highschool       2
#15:       2 2007 Vocational       3
#16:       2 2008 Vocational       3

I have come up with the following solution:

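# Flag rows whose track differs from the previous row of the same student;
# fill = track[1L] compares each student's first row with itself (helper = 0)
dt[, helper := as.integer(track != shift(track, fill = track[1L])), by = student]
# The running count of changes, plus one, numbers the spells per student
dt[, tracker := cumsum(helper) + 1, by = "student"]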

dt
# student year      track helper tracker
# 1:       1 2001 Highschool      0       1
# 2:       1 2002 Highschool      0       1
# 3:       1 2003 Highschool      0       1
# 4:       1 2004 Vocational      1       2
# 5:       1 2005 Vocational      0       2
# 6:       1 2006        Uni      1       3
# 7:       1 2007 Vocational      1       4
# 8:       1 2008 Vocational      0       4
# 9:       2 2001 Vocational      0       1
#10:       2 2002 Vocational      0       1
#11:       2 2003 Highschool      1       2
#12:       2 2004 Highschool      0       2
#13:       2 2005 Highschool      0       2
#14:       2 2006 Highschool      0       2
#15:       2 2007 Vocational      1       3
#16:       2 2008 Vocational      0       3
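
One thing to watch with a shift-based helper: if the comparison is computed without grouping, the first row of a student is compared with the last row of the previous student. In the data above those two tracks happen to match, so nothing goes wrong. The sketch below uses a hypothetical table `toy` (not part of the question) where they differ, to show how the ungrouped and grouped variants diverge.

```r
library(data.table)

# Hypothetical toy data: student 2 starts on a different track
# than the one student 1 ended on
toy <- data.table(
  student = c(1, 1, 2, 2),
  track   = c("A", "B", "A", "A")
)

# Ungrouped comparison: row 3 is compared with row 2, which belongs to student 1
toy[, helper_global := as.integer(track != shift(track, fill = track[1L]))]

# Grouped comparison: each student's first row is compared with itself
toy[, helper_grouped := as.integer(track != shift(track, fill = track[1L])),
    by = student]

toy[, tracker_global  := cumsum(helper_global)  + 1L, by = student]
toy[, tracker_grouped := cumsum(helper_grouped) + 1L, by = student]

# tracker_global starts student 2 at 2; tracker_grouped starts it at 1
print(toy)
```

Computing the helper with `by = student` keeps each student's numbering starting at 1 regardless of what the neighbouring student's rows contain.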

Now I am wondering: is there a more "direct" way to achieve this with data.table, dplyr, or base R syntax?


    Tags: r dplyr data.table


    【Solution 1】:

    data.table::rleid(): consecutive runs of identical values are assigned to the same group.

    dt[, tracker := rleid(track), by = student]
    
        student year      track tracker
     1:       1 2001 Highschool       1
     2:       1 2002 Highschool       1
     3:       1 2003 Highschool       1
     4:       1 2004 Vocational       2
     5:       1 2005 Vocational       2
     6:       1 2006        Uni       3
     7:       1 2007 Vocational       4
     8:       1 2008 Vocational       4
     9:       2 2001 Vocational       1
    10:       2 2002 Vocational       1
    11:       2 2003 Highschool       2
    12:       2 2004 Highschool       2
    13:       2 2005 Highschool       2
    14:       2 2006 Highschool       2
    15:       2 2007 Vocational       3
    16:       2 2008 Vocational       3
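
To see what `rleid()` does in isolation, here is a small sketch on a plain character vector, next to an equivalent built from base R's `rle()`. The vector `x` is made up for illustration.

```r
library(data.table)

# Made-up example vector: "Vocational" appears in two separate runs
x <- c("Highschool", "Highschool", "Vocational", "Uni", "Vocational")

# rleid() numbers consecutive runs, so the second "Vocational" run gets a new id
ids <- rleid(x)

# Base R equivalent: repeat each run's index by the length of that run
r <- rle(x)
ids_base <- rep(seq_along(r$lengths), r$lengths)

print(ids)       # 1 1 2 3 4
print(ids_base)  # 1 1 2 3 4
```

Adding `by = student` in the data.table call restarts this numbering for every student.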
    

    Without rleid(), just for fun:

    dt[, tracker := cumsum(shift(track, fill = track[1]) != track) + 1L, by = student]
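
The one-liner above can be read as three steps. The sketch below splits them apart on a made-up vector `track`; the intermediate names `prev` and `changed` are only for illustration.

```r
library(data.table)

# Made-up example vector for a single student
track <- c("Highschool", "Highschool", "Vocational", "Uni", "Vocational")

# Step 1: previous value; fill = track[1L] makes the first element
# compare equal to itself
prev <- shift(track, fill = track[1L])

# Step 2: TRUE wherever a new run starts
changed <- prev != track

# Step 3: running count of run starts, offset so numbering begins at 1
tracker <- cumsum(changed) + 1L

print(tracker)  # 1 1 2 3 4
```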
    


      【Solution 2】:

      In base R:

      dt$tracker <- unsplit(
        tapply(dt$track, dt$student,
               function(x) c(1, 1 + cumsum(diff(as.numeric(factor(x))) != 0))),
        dt$student
      )
      

      Output:

            student year      track tracker
       1:       1 2001 Highschool       1
       2:       1 2002 Highschool       1
       3:       1 2003 Highschool       1
       4:       1 2004 Vocational       2
       5:       1 2005 Vocational       2
       6:       1 2006        Uni       3
       7:       1 2007 Vocational       4
       8:       1 2008 Vocational       4
       9:       2 2001 Vocational       1
      10:       2 2002 Vocational       1
      11:       2 2003 Highschool       2
      12:       2 2004 Highschool       2
      13:       2 2005 Highschool       2
      14:       2 2006 Highschool       2
      15:       2 2007 Vocational       3
      16:       2 2008 Vocational       3
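
Another base R variant, offered only as a sketch: number the runs with `rle()` and apply that per student with `ave()`. The helper name `run_id` and the plain data frame `df` are assumptions for illustration; `df` rebuilds the question's student and track columns without data.table.

```r
# Rebuild the question's student and track columns as a plain data frame
df <- data.frame(
  student = c(rep(1, 8), rep(2, 8)),
  track = c(rep("Highschool", 3), rep("Vocational", 2), "Uni", rep("Vocational", 2),
            rep("Vocational", 2), rep("Highschool", 4), rep("Vocational", 2)),
  stringsAsFactors = FALSE
)

# Hypothetical helper: index of the consecutive run each element belongs to
run_id <- function(x) {
  r <- rle(x)
  rep(seq_along(r$lengths), r$lengths)
}

# ave() is given row indices so that the result stays integer;
# within each student the indices are used to look up the track values
df$tracker <- ave(seq_len(nrow(df)), df$student,
                  FUN = function(i) run_id(df$track[i]))

print(df)
```

Passing row indices rather than `df$track` itself avoids `ave()` coercing the run numbers back to character.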
      
