Identifying the first rows in a data frame grouped by an ID and date: answers

【Question title】: Identifying the first rows in a data frame grouped by an ID and date
【Posted】: 2021-04-21 14:09:30
【Question description】:

I have a dataset similar to the following:

dt = structure(list(ID = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 
4, 5, 5, 6, 6, 6, 6), date = structure(c(1332288000, 1332288000, 
1360540800, 1384819200, 1384819200, 1325548800, 1326499200, 1365292800, 
1365292800, 1365292800, 1400284800, 1442966400, 1450051200, 1404864000, 
1330387200, 1330387200, 1366329600, 1366329600, 1412467200, 1412467200
), class = c("POSIXct", "POSIXt"), tzone = "UTC"), type = c("A", 
"C", "B", "A", "B", "C", "C", "A", "B", "C", "C", "A", "A", "C", 
"C", "C", "C", "B", "B", "A")), row.names = c(NA, -20L), class = c("tbl_df", 
"tbl", "data.frame"))

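Each row records an event of a particular type (type) for a unique individual (ID), together with the date on which they appeared in the system (date). The rows are sorted first by ID and then by date. As you can see, an individual can appear on multiple dates, and there can be several event types within each date.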

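I am trying to create an additional column (first) that flags the first date on which an individual appears, marking with a "1" every row that corresponds to their first date of appearance, not just the first row in which they appear. This is what I am after: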

    ID       date type first
 1:  1 2012-03-21    A     1
 2:  1 2012-03-21    C     1
 3:  1 2013-02-11    B     0
 4:  1 2013-11-19    A     0
 5:  1 2013-11-19    B     0
 6:  2 2012-01-03    C     1
 7:  2 2012-01-14    C     0
 8:  2 2013-04-07    A     0
 9:  2 2013-04-07    B     0
10:  2 2013-04-07    C     0
11:  2 2014-05-17    C     0
12:  3 2015-09-23    A     1
13:  3 2015-12-14    A     0
14:  4 2014-07-09    C     1
15:  5 2012-02-28    C     1
16:  5 2012-02-28    C     1
17:  6 2013-04-19    C     1
18:  6 2013-04-19    B     1
19:  6 2014-10-05    B     0
20:  6 2014-10-05    A     0

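I have seen solutions for identifying the first occurrence/row, for example here and here. But these are not what I am after, because I am grouping by both ID and date. I tried using the duplicated function from data.table while grouping by ID and date, but that flags the first row of every unique ID/date combination instead: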

df[!duplicated(df, by=c("ID", "date")), first := 1]

Any help would be greatly appreciated, especially solutions using data.table or base R.

Thanks in advance

【Question comments】:

Tags: r data.table

【Solution 1】:

For each ID, assign 1 to first where the date is the same as the first date. This can be written as:

library(dplyr)

dt %>%
  group_by(ID) %>%
  mutate(first = as.integer(as.Date(date) == first(as.Date(date)))) %>%
  ungroup

In data.table:

library(data.table)
setDT(dt)[, first := as.integer(as.Date(date) == first(as.Date(date))), ID]
dt

#    ID       date type first
# 1:  1 2012-03-21    A     1
# 2:  1 2012-03-21    C     1
# 3:  1 2013-02-11    B     0
# 4:  1 2013-11-19    A     0
# 5:  1 2013-11-19    B     0
# 6:  2 2012-01-03    C     1
# 7:  2 2012-01-14    C     0
# 8:  2 2013-04-07    A     0
# 9:  2 2013-04-07    B     0
#10:  2 2013-04-07    C     0
#11:  2 2014-05-17    C     0
#12:  3 2015-09-23    A     1
#13:  3 2015-12-14    A     0
#14:  4 2014-07-09    C     1
#15:  5 2012-02-28    C     1
#16:  5 2012-02-28    C     1
#17:  6 2013-04-19    C     1
#18:  6 2013-04-19    B     1
#19:  6 2014-10-05    B     0
#20:  6 2014-10-05    A     0
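
One caveat worth noting: `first()` takes whatever row happens to come first within each ID, so this approach relies on the rows already being sorted by date, as the question says they are. Below is a minimal sketch of how `first()` and `min()` differ when that does not hold. The table `toy` and the columns `by_first` and `by_min` are made up for illustration and are not part of the asker's data.

```r
library(data.table)

# Hypothetical toy data: the rows for ID 1 are deliberately out of date order
toy <- data.table(
  ID   = c(1, 1, 1, 2),
  date = as.Date(c("2013-02-11", "2012-03-21", "2012-03-21", "2014-07-09"))
)

# first() flags rows matching the date of the first row per ID, whatever it is
toy[, by_first := as.integer(date == first(date)), by = ID]

# min() flags rows matching the earliest date per ID, regardless of row order
toy[, by_min := as.integer(date == min(date)), by = ID]

toy
```

Here `by_first` flags only the 2013-02-11 row for ID 1, while `by_min` flags the two 2012-03-21 rows. Sorting beforehand, for example with `setorder(toy, ID, date)`, should make the two agree.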

【Comments】:

Thanks. I used your dplyr solution and adapted it for use with data.table: df[, first := as.integer(as.Date(date) == first(as.Date(date))), by = "ID"]

【Solution 2】:

Here is a data.table approach:

library(data.table)
setDT(dt)
dt[,first := fifelse(date == min(date), 1, 0), by = "ID"]
#    ID       date type first
# 1:  1 2012-03-21    A     1
# 2:  1 2012-03-21    C     1
# 3:  1 2013-02-11    B     0
# 4:  1 2013-11-19    A     0
# 5:  1 2013-11-19    B     0
# 6:  2 2012-01-03    C     1
# 7:  2 2012-01-14    C     0
# 8:  2 2013-04-07    A     0
# 9:  2 2013-04-07    B     0
#10:  2 2013-04-07    C     0
#11:  2 2014-05-17    C     0
#12:  3 2015-09-23    A     1
#13:  3 2015-12-14    A     0
#14:  4 2014-07-09    C     1
#15:  5 2012-02-28    C     1
#16:  5 2012-02-28    C     1
#17:  6 2013-04-19    C     1
#18:  6 2013-04-19    B     1
#19:  6 2014-10-05    B     0
#20:  6 2014-10-05    A     0
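
Since the question also asks about base R, the same comparison against each ID's minimum date can probably be sketched with `ave()`. This is only one possible approach, not something posted in the thread. The data frame `df` below is a hand-built subset of the question's first seven rows, and `secs` is a helper name introduced here.

```r
# Toy subset of the question's data (first 7 rows), rebuilt by hand
df <- data.frame(
  ID   = c(1, 1, 1, 1, 1, 2, 2),
  date = as.POSIXct(c("2012-03-21", "2012-03-21", "2013-02-11", "2013-11-19",
                      "2013-11-19", "2012-01-03", "2012-01-14"), tz = "UTC"),
  type = c("A", "C", "B", "A", "B", "C", "C")
)

# Compare dates as plain numbers (seconds since the epoch) to keep ave() simple
secs <- as.numeric(df$date)

# ave() returns, for every row, the minimum value within that row's ID group
df$first <- as.integer(secs == ave(secs, df$ID, FUN = min))

df
```

Converting to numeric first is a cautious choice: it sidesteps any question of how `ave()` treats date-time classes, at the cost of one extra line.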

【Comments】:

【Solution 3】:

Here is another data.table solution/approach. The + sign converts TRUE and FALSE to 1 and 0.

library(data.table)

setDT(dt)[, first := +(date == min(date)), by=ID]

#        ID       date   type first
#  1:     1 2012-03-21      A     1
#  2:     1 2012-03-21      C     1
#  3:     1 2013-02-11      B     0
#  4:     1 2013-11-19      A     0
#  5:     1 2013-11-19      B     0
#  6:     2 2012-01-03      C     1
#  7:     2 2012-01-14      C     0
#  8:     2 2013-04-07      A     0
#  9:     2 2013-04-07      B     0
# 10:     2 2013-04-07      C     0
# 11:     2 2014-05-17      C     0
# 12:     3 2015-09-23      A     1
# 13:     3 2015-12-14      A     0
# 14:     4 2014-07-09      C     1
# 15:     5 2012-02-28      C     1
# 16:     5 2012-02-28      C     1
# 17:     6 2013-04-19      C     1
# 18:     6 2013-04-19      B     1
# 19:     6 2014-10-05      B     0
# 20:     6 2014-10-05      A     0
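
To make the `+` trick concrete, here is a small sketch comparing it with the other two ways of producing the flag used in this thread. The vector `flag` and the other variable names are invented for the example.

```r
library(data.table)

flag <- c(TRUE, FALSE, TRUE)

plus_flag   <- +flag                # unary plus coerces logical to integer
int_flag    <- as.integer(flag)     # the explicit spelling of the same coercion
ifelse_flag <- fifelse(flag, 1, 0)  # 1 and 0 are doubles, so the result is double

class(plus_flag)    # "integer"
class(ifelse_flag)  # "numeric"
```

The values are the same either way; the only difference is the storage type. If an integer column matters, writing `fifelse(flag, 1L, 0L)` should give one as well.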

【Comments】: