Calculate time between events with multiple observations using dplyr

【Question title】: Calculate time between events with multiple observations using dplyr
【Posted】: 2022-01-19 23:39:48
【Question】:

I have data in the following format:

    mydata <- data.frame(id=c(1,1,1,2,2,2,2),event=c(1,1,2,1,2,2,3), time=c(2,2,3,6,8,8,11))
                         
    mydata

  id event time
1  1     1    2
2  1     1    2
3  1     2    3
4  2     1    6
5  2     2    8
6  2     2    8
7  2     3   11

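I want to calculate the time between consecutive events, but I'm having trouble because some events have multiple observations. The result column should look like this: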

  id event time event_dt
1  1     1    2        0
2  1     1    2        0
3  1     2    3        1
4  2     1    6        0
5  2     2    8        2
6  2     2    8        2
7  2     3   11        3

I would like to do this with dplyr if possible.

Tags: r dplyr

【Solution 1】:

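You might try map_dbl from purrr. For each event, take the time of that event and subtract the time of event - 1 within the same id. Events with no previous event produce NA, which is then replaced with zero. This assumes that event numbers are consecutive and that all rows of a given event share the same time.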

library(tidyverse)

mydata %>%
  group_by(id) %>%
  mutate(event_dt = map_dbl(event, ~time[event == .x][1] - time[event == .x - 1][1])) %>%
  replace_na(list(event_dt = 0))

Output

     id event  time event_dt
  <dbl> <dbl> <dbl>    <dbl>
1     1     1     2        0
2     1     1     2        0
3     1     2     3        1
4     2     1     6        0
5     2     2     8        2
6     2     2     8        2
7     2     3    11        3
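The approach above relies on event numbers being consecutive. As a minimal sketch of one way to relax that assumption (not part of the original answer), the differences can be computed on the first row of each event and then mapped back to every row by position. It assumes rows are already ordered by time within each id; the name `result` is only illustrative.

```r
library(dplyr)

mydata <- data.frame(
  id    = c(1, 1, 1, 2, 2, 2, 2),
  event = c(1, 1, 2, 1, 2, 2, 3),
  time  = c(2, 2, 3, 6, 8, 8, 11)
)

result <- mydata %>%
  group_by(id) %>%
  mutate(
    # time[!duplicated(event)] keeps the first time of each event;
    # c(0, diff(...)) gives the gap to the previous event (0 for the first);
    # match(event, unique(event)) maps each row to its event's position
    event_dt = c(0, diff(time[!duplicated(event)]))[match(event, unique(event))]
  ) %>%
  ungroup()

result
```

Because rows are matched by position among the distinct events rather than by `event - 1`, event labels such as 1, 3, 7 would work the same way, provided the row order reflects the order of events.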

【Solution 2】:

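I take the minimum time per event in case a given event is recorded with more than one time (if that is possible). You can then use lag to get the time difference between consecutive events and join the result back to the original data frame.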

library(tidyverse)

mydata %>%
  dplyr::group_by(id, event) %>%
  dplyr::mutate_at(vars("time"), min) %>% 
  dplyr::distinct() %>% 
  dplyr::ungroup(event) %>% 
  dplyr::mutate(event_dt = time - lag(time)) %>% 
  dplyr::left_join(., mydata, by = c("id", "event", "time")) %>% 
  tidyr::replace_na(., list(event_dt=0))

Output

# A tibble: 7 × 4
# Groups:   id [2]
     id event  time event_dt
  <dbl> <dbl> <dbl>    <dbl>
1     1     1     2        0
2     1     1     2        0
3     1     2     3        1
4     2     1     6        0
5     2     2     8        2
6     2     2     8        2
7     2     3    11        3

Data

mydata <- structure(list(
  id = c(1, 1, 1, 2, 2, 2, 2),
  event = c(1, 1, 2, 1, 2, 2, 3),
  time = c(2, 2, 3, 6, 8, 8, 11)
),
class = "data.frame",
row.names = c(NA,-7L))
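One caveat, offered as an observation rather than as part of the original answer: joining back on id, event and time only matches rows whose time equals the per-event minimum, so rows recorded at a later time for the same event would not be matched. A minimal sketch that joins on id and event only is shown below. The data frame `mydata2` is hypothetical (event 2 of id 2 is recorded at times 8 and 9), and the sketch assumes event numbers sort in chronological order within each id.

```r
library(dplyr)

# Hypothetical data: event 2 of id 2 has two different times (8 and 9)
mydata2 <- data.frame(
  id    = c(1, 1, 1, 2, 2, 2, 2),
  event = c(1, 1, 2, 1, 2, 2, 3),
  time  = c(2, 2, 3, 6, 8, 9, 11)
)

# One row per id/event with its earliest time, then the gap to the previous event
event_times <- mydata2 %>%
  group_by(id, event) %>%
  summarise(first_time = min(time), .groups = "drop_last") %>%  # still grouped by id
  mutate(event_dt = coalesce(first_time - lag(first_time), 0)) %>%
  ungroup() %>%
  select(id, event, event_dt)

# Join on id and event only, so every original row keeps its own time
result2 <- mydata2 %>%
  left_join(event_times, by = c("id", "event"))

result2
```

Starting the join from the original data also keeps all seven rows in their original order.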

【Solution 3】:

Calculate the differences between the unique id/event/time combinations, then merge them back onto the original data:

mydata %>% 
  distinct(id, event, time) %>%
  group_by(id) %>%
  mutate(event_dt = c(0, diff(time))) %>%
  right_join(mydata)

#Joining, by = c("id", "event", "time")
## A tibble: 7 x 4
## Groups:   id [2]
#     id event  time event_dt
#  <dbl> <dbl> <dbl>    <dbl>
#1     1     1     2        0
#2     1     1     2        0
#3     1     2     3        1
#4     2     1     6        0
#5     2     2     8        2
#6     2     2     8        2
#7     2     3    11        3
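To make the `c(0, diff(time))` step above easier to follow, here is a small sketch that shows the padding in isolation and then repeats the pipeline with the join keys spelled out, which avoids the "Joining, by" message. The names `times_id2`, `padded` and `result3` are only illustrative.

```r
library(dplyr)

mydata <- data.frame(
  id    = c(1, 1, 1, 2, 2, 2, 2),
  event = c(1, 1, 2, 1, 2, 2, 3),
  time  = c(2, 2, 3, 6, 8, 8, 11)
)

# diff() returns one element fewer than its input, so a leading 0
# stands in for the first event, which has no predecessor
times_id2 <- c(6, 8, 11)
padded <- c(0, diff(times_id2))
padded
# 0 2 3

result3 <- mydata %>%
  distinct(id, event, time) %>%          # one row per id/event/time
  group_by(id) %>%
  mutate(event_dt = c(0, diff(time))) %>%
  ungroup() %>%
  right_join(mydata, by = c("id", "event", "time"))  # explicit keys

result3
```

Like the original, this relies on the rows being sorted by time within each id before `diff()` is applied.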
