【Question title】: Merge consecutive specific records and keep some values of the removed ones in R
【Posted】: 2023-03-20 18:05:01
【Question】:

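I have a huge dataset with four columns: user_id, action, start_time and end_time. I want to merge consecutive rows whose action is "o": the merged row should take the start_time of the first merged record and the end_time of the last merged record.
Suppose df is: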

"user_id","action","start_time","end_time"
"11","o",23:25:27,23:25:49
"11","o",23:25:28,23:25:28
"11","o",23:25:48,23:26:50
"11","v",23:25:49,23:25:49
"11","v",23:25:49,23:25:50
"11","o",23:28:24,00:22:33
"11","o",00:10:48,00:23:44
"22","o",00:11:52,00:22:33
"22","o",00:22:32,00:27:44
"22","v",00:22:42,00:22:42
"22","o",00:22:42,00:22:42
"22","z",00:22:42,00:22:43

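I want to merge rows 1, 2 and 3 because they all have action "o"; the merged row takes the start_time of the first row and the end_time of the last (third) row. The same applies to rows 6 and 7, and to rows 8 and 9.
So the desired output is: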

    "user_id","action","start_time","end_time"
    "11","o",23:25:27,23:26:50
    "11","v",23:25:49,23:25:49
    "11","v",23:25:49,23:25:50
    "11","o",23:28:24,00:23:44
    "22","o",00:11:52,00:27:44
    "22","v",00:22:42,00:22:42
    "22","o",00:22:42,00:22:42
    "22","z",00:22:42,00:22:43   

How can I do this in R? Thanks.

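【Comments】:

  • I think there is an error in your desired output. Rows 4 and 5 of the input should be merged too, so the sequence of action values in your desired output should be: o, v, o, o, v, o, z, I think.

Tags: r loops merge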


【Solution 1】:

If you don't mind a data.table solution:

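library(data.table)
setDT(df)
# Group by run of identical actions (rleid), user and action.
df[, {
    if (action[1L]=="o") {
        # "o" run: keep the first start_time and the last end_time
        .(start_time=start_time[1L], end_time=end_time[.N])
    } else {
        # any other action: keep every row unchanged
        .(start_time, end_time)
    }
}, by=.(rleid(action), user_id, action)][, -1L]  # drop the rleid column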

#   user_id action start_time end_time
#1:      11      o   23:25:27 23:26:50
#2:      11      v   23:25:49 23:25:49
#3:      11      v   23:25:49 23:25:50
#4:      11      o   23:28:24 00:23:44
#5:      22      o   00:11:52 00:27:44
#6:      22      v   00:22:42 00:22:42
#7:      22      o   00:22:42 00:22:42
#8:      22      z   00:22:42 00:22:43

Data:

df <- read.csv(text='"user_id","action","start_time","end_time"
"11","o",23:25:27,23:25:49
"11","o",23:25:28,23:25:28
"11","o",23:25:48,23:26:50
"11","v",23:25:49,23:25:49
"11","v",23:25:49,23:25:50
"11","o",23:28:24,00:22:33
"11","o",00:10:48,00:23:44
"22","o",00:11:52,00:22:33
"22","o",00:22:32,00:27:44
"22","v",00:22:42,00:22:42
"22","o",00:22:42,00:22:42
"22","z",00:22:42,00:22:43')
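
The grouping in the code above relies on a run id: a counter that increases each time the value changes from one row to the next. As a rough illustration of that idea, here is a minimal base-R sketch that builds such an id over both user_id and action for the question's twelve rows. The names key, new_run and run_id are made up for this example, and this is not how data.table implements rleid internally.

```r
# The user_id and action columns from the question's data.
user_id <- c(11, 11, 11, 11, 11, 11, 11, 22, 22, 22, 22, 22)
action  <- c("o", "o", "o", "v", "v", "o", "o", "o", "o", "v", "o", "z")

# Combine both columns into one key per row.
key <- paste(user_id, action)

# A new run starts on the first row and whenever the key differs
# from the key of the previous row.
new_run <- c(TRUE, key[-1] != key[-length(key)])

# The cumulative sum of the "new run" flags is the run id.
run_id <- cumsum(new_run)
run_id
# 1 1 1 2 2 3 3 4 4 5 6 7
```

Rows 7 and 8 both have action "o" but belong to different users, so they fall into different runs (3 and 4). Solution 1 gets the same separation by passing user_id as an extra grouping column next to rleid(action).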

【Comments】:

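    【Solution 2】:

    My process: first we add a run-length-encoding ID with rleid, which lets us treat each run of actions as a separate group. Next we add two temporary columns, st and et, holding the start and end time of each group. Then we filter: we keep every row whose action is not "o", and for "o" runs we keep only the first row. Then, in the groups where the action is "o", we replace start_time and end_time with the temporary columns. Finally we select only the columns you want in the final table.

    This should work for runs of "o" actions of any length. I'm sure there is a better way to do the last mutate, but I wanted to put this out there.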

    library(data.table)
    library(dplyr)
    df  %>% 
      mutate(rlid = rleid(user_id,action)) %>% 
      group_by(rlid) %>% 
      mutate(st = start_time[1], et = end_time[n()]) %>%
      filter(action!="o" | row_number()==1) %>% 
      mutate(start_time = case_when(action=="o"~st,
                                    action!="o"~start_time),
             end_time = case_when(action=="o"~et,
                                  action!="o"~end_time)) %>% 
      ungroup() %>% 
      select(user_id:end_time)
    
    # # A tibble: 8 x 4
    #   user_id action start_time end_time
    #     <int> <fct>  <fct>      <fct>   
    # 1      11 o      23:25:27   23:26:50
    # 2      11 v      23:25:49   23:25:49
    # 3      11 v      23:25:49   23:25:50
    # 4      11 o      23:28:24   00:23:44
    # 5      22 o      00:11:52   00:27:44
    # 6      22 v      00:22:42   00:22:42
    # 7      22 o      00:22:42   00:22:42
    # 8      22 z      00:22:42   00:22:43
    

    【Comments】:

    • Thanks, but when I run this on my data, end_time is completely wrong. It seems to generate random datetimes!
    【Solution 3】:

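    Thanks to @jasbner for suggesting data.table::rleid; here is a 99% tidyverse solution:

    Using data.table::rleid, we can give each sequential group a unique ID. After that it is simply a matter of grouping by rlid and using summarize to find the earliest start_time and the latest end_time. By default summarize drops all other variables, so you have to keep them explicitly as shown below. Finally we drop the rlid variable so the result matches your example, but it may be worth keeping for later use.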

    library(dplyr)
    library(data.table)
    
    df  %>% 
        mutate(rlid = data.table::rleid(user_id,action)) %>% 
        group_by(rlid) %>%
        summarize(user_id = user_id[1],
                  action = action[1],
                  start_time = min(start_time),
                  end_time = max(end_time)) %>%
        select(-rlid)
    
      user_id action start_time end_time
        <int> <chr>  <chr>      <chr>   
    1      11 o      23:25:27   23:26:50
    2      11 v      23:25:49   23:25:50
    3      11 o      00:10:48   00:23:44
    4      22 o      00:11:52   00:27:44
    5      22 v      00:22:42   00:22:42
    6      22 o      00:22:42   00:22:42
    7      22 z      00:22:42   00:22:43
    

    This approach collapses any number of repeated rows and is (I think) easier to understand than the pure data.table approach.
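
The output shown for this approach has seven rows rather than the eight in the question's desired output. The two "v" rows are collapsed as well, and min() on the time strings picks 00:10:48 instead of the first start_time, 23:28:24, of the run that crosses midnight. Below is a sketch of a variant, assuming the goal is to collapse only "o" runs and to take the first start_time and last end_time in row order. The helper column grp is a name invented for this example, and the code assumes dplyr and data.table are installed.

```r
library(dplyr)

df <- read.csv(text = '"user_id","action","start_time","end_time"
"11","o",23:25:27,23:25:49
"11","o",23:25:28,23:25:28
"11","o",23:25:48,23:26:50
"11","v",23:25:49,23:25:49
"11","v",23:25:49,23:25:50
"11","o",23:28:24,00:22:33
"11","o",00:10:48,00:23:44
"22","o",00:11:52,00:22:33
"22","o",00:22:32,00:27:44
"22","v",00:22:42,00:22:42
"22","o",00:22:42,00:22:42
"22","z",00:22:42,00:22:43', stringsAsFactors = FALSE)

merged <- df %>%
  mutate(
    # run id over (user_id, action), as in the answer above
    rlid = data.table::rleid(user_id, action),
    # start a new group on every non-"o" row and at the start of every run,
    # so that only consecutive "o" rows of one user share a group
    grp = cumsum(action != "o" | rlid != lag(rlid, default = 0L))
  ) %>%
  group_by(grp) %>%
  summarize(user_id    = first(user_id),
            action     = first(action),
            start_time = first(start_time),  # first row of the group
            end_time   = last(end_time)) %>% # last row of the group
  select(-grp)

merged
```

On the sample data this should give the eight rows of the desired output, including 23:28:24 to 00:23:44 for the "o" run that spans midnight. Because first() and last() depend on row order, the data must already be sorted the way the runs are meant to be read.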

    【Comments】:
