【Question title】: Define sequences based on a variable run with additional condition from another variable
【Posted】: 2018-09-15 11:43:19
【Question】:
structure(list(group = c(NA, "A", "B", NA, "B", "B", "B", "B", 
"B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", 
"B", NA, NA, "B", "B", "A", "A", NA, NA, "B", "B", "B", NA, "A", 
"A", "A", "A", "A", "A", "A", "A", "A", "A", NA, NA, "B", "B", 
NA, "A"), seq_break = c(TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, 
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, 
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, 
TRUE, TRUE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE, FALSE, TRUE, 
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, 
FALSE, TRUE, TRUE, FALSE, FALSE, TRUE, FALSE)), .Names = c("group", 
"seq_break"), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, 
-50L))

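In the data above, I need to define a column containing a run-length type ID for the group column (as produced by data.table::rleid, but ignoring NA). As you can see, there is also a seq_break column, which should end a sequence. It usually does so when, for example, group = NA and seq_break = TRUE. But sometimes seq_break = TRUE while group is A or B. In that case the sequence should also end and a new one should start, even if the next row refers to the same group. So, for example, rows 25:26 should get two different sequence IDs even though both events refer to group B. In general, the expected output looks as follows: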

structure(list(group = c(NA, "A", "B", NA, "B", "B", "B", "B", 
"B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", 
"B", NA, NA, "B", "B", "A", "A", NA, NA, "B", "B", "B", NA, "A", 
"A", "A", "A", "A", "A", "A", "A", "A", "A", NA, NA, "B", "B", 
NA, "A"), seq_break = c(TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, 
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, 
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, 
TRUE, TRUE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE, FALSE, TRUE, 
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, 
FALSE, TRUE, TRUE, FALSE, FALSE, TRUE, FALSE), expected_output = c(NA, 
1, 2, NA, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 
3, NA, NA, 4, 5, 6, 6, NA, NA, 7, 7, 7, NA, 8, 8, 8, 8, 8, 8, 
8, 8, 8, 8, NA, NA, 11, 11, NA, 12)), .Names = c("group", "seq_break", 
"expected_output"), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, 
-50L))
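
To make the requirement concrete, here is a minimal sketch on shortened, hypothetical toy vectors (not the full data above) of why plain data.table::rleid is not enough on its own. It gives NA rows a run ID of their own and keeps the two consecutive B rows together, although each of them has seq_break = TRUE.

```r
library(data.table)

# Hypothetical toy vectors, shortened from the pattern in the data above
group     <- c(NA, "A", "B", NA, "B", "B", "A")
seq_break <- c(TRUE, FALSE, FALSE, TRUE, TRUE, TRUE, FALSE)

# Plain run-length IDs: NA is treated like any other value, and seq_break
# is not taken into account at all
plain_id <- rleid(group)
plain_id
# [1] 1 2 3 4 5 5 6

# What I am after for these rows instead: NA 1 2 NA 3 4 5
```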

How can I achieve this using tidyverse?

【Comments】:

    Tags: r dplyr tidyverse


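    【Solution 1】:

    A solution using tidyverse and data.table. Assume dt1 is your example data frame and dt3 is the final output. Note that I think rows 47 to 48 of the expected output should be 9 and row 50 should be 10. I am not sure why rows 47 to 48 are 11 and row 50 is 12 in your expected output.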

    library(tidyverse)
    library(data.table)
    
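    # Add a row number so the IDs can be joined back to the full data later
    dt2 <- dt1 %>% rowid_to_column()
    
    dt3 <- dt2 %>%
      # Run ID over consecutive identical (group, seq_break) pairs
      mutate(ID = rleid(group, seq_break)) %>%
      group_by(group, seq_break, ID) %>%
      # Within each run of NA/TRUE rows, keep only the first row
      filter(!(is.na(group) & seq_break & row_number() > 1)) %>%
      ungroup() %>%
      # Counter that increases at every remaining break row
      mutate(ID2 = cumsum(seq_break)) %>%
      # Drop NA rows so that they do not receive a sequence ID
      drop_na(group) %>%
      # A new sequence starts whenever group or the break counter changes
      mutate(expected_output = rleid(group, ID2)) %>%
      select(rowid, expected_output) %>%
      # Join back to all 50 rows; rows without an ID get NA
      left_join(dt2, ., by = "rowid") %>%
      select(-rowid)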
    
    dt3
    # # A tibble: 50 x 3
    #    group seq_break expected_output
    #    <chr> <lgl>               <int>
    #  1 NA    TRUE                   NA
    #  2 A     FALSE                   1
    #  3 B     FALSE                   2
    #  4 NA    TRUE                   NA
    #  5 B     FALSE                   3
    #  6 B     FALSE                   3
    #  7 B     FALSE                   3
    #  8 B     FALSE                   3
    #  9 B     FALSE                   3
    # 10 B     FALSE                   3
    # # ... with 40 more rows
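
The same idea (a break counter from cumsum(seq_break) combined with rleid on the non-NA rows) can probably be written more compactly, without the filter and join steps. The following is only a sketch under that assumption: toy and seq_id are hypothetical names, the data is a shortened made-up example rather than dt1, and I have not run it on the full 50-row data.

```r
library(dplyr)
library(data.table)

# Hypothetical toy data following the pattern of the question's data
toy <- data.frame(
  group     = c(NA, "A", "B", NA, "B", "B", "A", NA, NA, "B"),
  seq_break = c(TRUE, FALSE, FALSE, TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, FALSE),
  stringsAsFactors = FALSE
)

out <- toy %>%
  mutate(
    brk    = cumsum(seq_break),   # counter that increases at every break row
    keep   = !is.na(group),       # only non-NA rows receive a sequence ID
    # Run IDs over (group, brk) on the kept rows; NA rows stay NA
    seq_id = replace(rep(NA_integer_, n()), keep,
                     rleid(group[keep], brk[keep]))
  ) %>%
  select(-brk, -keep)

out$seq_id
# [1] NA  1  2 NA  3  4  5 NA NA  6
```

Rows 5 and 6 of the toy data are both B with seq_break = TRUE, so they get different IDs (3 and 4), which mirrors rows 25:26 of the question.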
    

    【Comments】:
