【Question title】: How to create a sequence column based on sequences' starts and ends
【Posted】: 2019-02-13 10:22:46
【Question description】:

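I have two columns that contain information about where sequences start and end. I would like to create a sequence column from them: each sequence starts on a row where seq_start = 1 and ends on the first row at or after that start where seq_end = 1. How can I do this with tidyverse? The data looks like this, where seq is the expected output. Note that when seq_end = 1 and seq_start = 1 occur in the same row, the result is a sequence of length 1.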

structure(list(seq_start = c(NA, NA, NA, NA, NA, 1, NA, NA, NA, 
NA, NA, 1, NA, 1, NA, NA, NA, NA, NA, NA, 1, 1, NA, NA, NA, NA, 
NA, 1, 1, NA, NA, 1, NA, NA, NA, 1, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, 1, NA, NA, NA, NA, NA, NA, NA, NA, 1, 
NA), seq_end = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 1L, 
1L, 1L, 1L, NA, NA, 1L, 1L, 1L, NA, 1L, NA, NA, NA, NA, NA, 1L, 
1L, NA, NA, 1L, 1L, NA, 1L, 1L, 1L, 1L, NA, NA, NA, 1L, 1L, NA, 
NA, NA, NA, NA, NA, 1L, NA, 1L, 1L, NA, 1L, 1L, NA, NA, 1L, 1L, 
1L), seq = c(NA, NA, NA, NA, NA, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 
NA, 3L, NA, NA, NA, NA, NA, NA, 4L, 5L, 5L, 5L, 5L, 5L, 5L, 6L, 
7L, 7L, 7L, 8L, NA, NA, NA, 9L, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, 10L, 10L, NA, NA, NA, NA, NA, NA, NA, 11L, 
NA)), .Names = c("seq_start", "seq_end", "seq"), class = c("tbl_df", 
"tbl", "data.frame"), row.names = c(NA, -60L))
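
To make the rule concrete, here is a minimal base R sketch that applies it row by row with a plain loop. The function name `label_sequences` and the toy vectors are hypothetical and not part of the original post. The sketch also assumes that a new start closes any sequence that is still open, a case that does not occur in the data above.

```r
# Hypothetical reference implementation of the rule described in the question.
# Assumption: a new start closes any sequence that is still open.
label_sequences <- function(seq_start, seq_end) {
  n <- length(seq_start)
  out <- rep(NA_integer_, n)
  current <- 0L    # number of sequences started so far
  open <- FALSE    # TRUE while the current sequence has not ended yet

  for (i in seq_len(n)) {
    if (!is.na(seq_start[i])) {
      # A start opens a new sequence.
      current <- current + 1L
      open <- TRUE
    }
    if (open) {
      out[i] <- current
      # The first end at or after the start closes the sequence.
      if (!is.na(seq_end[i])) open <- FALSE
    }
  }
  out
}

# Toy input: an end before any start is ignored; row 5 starts and ends at once.
toy_start <- c(NA, 1, NA, NA, 1, NA, 1)
toy_end   <- c(1, NA, NA, 1, 1, 1, NA)
toy_seq   <- label_sequences(toy_start, toy_end)
print(toy_seq)
```

A loop like this is not a tidyverse answer, but it can serve as a reference when checking a vectorised solution.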

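【Comments】:

  • Can you provide a sample of the output?
  • Is there a reason why the values of seq jump from 5 to 7 (skipping 6)? Based on the logic you described, I'm not sure I understand how it works.
  • @Salman The output sample is provided in the seq column.
  • @Z.Lin No, that was my mistake (I have corrected it).

Tags: r tidyverse seq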


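【Solution 1】:

Here is a solution that makes heavy use of the lag() function from the dplyr package and cumsum() from the base package to produce the expected result. It may not be the most concise solution, but I think it is fairly intuitive to follow: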

library(dplyr)

# d is the data frame from the question (the structure(...) output above).
d <- d %>%

  # new.seq.starts starts from 0, and increments by 1 every time seq_start takes on 
  # the value 1, like this: 0, 0, 0, 1, 1, 1, 1, 2, 2, ...
  # Rows with the same new.seq.starts value are thus part of the same "run".
  mutate(new.seq.starts = cumsum(!is.na(seq_start))) %>%

  # group by each "run"
  group_by(new.seq.starts) %>%

  # any.ending.so.far counts how many rows with seq_end == 1 have occurred within
  # the run so far (0 means the run has not ended yet).
  # first.ending is TRUE only if it's the first row (within the run) to have an ending.
  mutate(any.ending.so.far = cumsum(!is.na(seq_end)),
         first.ending = any.ending.so.far == 1 &
           (is.na(lag(any.ending.so.far)) | lag(any.ending.so.far) < 1)) %>%
  ungroup() %>%

  # result keeps the new.seq.starts values only if there's no ending yet (i.e. 
  # any.ending.so.far == 0), or only just ended (first.ending == TRUE). Otherwise,
  # it takes on the value NA.
  mutate(result = ifelse(new.seq.starts > 0 &
                           (any.ending.so.far == 0 | first.ending),
                         new.seq.starts, NA)) %>%

  # Remove helper variables as they are no longer needed.
  select(-c(new.seq.starts, any.ending.so.far, first.ending))

> all.equal(d$seq, d$result)
[1] TRUE
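
To see what the helper columns contain, the following sketch runs the same steps on a small toy tibble and keeps the intermediate columns. The toy data and the shorter column names (`run`, `endings`, `first_end`) are made up for illustration. It also uses `lag(endings, default = 0L)` in place of the `is.na(lag(...))` check, which is a small variation and not the code from the answer.

```r
library(dplyr)

# Hypothetical toy data, not from the original post.
toy <- tibble(
  seq_start = c(NA, 1, NA, NA, 1, NA, 1),
  seq_end   = c(1, NA, NA, 1, 1, 1, NA)
)

steps <- toy %>%
  # run: 0 before the first start, then +1 at every start
  mutate(run = cumsum(!is.na(seq_start))) %>%
  group_by(run) %>%
  # endings: number of seq_end == 1 rows seen so far within the run
  # first_end: TRUE only on the row where endings goes from 0 to 1
  mutate(endings = cumsum(!is.na(seq_end)),
         first_end = endings == 1 & lag(endings, default = 0L) == 0) %>%
  ungroup() %>%
  # keep the run id while the run is still open or has only just ended
  mutate(result = ifelse(run > 0 & (endings == 0 | first_end), run, NA))

print(steps)
```

Row 1 has an end but no start, so `run` is 0 and `result` stays NA. Row 5 starts and ends in the same row, so it forms a sequence of length 1 and row 6 is NA again.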

【Discussion】:
