【Question title】: cumsum NAs and other condition R
【Posted】: 2020-01-31 00:45:16
【Question】:

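I have seen many questions like this, but I can't figure out this simple one. I don't want to collapse the dataset. Suppose I have this dataset: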

library(tidyverse)
library(lubridate)
df <- data.frame(group = c("a", "a", "a", "a", "a", "b", "b", "b"),
                 starts = c("2011-09-18", NA,  "2014-08-08", "2016-09-18", NA, "2013-08-08", "2015-08-08", NA),
                 ends = c(NA, "2013-03-06", "2015-08-08", NA, "2017-03-06", "2014-08-08", NA, "2016-08-08"))
df$starts <- parse_date_time(df$starts, "ymd")
df$ends <- parse_date_time(df$ends, "ymd")
df

  group     starts       ends
1     a 2011-09-18       <NA>
2     a       <NA> 2013-03-06
3     a 2014-08-08 2015-08-08
4     a 2016-09-18       <NA>
5     a       <NA> 2017-03-06
6     b 2013-08-08 2014-08-08
7     b 2015-08-08       <NA>
8     b       <NA> 2016-08-08

The desired output is:

  group     starts       ends epi
1     a 2011-09-18       <NA>   1
2     a       <NA> 2013-03-06   1
3     a 2014-08-08 2015-08-08   2
4     a 2016-09-18       <NA>   3
5     a       <NA> 2017-03-06   3
6     b 2013-08-08 2014-08-08   1
7     b 2015-08-08       <NA>   2
8     b       <NA> 2016-08-08   2

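I was thinking of something like this, but it obviously doesn't account for episodes that have no NA (a start and an end in the same row):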

df <- df %>% 
  group_by(group) %>% 
  mutate(epi = cumsum(is.na(ends)))
df
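To see why this attempt falls short, it can help to look at the flags being summed. Below is a minimal base R sketch for group "a" only; the vector names `ends_a`, `attempt` and `wanted` are illustrative and not part of the original post.

```r
# 'ends' for group "a", in row order (NA marks a row with no end date)
ends_a <- as.Date(c(NA, "2013-03-06", "2015-08-08", NA, "2017-03-06"))

# The attempt counts the rows whose end date is missing
attempt <- cumsum(is.na(ends_a))
attempt            # 1 1 1 2 2

# Desired episode numbers for the same rows
wanted <- c(1, 1, 2, 3, 3)

# Row 3 starts and ends in the same row, so is.na(ends) never flags it
attempt == wanted  # TRUE TRUE FALSE FALSE FALSE
```

The counter only advances on rows with a missing end, so an episode that is fully contained in one row is absorbed into the previous one.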

I'm not sure how to combine cumsum(is.na()) with a conditional if_else. Maybe I'm going down the wrong path?

Any suggestions would be great.

【Comments】:

    Tags: r if-statement dplyr cumsum


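    【Solution 1】:

    A solution using dplyr. It assumes your data frame is well structured: every start always has an associated end record, and each episode begins with a row whose starts is not NA. Under that assumption, a running count of the non-NA starts within each group gives the episode number.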

    df2 <- df %>%
      group_by(group) %>%
      mutate(epi = cumsum(!is.na(starts))) %>%
      ungroup()
    df2
    # # A tibble: 8 x 4
    #   group starts              ends                  epi
    #   <fct> <dttm>              <dttm>              <int>
    # 1 a     2011-09-18 00:00:00 NA                      1
    # 2 a     NA                  2013-03-06 00:00:00     1
    # 3 a     2014-08-08 00:00:00 2015-08-08 00:00:00     2
    # 4 a     2016-09-18 00:00:00 NA                      3
    # 5 a     NA                  2017-03-06 00:00:00     3
    # 6 b     2013-08-08 00:00:00 2014-08-08 00:00:00     1
    # 7 b     2015-08-08 00:00:00 NA                      2
    # 8 b     NA                  2016-08-08 00:00:00     2
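If you prefer not to depend on dplyr, the same idea can be sketched in base R with `ave()`. This is only an illustration under the same assumption as above (each episode begins with a non-NA `starts`); it rebuilds the sample data with `as.Date()` instead of `parse_date_time()` to stay package-free.

```r
# Same sample data as in the question, using Date columns for brevity
df <- data.frame(
  group  = c("a", "a", "a", "a", "a", "b", "b", "b"),
  starts = as.Date(c("2011-09-18", NA, "2014-08-08", "2016-09-18", NA,
                     "2013-08-08", "2015-08-08", NA)),
  ends   = as.Date(c(NA, "2013-03-06", "2015-08-08", NA, "2017-03-06",
                     "2014-08-08", NA, "2016-08-08"))
)

# ave() applies cumsum() within each group and returns the values in row order
df$epi <- ave(as.integer(!is.na(df$starts)), df$group, FUN = cumsum)
df$epi  # 1 1 2 3 3 1 2 2
```

Because `ave()` keeps the original row order, the data does not need to be sorted by group first, although rows within each group still need to be in chronological order.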
    

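    【Comments】:

      【Solution 2】:

      One option is to count the NA elements in each row with rowSums over the columns 'starts' and 'ends', then, grouped by 'group', take the rleid (run-length id) of that count stored in 'epi':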

      library(dplyr)
      library(data.table)
      df %>% 
          mutate(epi =  rowSums(is.na(.[c("starts", "ends")]))) %>% 
          group_by(group) %>%
          mutate(epi = rleid(epi))
      # A tibble: 8 x 4
      # Groups:   group [2]
      #  group starts              ends                  epi
      #  <fct> <dttm>              <dttm>              <int>
      #1 a     2011-09-18 00:00:00 NA                      1
      #2 a     NA                  2013-03-06 00:00:00     1
      #3 a     2014-08-08 00:00:00 2015-08-08 00:00:00     2
      #4 a     2016-09-18 00:00:00 NA                      3
      #5 a     NA                  2017-03-06 00:00:00     3
      #6 b     2013-08-08 00:00:00 2014-08-08 00:00:00     1
      #7 b     2015-08-08 00:00:00 NA                      2
      #8 b     NA                  2016-08-08 00:00:00     2
      

      If only these two columns are involved, the NA flags can be added directly:

      df %>% 
        group_by(group) %>%
        mutate(epi = rleid(is.na(starts) + is.na(ends)))
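One caveat that may be worth checking on your own data: the run-length approach numbers runs of equal NA counts, so it separates episodes only when the count changes between them. The sketch below uses a hypothetical group (not from the question) with two open/close pairs back to back, and a small helper `run_id()` written here as a base R stand-in for `data.table::rleid()`.

```r
# Base R stand-in for data.table::rleid(): number consecutive runs of equal values
run_id <- function(x) {
  r <- rle(x)
  rep(seq_along(r$lengths), r$lengths)
}

# Hypothetical group: two start/end pairs in a row, no single-row episode between them
starts <- as.Date(c("2011-09-18", NA, "2016-09-18", NA))
ends   <- as.Date(c(NA, "2013-03-06", NA, "2017-03-06"))

na_count <- is.na(starts) + is.na(ends)  # 1 1 1 1
run_id(na_count)                         # 1 1 1 1: both episodes fall into one run
cumsum(!is.na(starts))                   # 1 1 2 2: each start opens a new episode
```

On the sample data in the question this situation does not occur, because a complete single-row episode always sits between the two-row ones, so both solutions give the same result there.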
      

      【Comments】:
