【Question title】: Remove dates which are not continuous in the data in R
【Posted】: 2016-06-09 00:29:45
【Question】:

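I have a data frame and I want to filter out the entries whose dates are not continuous. In other words, I am looking for clusters of continuous dates.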

a %>% group_by(day) %>% summarise(count = n()) %>% mutate(day_dif = day - lag(day))

Source: local data frame [20 x 3]

          day count day_dif
       (date) (int)  (dfft)
1  2016-02-02    12 NA days
2  2016-02-03    80  1 days
3  2016-02-04   102  1 days
4  2016-02-05    97  1 days
5  2016-02-06   118  1 days
6  2016-02-07   115  1 days
7  2016-02-08     4  1 days
8  2016-02-20    13 12 days
9  2016-02-21   136  1 days
10 2016-02-22   114  1 days
11 2016-02-23   134  1 days
12 2016-02-24   126  1 days
13 2016-02-25   128  1 days
14 2016-02-26    63  1 days
15 2016-02-27   118  1 days
16 2016-03-06     1  8 days
17 2016-03-29    28 23 days
18 2016-04-03    18  5 days
19 2016-04-08    18  5 days
20 2016-04-27    23 19 days

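Here I want to filter out the entries whose dates are not continuous. For example, 2016-03-06, 2016-03-29 and 2016-04-03 are single-day entries that need to be removed. I am only looking for entries that span several consecutive days. The ideal output I am looking for is: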

          day count day_dif  Cluster
       (date) (int)  (dfft)
1  2016-02-02    12 NA days     1
2  2016-02-03    80  1 days     1
3  2016-02-04   102  1 days     1
4  2016-02-05    97  1 days     1
5  2016-02-06   118  1 days     1
6  2016-02-07   115  1 days     1 
7  2016-02-08     4  1 days     1
8  2016-02-20    13 12 days     2
9  2016-02-21   136  1 days     2
10 2016-02-22   114  1 days     2
11 2016-02-23   134  1 days     2
12 2016-02-24   126  1 days     2
13 2016-02-25   128  1 days     2
14 2016-02-26    63  1 days     2
15 2016-02-27   118  1 days     2

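Here the Cluster column indicates the cluster of dates, and the output drops the single dates. A 1 in the Cluster column marks the first group of dates and a 2 marks the second group. If there are more than 3 continuous days, I want to treat them as one cluster.

I have been trying to do this with the lag function and similar approaches, but without much success. Can someone help me with this? Any ideas would be appreciated.

Thanks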

【Discussion】:

    Tags: r dplyr


    【Solution 1】:

    We can use rle to subset the rows.

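    # TRUE where a new run starts: the first row, or a gap of 3 or more days
    i1 <- c(TRUE, a1$day_dif[-1] >= 3)
    # Keep rows inside a run (i1 is FALSE) and short stretches of run starts;
    # drop stretches of more than 3 consecutive run starts (the scattered dates)
    i2 <- inverse.rle(within.list(rle(i1), {values1 <- values
               values[values1 & lengths > 3] <- FALSE
               values[!values1] <- TRUE}))
    a1$Cluster <- cumsum(i1)
    a1[i2, ]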
    #          day count day_dif Cluster
    #1  2016-02-02    12 NA days       1
    #2  2016-02-03    80  1 days       1
    #3  2016-02-04   102  1 days       1
    #4  2016-02-05    97  1 days       1
    #5  2016-02-06   118  1 days       1
    #6  2016-02-07   115  1 days       1
    #7  2016-02-08     4  1 days       1
    #8  2016-02-20    13 12 days       2
    #9  2016-02-21   136  1 days       2
    #10 2016-02-22   114  1 days       2
    #11 2016-02-23   134  1 days       2
    #12 2016-02-24   126  1 days       2
    #13 2016-02-25   128  1 days       2
    #14 2016-02-26    63  1 days       2
    #15 2016-02-27   118  1 days       2
    

    The code above can also be written as a chain with %>%:

    a1 %>%
       mutate(i1 = c(TRUE, day_dif[-1] >=3))  %>%
       do(data.frame(., i2 = inverse.rle(within.list(rle(.$i1), {
                         values1 <- values
                         values[values1 & lengths >3] <- FALSE
                         values[!values1] <- TRUE
                          })))) %>%
       mutate(Cluster = cumsum(i1)) %>%
       filter(i2) %>% 
       select(-i1, -i2)
    #          day count day_dif Cluster
    #1  2016-02-02    12 NA days       1
    #2  2016-02-03    80  1 days       1
    #3  2016-02-04   102  1 days       1
    #4  2016-02-05    97  1 days       1
    #5  2016-02-06   118  1 days       1
    #6  2016-02-07   115  1 days       1
    #7  2016-02-08     4  1 days       1
    #8  2016-02-20    13 12 days       2
    #9  2016-02-21   136  1 days       2
    #10 2016-02-22   114  1 days       2
    #11 2016-02-23   134  1 days       2
    #12 2016-02-24   126  1 days       2
    #13 2016-02-25   128  1 days       2
    #14 2016-02-26    63  1 days       2
    #15 2016-02-27   118  1 days       2
    

    Data

    a <- structure(list(day = structure(c(16833, 16834, 16835, 16836, 
    16837, 16838, 16839, 16851, 16852, 16853, 16854, 16855, 16856, 
    16857, 16858, 16866, 16889, 16894, 16899, 16918), class = "Date"), 
    count = c(12L, 80L, 102L, 97L, 118L, 115L, 4L, 13L, 136L, 
    114L, 134L, 126L, 128L, 63L, 118L, 1L, 28L, 18L, 18L, 23L
    )), .Names = c("day", "count"), row.names = c("1", "2", "3", 
    "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15", 
    "16", "17", "18", "19", "20"), class = "data.frame")
    
    library(dplyr)  # provides %>%, mutate() and lag()

    a1 <- a %>%
            mutate(day_dif = day - lag(day))
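
Note that the rle code above drops scattered dates only when more than three of them follow each other (the `lengths > 3` condition), which happens to hold for this data. As a rough sketch of a variant that does not depend on that, one could number the runs with cumsum and use rle only to measure how long each run is. The toy dates and the names `run_id`, `run_len`, `keep` and `result` are made up for illustration and are not part of the original answer.

```r
# Toy dates (hypothetical): a 3-day run, one isolated date, then a 2-day run
day <- as.Date(c("2016-02-02", "2016-02-03", "2016-02-04",
                 "2016-02-10",
                 "2016-02-20", "2016-02-21"))

# Run id: increases whenever the gap to the previous date is not exactly 1 day
run_id <- cumsum(c(TRUE, diff(as.numeric(day)) != 1))

# rle() gives the length of each run; expand it back to one value per row
r <- rle(run_id)
run_len <- rep(r$lengths, r$lengths)

# Drop dates that stand alone
keep <- run_len > 1

# Renumber the surviving runs as 1, 2, ...
cluster <- match(run_id[keep], unique(run_id[keep]))
result <- data.frame(day = day[keep], Cluster = cluster)
print(result)
```

Changing `run_len > 1` to `run_len > 3` would keep only runs of more than 3 days, which may be closer to what the question asks for.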
    

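      【Solution 2】:

      There may be a better way to handle the first NA value; here I manually set it to 0. Then, because consecutive dates differ by exactly 1, you can use this property to build a logical vector and take its cumsum to get the cluster ids. Finally, remove the groups whose length equals 1.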

      # df is assumed to be the data frame that already has the day_dif column
      # Set the first NA to 0
      df[which(is.na(df), arr.ind=TRUE)] <- 0
      
      df %>% mutate(cluster=cumsum(day_dif !=1)) %>%
        group_by(cluster) %>% filter(length(cluster) > 1) %>% ungroup()
      
      # Source: local data frame [15 x 4]
      
      #          day count day_dif cluster
      #        (date) (int)  (dfft)   (int)
      # 1  2016-02-02    12  0 days       1
      # 2  2016-02-03    80  1 days       1
      # 3  2016-02-04   102  1 days       1
      # 4  2016-02-05    97  1 days       1
      # 5  2016-02-06   118  1 days       1
      # 6  2016-02-07   115  1 days       1
      # 7  2016-02-08     4  1 days       1
      # 8  2016-02-20    13 12 days       2
      # 9  2016-02-21   136  1 days       2
      # 10 2016-02-22   114  1 days       2
      # 11 2016-02-23   134  1 days       2
      # 12 2016-02-24   126  1 days       2
      # 13 2016-02-25   128  1 days       2
      # 14 2016-02-26    63  1 days       2
      # 15 2016-02-27   118  1 days       2
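
One possible way to avoid overwriting the NA is to treat a missing gap as the start of a new run inside the cumsum itself. The sketch below does that and also makes the minimum run length explicit. The toy data, the threshold `min_days` and the helper columns `gap` and `new_run` are assumptions for illustration, not part of the original answer.

```r
library(dplyr)

# Toy data (hypothetical): two 4-day runs followed by two isolated dates
df <- data.frame(
  day = as.Date(c("2016-02-02", "2016-02-03", "2016-02-04", "2016-02-05",
                  "2016-02-20", "2016-02-21", "2016-02-22", "2016-02-23",
                  "2016-03-06", "2016-03-29"))
)

min_days <- 4  # hypothetical threshold: keep runs of more than 3 days

res <- df %>%
  arrange(day) %>%
  mutate(gap = as.numeric(day) - lag(as.numeric(day)),
         # a new run starts on the first row (gap is NA)
         # or whenever the gap is not exactly 1 day
         new_run = is.na(gap) | gap != 1,
         cluster = cumsum(new_run)) %>%
  group_by(cluster) %>%
  filter(n() >= min_days) %>%
  ungroup() %>%
  select(day, cluster)

print(res)
```

The surviving clusters keep the ids they had before filtering, so if a short run sat between two long ones the ids would skip a number.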
      
