【Question title】:How to sum every row within a date range given in variables
【Posted】:2020-04-29 03:51:09
【Question description】:

df1:

library(tidyverse)
library(lubridate)
ex1 <- tibble(date = seq.Date(from = ymd('20200101'), length.out = 100, by = 'day'),
          a = rnorm(100, mean = 1, sd = 2),
          b = runif(100, min = 1, max = 2),
          c = rnorm(100, mean = 3, sd = 1),
          d = runif(100, min = 50, max = 60))

df2:

cal_c <- tibble(variable = c('a', 'b', 'c', 'd'),
                start = ymd(c('20200101', '20200103', '20200203', '20200103')),
                end = ymd(c('20200204', '20200405', '20200301', '20200401')),
                total = NA_real_)  # numeric placeholder; the quoted 'NA' would be a character string

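For each row of df2, I want to compute total by summing the matching column of df1 over the date range from start to end. For example, the total for a should cover '2020-01-01' to '2020-02-04', and the total for b should cover '2020-01-03' to '2020-04-05'. Any help is much appreciated.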
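To make the goal concrete, here is a minimal sketch of the computation for a single row. It uses a tiny made-up table so the result can be checked by hand; `toy_df`, `toy_start`, and `toy_end` are illustrative names, not part of the question's data.

```r
# Hypothetical toy data: 5 days, one variable `a` with known values.
toy_df <- data.frame(
  date = seq(as.Date("2020-01-01"), by = "day", length.out = 5),
  a    = c(1, 2, 3, 4, 5)
)
toy_start <- as.Date("2020-01-02")
toy_end   <- as.Date("2020-01-04")

# Keep only rows whose date falls in [toy_start, toy_end], then sum column `a`.
in_range  <- toy_df$date >= toy_start & toy_df$date <= toy_end
toy_total <- sum(toy_df$a[in_range])
toy_total  # 2 + 3 + 4 = 9
```

The question asks for the same operation repeated for every row of df2, with the column name taken from `variable`.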

【Comments】:

    Tags: r tidyverse lubridate


    【Solution 1】:

    We can create a sequence of dates from start to end for each row of cal_c, reshape ex1 to long format, and join the two. Then we can sum value for each variable.

    library(tidyverse)
    
    cal_c %>%
      mutate(date = map2(start, end, seq, by = 'day')) %>%
      unnest(date) %>%
      left_join(ex1 %>% pivot_longer(cols = -date, names_to = 'variable'),
                by = c('variable', 'date')) %>%
      group_by(variable, start, end) %>%
      summarise(value = sum(value, na.rm = TRUE))
    
    #  variable start      end         value
    #  <chr>    <date>     <date>      <dbl>
    #1 a        2020-01-01 2020-02-04   34.3
    #2 b        2020-01-03 2020-04-05  136. 
    #3 c        2020-02-03 2020-03-01   79.5
    #4 d        2020-01-03 2020-04-01 4909. 
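If dplyr 1.1.0 or later is available (an assumption; no version is stated in the answer), the per-row date sequence can be skipped by joining on inequalities with `join_by()`. The sketch below uses a reduced, two-variable version of the question's data and an arbitrary seed; `ex1_long` and `totals` are illustrative names.

```r
library(dplyr)
library(tidyr)

set.seed(1)  # arbitrary seed, only so the sketch is reproducible
ex1 <- tibble(date = seq(as.Date("2020-01-01"), by = "day", length.out = 100),
              a = rnorm(100, mean = 1, sd = 2),
              b = runif(100, min = 1, max = 2))
cal_c <- tibble(variable = c("a", "b"),
                start = as.Date(c("2020-01-01", "2020-01-03")),
                end   = as.Date(c("2020-02-04", "2020-04-05")))

# One row per (date, variable) pair.
ex1_long <- pivot_longer(ex1, cols = -date, names_to = "variable")

# Match each cal_c row to the long rows with the same variable
# and a date inside [start, end], then sum per range.
totals <- cal_c %>%
  left_join(ex1_long, by = join_by(variable, start <= date, end >= date)) %>%
  group_by(variable, start, end) %>%
  summarise(total = sum(value, na.rm = TRUE), .groups = "drop")
```

This avoids materialising one row per day in cal_c before the join, which may matter when the ranges are long.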
    

    【Comments】:

      【Solution 2】:

      A base R solution:

      cal_c$total <- sapply(seq_len(nrow(cal_c)), function(i) {
        # Iterate by row index: split() on row names would order the pieces as strings ("1", "10", "2", ...)
        sum(ex1[[cal_c$variable[i]]][ex1$date >= cal_c$start[i] & ex1$date <= cal_c$end[i]])
      })
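A caveat for row-wise approaches that iterate via `split()` on row names: the pieces come back ordered by the names sorted as strings, so with ten or more rows the results no longer line up with the original rows. The sketch below shows this on made-up data (`toy_ex1`, `toy_cal`, and `range_sum` are hypothetical names) where the expected total for row i is simply i.

```r
toy_ex1 <- data.frame(
  date = seq(as.Date("2020-01-01"), by = "day", length.out = 11),
  a = as.numeric(1:11)
)
# 11 one-day ranges, so the expected total for row i is i.
toy_cal <- data.frame(
  variable = rep("a", 11),
  start = toy_ex1$date,
  end = toy_ex1$date,
  stringsAsFactors = FALSE
)

# Sum the named column of toy_ex1 over [start, end] for a one-row data frame x.
range_sum <- function(x) {
  sum(toy_ex1[[x$variable]][toy_ex1$date >= x$start & toy_ex1$date <= x$end])
}

# Splitting on row names orders the pieces as strings: "1", "10", "11", "2", ...
by_rowname <- sapply(split(toy_cal, rownames(toy_cal)), range_sum)
# Iterating over row indices keeps the original row order.
by_index <- vapply(seq_len(nrow(toy_cal)),
                   function(i) range_sum(toy_cal[i, ]), numeric(1))

unname(by_rowname)  # 1 10 11 2 3 4 5 6 7 8 9
by_index            # 1 2 3 4 5 6 7 8 9 10 11
```

With four rows, as in the question, both orders coincide, so the issue only appears on larger inputs.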
      

      【Comments】:

        【Solution 3】:

        An option using data.table:

        cal_c[, total :=
            ex1[cal_c, on=.(date>=start, date<=end), by=.EACHI,
                sum(.SD[[variable]])]$V1
            ]
        

        Output:

           variable      start        end      total
        1:        a 2020-01-01 2020-02-04   34.04780
        2:        b 2020-01-03 2020-04-05  135.40290
        3:        c 2020-02-03 2020-03-01   91.10271
        4:        d 2020-01-03 2020-04-01 4978.59884
        

        Data:

        set.seed(0L)
        library(data.table)
        ex1 <- data.table(date = seq.Date(from = as.IDate('20200101', format="%Y%m%d"), length.out = 100, by = 'day'),
            a = rnorm(100, mean = 1, sd = 2),
            b = runif(100, min = 1, max = 2),
            c = rnorm(100, mean = 3, sd = 1),
            d = runif(100, min = 50, max = 60))
        
        cal_c <- data.table(variable = c('a', 'b', 'c','d'),
            start = as.IDate(c('20200101', '20200103', '20200203', '20200103'), format="%Y%m%d"),
            end = as.IDate(c('20200204', '20200405', '20200301', '20200401'), format="%Y%m%d"))
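As a variation on the same idea, the wide table can first be melted to long format so that `variable` becomes an ordinary join key next to the two date inequalities. This is a sketch on small made-up data (`toy_ex1`, `toy_cal`, `toy_long`, and `res` are illustrative names), chosen so the totals can be verified by hand.

```r
library(data.table)

toy_ex1 <- data.table(
  date = seq(as.Date("2020-01-01"), by = "day", length.out = 6),
  a = c(1, 2, 3, 4, 5, 6),
  b = c(10, 20, 30, 40, 50, 60)
)
toy_cal <- data.table(
  variable = c("a", "b"),
  start = as.Date(c("2020-01-01", "2020-01-03")),
  end   = as.Date(c("2020-01-03", "2020-01-05"))
)

# Long format: one row per (date, variable); keep `variable` as character.
toy_long <- melt(toy_ex1, id.vars = "date",
                 variable.name = "variable", value.name = "value",
                 variable.factor = FALSE)

# Non-equi join: for each row of toy_cal, sum the matching values.
res <- toy_long[toy_cal,
                on = .(variable, date >= start, date <= end),
                .(total = sum(value)),
                by = .EACHI]
toy_cal[, total := res$total]
toy_cal$total  # 6 120
```

Here a covers 1 + 2 + 3 and b covers 30 + 40 + 50, matching the printed totals.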
        

        【Comments】:
