【Question title】: Is there a way of splitting time and duration variables while leaving other variables unchanged in R?
【Posted】: 2020-02-19 22:15:37
【Question description】:

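I have a dataset that I want to split at 12:00 pm (noon): a row that runs from 08:00 to 13:00 should become two rows, 08:00-12:00 and 12:00-13:00. The duration and cumulative-sum variables need to change accordingly, but the other variables should stay the same as in the original row.

This should work across different values of the id variable.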

id = unchanged from row 1, just repeated
start = changed in both rows
end = changed in both rows
day = unchanged from row 1, just repeated
duration = changed in both rows
cumulative time = changed in both rows

ORIGINAL DATAFILE
#Current dataframe
id<-c("m1","m1")
x<-c("2020-01-03 10:00:00","2020-01-03 19:20:00")
start<-strptime(x,"%Y-%m-%d %H:%M:%S")
y<-c("2020-01-03 16:00:00","2020-01-03 20:50:00")
end<-strptime(y,"%Y-%m-%d %H:%M:%S")
day<-c(1,1)
mydf<-data.frame(id,start,end,day)
# calculate duration and time
mydf$duration<-as.numeric(difftime(mydf$end,mydf$start,units = "hours"))
mydf$time<-c(cumsum(mydf$duration))

REQUIRED DATAFILE
#Required dataframe
id2<-c("m1","m1","m1")
x2<-c("2020-01-03 10:00:00","2020-01-03 12:00:00","2020-01-03 19:20:00")
start2<-strptime(x2,"%Y-%m-%d %H:%M:%S")
y2<-c("2020-01-03 12:00:00","2020-01-03 16:00:00","2020-01-03 20:50:00")
end2<-strptime(y2,"%Y-%m-%d %H:%M:%S")
day2<-c(1,1,1)
mydf2<-data.frame(id2,start2,end2,day2)
# calculate duration and time
mydf2$duration<-c(2,4,1.5)
mydf2$time<-c(2,6,7.5)

【Comments on the question】:

    Tags: r dataframe datetime


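    【Solution 1】:

    Nice question. Each row implicitly contains one or two intervals, so you should be able to define those intervals on each row and then pivot to long format, but you can't pivot with interval values (yet?). So here is my approach: it computes up to two shift start times for each row, and after pivoting it infers each shift's end from the start of the next shift. Comments are inline.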

    library(lubridate, warn.conflicts = FALSE)
    library(tidyverse)
    library(magrittr, warn.conflicts = FALSE)
    library(hablar, warn.conflicts = FALSE)
    
    (mydf <- tibble(
      id    = "m1",
      start = as_datetime(c("2020-01-03 10:00:00", "2020-01-03 19:20:00")),
      end   = as_datetime(c("2020-01-03 16:00:00", "2020-01-03 20:50:00")),
      day   = 1
    ))
    #> # A tibble: 2 x 4
    #>   id    start               end                   day
    #>   <chr> <dttm>              <dttm>              <dbl>
    #> 1 m1    2020-01-03 10:00:00 2020-01-03 16:00:00     1
    #> 2 m1    2020-01-03 19:20:00 2020-01-03 20:50:00     1
    
    (mydf2 <- 
        mydf %>% 
        # Assume the relevant noontime cutoff is on the same day as the start
        mutate(midday = 
                 start %>% as_date() %>% 
                 add(12 %>% hours()) %>% 
                 fit_to_timeline() %>% 
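                 # No relevant midday if the shift doesn't include noon.
                 # Note: this relies on hablar's condition-based na_if(), which
                 # masks dplyr's because hablar is attached last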
                 na_if(not(. %within% interval(start, end)))) %>% 
    
        # Make an original row ID since there doesn't seem to be one, and we will need
        # to build intervals within the data stemming from each original row
        rownames_to_column("orig_shift") %>% 
    
        pivot_longer(cols = c(start, midday, end),
                     # The timestamps we have here will be treated as start times
                     values_to = "start",
                     # Drop rows that would exist due to irrelevant middays
                     values_drop_na = TRUE) %>% 
        select(-name) %>% 
    
        # Infer shift end times as the start of the next shift, within lines defined
        # by the original shifts
        group_by(orig_shift) %>% 
        arrange(start) %>% 
        mutate(end = lead(start)) %>% 
        ungroup() %>% 
    
        # Drop lines that represent the end of the last shift and not a full one
        drop_na() %>% 
    
        # Compute those durations and times (should times really be globally
        # cumulative? Also, your specified mydf2 seems to have an incorrect first time
        # value)
        mutate(duration = start %--% end %>% as.numeric("hours"),
               time = cumsum(duration)) %>% 
        select(id, start, end, day, duration, time))
    #> # A tibble: 3 x 6
    #>   id    start               end                   day duration  time
    #>   <chr> <dttm>              <dttm>              <dbl>    <dbl> <dbl>
    #> 1 m1    2020-01-03 10:00:00 2020-01-03 12:00:00     1      2     2  
    #> 2 m1    2020-01-03 12:00:00 2020-01-03 16:00:00     1      4     6  
    #> 3 m1    2020-01-03 19:20:00 2020-01-03 20:50:00     1      1.5   7.5
    

    Created on 2019-10-23 by the reprex package (v0.3.0)
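
The inline comment above asks whether `time` should really be cumulative across the whole table. If it should instead restart for each `id`, the following is a minimal base R sketch. The table `split_df` and its `m2` rows are made up for illustration and are not part of the question's data.

```r
# Hypothetical split result with a second id, "m2", added for illustration
split_df <- data.frame(
  id       = c("m1", "m1", "m1", "m2", "m2"),
  duration = c(2, 4, 1.5, 3, 1)
)

# ave() applies cumsum() within each id and returns the values in row order
split_df$time <- ave(split_df$duration, split_df$id, FUN = cumsum)
split_df
```

In a dplyr pipeline the same idea would be `group_by(id)` before the `mutate(time = cumsum(duration))` step, followed by `ungroup()`.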

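    【Comments】:

    • Thanks for your reply. Your code currently gives this error: Error in pivot_longer(., cols = c(start, midday, end), values_to = "start", : could not find function "pivot_longer"
    • You need to download the development version of tidyr.
    • I think pivot_longer() is part of the released tidyr now, so there is no need for the development version or building from source. Maybe you just haven't updated tidyr from 0.8.3? Version 1.0 was released recently.
    • I updated it about 5 minutes ago and was thinking the same thing. Thanks a lot.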
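
To check for the situation discussed in these comments, one option is to compare the installed tidyr version against 1.0.0, the release that introduced `pivot_longer()`. This is only a sketch; the variable names `min_version` and `installed` are illustrative.

```r
# pivot_longer() was introduced in tidyr 1.0.0
min_version <- package_version("1.0.0")

if (requireNamespace("tidyr", quietly = TRUE)) {
  installed <- packageVersion("tidyr")
  if (installed >= min_version) {
    message("tidyr ", as.character(installed), " provides pivot_longer()")
  } else {
    message("tidyr ", as.character(installed),
            " is too old; run install.packages(\"tidyr\")")
  }
} else {
  message("tidyr is not installed; run install.packages(\"tidyr\")")
}
```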
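    【Solution 2】:

    Here is my solution for the more general case, where you have observations on many different dates. The logic is as follows.

    1. First, I create a data frame of the 12:00 pm (noon) split points.

    2. Next, I identify the rows that should be split by joining that data frame to the original rows, and I save those rows in a separate data frame.

    3. Next, I duplicate those rows and build split_rows.

    4. Finally, I remove the rows I split from the original dataset and join in the correct pairs of rows.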

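    library(dplyr)
    
    # One candidate split point per day: noon in the session's local time zone,
    # which matches mydf when it is built with strptime() as in the question.
    # Adjust the start date, length, or time zone for your own data.
    split_time_data =
      tibble(split_time = seq(as.POSIXct("2020-01-01 12:00:00"),
                              by = "day", length.out = 366)) %>%
      mutate(key = TRUE)
    
    # Cross join every row with every split point (via the constant key), then
    # keep only the rows whose interval strictly contains a split point
    data_to_split =
      mydf %>%
      mutate(key = TRUE) %>%
      left_join(split_time_data, by = "key") %>%
      filter(split_time > start, split_time < end) %>%
      select(-key)
    
    # Duplicate each row to split: the first copy ends at the split point and
    # the second copy starts there
    split_rows =
      data_to_split %>%
      rbind(data_to_split) %>%
      arrange(id, start) %>%
      group_by(id, start) %>%
      mutate(row_number = row_number()) %>%
      ungroup() %>%
      mutate(start = if_else(row_number == 1, start, split_time),
             end = if_else(row_number == 1, split_time, end)) %>%
      select(-row_number, -split_time) %>%
      # difftime() keeps the minutes, unlike a difference of hour() values
      mutate(duration = as.numeric(difftime(end, start, units = "hours")))
    
    # Replace the original rows with their split versions and recompute the
    # cumulative time
    mydf %>%
      anti_join(data_to_split) %>%
      full_join(split_rows) %>%
      arrange(start) %>%
      mutate(time = cumsum(duration))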
    

    Output

      id               start                 end day duration time
    1 m1 2020-01-03 10:00:00 2020-01-03 12:00:00   1      2.0  2.0
    2 m1 2020-01-03 12:00:00 2020-01-03 16:00:00   1      4.0  6.0
    3 m1 2020-01-03 19:20:00 2020-01-03 20:50:00   1      1.5  7.5
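
For comparison, the same split can be sketched in base R without a join. This is an alternative sketch, not the method used in either answer: the function name `split_at_noon` and its `tz` argument are hypothetical, it assumes each row starts and ends on the same calendar day, and it accumulates `time` within each `id` (identical to a global cumulative sum when there is only one id).

```r
# Split every row whose interval strictly contains noon of its start date.
# Assumption: start and end fall on the same calendar day, as in the question.
split_at_noon <- function(df, tz = "UTC") {
  noon <- as.POSIXct(paste(format(df$start, "%Y-%m-%d", tz = tz), "12:00:00"),
                     tz = tz)
  crosses <- df$start < noon & noon < df$end

  # Morning part of split rows, or the whole row when no split is needed
  first <- df
  first$end[crosses] <- noon[crosses]

  # Afternoon part, only for the rows that are split
  second <- df[crosses, , drop = FALSE]
  second$start <- noon[crosses]

  out <- rbind(first, second)
  out <- out[order(out$id, out$start), , drop = FALSE]
  rownames(out) <- NULL

  # Recompute the derived columns from the new start and end times
  out$duration <- as.numeric(difftime(out$end, out$start, units = "hours"))
  out$time <- ave(out$duration, out$id, FUN = cumsum)
  out
}

# The question's data, with an explicit time zone so the result is reproducible
mydf <- data.frame(
  id    = "m1",
  start = as.POSIXct(c("2020-01-03 10:00:00", "2020-01-03 19:20:00"), tz = "UTC"),
  end   = as.POSIXct(c("2020-01-03 16:00:00", "2020-01-03 20:50:00"), tz = "UTC"),
  day   = 1
)
result <- split_at_noon(mydf)
result
```

All columns other than `start`, `end`, `duration` and `time` are copied from the original row, which is what the question asks for.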
    

    【Comments】:
