【问题标题】:R: generate new var based on index numbers [i] within input var nameR:根据输入变量名称中的索引号 [i] 生成新变量
【发布时间】:2021-03-15 05:28:48
【问题描述】:

我有一个数据集,其中包含每个 ID 的一系列日期。我已经生成了一系列超前和滞后变量,现在我想生成另一组变量,其中包含每行中超前和滞后变量之间的天数差异。当我生成超前和滞后变量时,我使用 paste0 为每个变量名称附加一个数字。例如,滞后变量被命名为 prev_date1:prev_date20。我希望能够使用这些数字生成另一组变量来计算对之间的天数差异。一般形式由下式给出: diff2prev[i] = prev_date[i-1] - prev_date[i]

但我不知道如何在实践中实现这一点。在我最初的方法中,我只有 7 个变量并将它们分别写出来(包括此示例代码),但现在我需要生成 7 个以上的变量,因此我想找到一种更有效的方法来做到这一点。如示例中所示,我已尝试使用 data.table 和 dplyr,但到目前为止都没有。任何关于我哪里出错以及如何改进我的代码的指针将不胜感激。


if (!require('pacman')) install.packages('pacman'); library(pacman) 
#> Loading required package: pacman
p_load("dplyr", "lubridate","tidyverse")

id <- c(13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15)

date <- c("2017-06-06", "2017-07-26", "2017-09-22", "2017-10-21", "2018-03-29", "2018-03-29", "2018-03-29", "2018-03-29", "2018-03-29", "2018-03-29", "2018-03-29", "2018-03-29", "2018-03-29", "2018-03-29", "2018-03-29", "2019-05-12", "2019-06-07", "2019-10-08","2016-10-20", "2016-10-20", "2016-10-20", "2016-10-20", "2018-01-06", "2018-01-06", "2018-01-06", "2018-01-06", "2018-01-06","2018-01-06", "2018-05-02", "2018-08-04", "2018-08-04", "2018-08-04", "2018-11-22", "2018-12-26", "2018-12-26", "2018-12-26", "2018-12-26", "2018-12-26", "2018-12-26", "2018-12-26", "2018-12-26", "2018-12-26", "2018-12-26", "2018-12-26", "2019-05-11","2019-06-04", "2019-11-18", "2016-04-01", "2018-04-04", "2019-04-03", "2019-04-04", "2019-04-04", "2019-04-04", "2019-04-04","2019-04-04", "2019-04-04", "2019-04-04", "2019-04-04", "2019-04-04", "2019-04-04", "2019-04-04", "2019-06-03", "2019-06-04", "2019-11-23")

sample <- bind_cols(id, date)
#> New names:
#> * NA -> ...1
#> * NA -> ...2

colnames <- c("id", "date")

names(sample) <- colnames

sample <- sample %>% 
  group_by(id) %>% 
  mutate(date = as_date(date))

#Using data.table shift/lag to create 20 prev dates

sample[,paste0('prev_date', 1:20) := shift(date, 1:20, type="lag"), by = id][]


#Using data.table shift/lead to create 20 prev dates
       
sample[,paste0('next_date', 1:20) := shift(date, 1:20, type="lead"), by = id][]  

这是我目前尝试过的

## Dplyr approach to writing out each new variable
##This works but seems inefficient
sample <- sample %>%
  group_by(id) %>%
  mutate(diff2prev = date - prev_date,
       diff2prev1 = prev_date - prev_date1,
       diff2prev2 = prev_date1 - prev_date2,
       diff2prev3 = prev_date2 - prev_date3,
       diff2prev4 = prev_date3 - prev_date4,
       diff2prev5 = prev_date4 - prev_date5,
       diff2prev6 = prev_date5 - prev_date6,
       diff2prev7 = prev_date6 - prev_date7,
       diff2next = next_date - date,
       diff2next1 = next_date1 - next_date,
       diff2next2 = next_date2 - next_date1,
       diff2next3 = next_date3 - next_date2,
       diff2next4 = next_date4 - next_date3,
       diff2next5 = next_date5 - next_date4,
       diff2next6 = next_date6 - next_date5,
       diff2next7 = next_date7 - next_date6)

##Attempt at using data.table to generate variables but not sure how to incorporate the length of [i] for iteration
setDT(pid_ell)[,paste0('diff2prev', 1:20) := (diff2prev[i] = prev_date[i-1] - prev_date[i], 1:20), by = id][]

##Attempt to create a function that would create the new empty variables and then fill them in
#function to create variable calculating the difference in days to the previous date
fn_diff2prev <- function(date, prev_date) {
  for (i in 2:lead_lag){
    diff2prev[i] <- paste0('diff2prev', 1:20) # new var names
  }
    diff2prev1 <- date - prev_date1 #first one calculates from date
  for (i in 2:lead_lag){
    diff2prev[i] <- prev_date[i-1] - prev_date[i] #others calculate based on [i]
  }
    return
}

【问题讨论】:

    标签: r date data.table dplyr


    【解决方案1】:

    为什么不先计算diff(date),然后再计算shift

    sample[,c(paste0('date2prev', 1:20), paste0('date2next', 1:20)) := {
      days = c(NA, diff(date))
      c(shift(days, 0:19), shift(days, -1:-20)) 
    }, by = id]
    

    这里是输出的概述

        id prev_date2 prev_date1       date next_date1 next_date2 date2prev1 date2prev2 date2next1 date2next2
     1: 13       <NA>       <NA> 2017-06-06 2017-07-26 2017-09-22         NA         NA         50         58
     2: 13       <NA> 2017-06-06 2017-07-26 2017-09-22 2017-10-21         50         NA         58         29
     3: 13 2017-06-06 2017-07-26 2017-09-22 2017-10-21 2018-03-29         58         50         29        159
     4: 13 2017-07-26 2017-09-22 2017-10-21 2018-03-29 2018-03-29         29         58        159          0
     5: 13 2017-09-22 2017-10-21 2018-03-29 2018-03-29 2018-03-29        159         29          0          0
     6: 13 2017-10-21 2018-03-29 2018-03-29 2018-03-29 2018-03-29          0        159          0          0
     7: 13 2018-03-29 2018-03-29 2018-03-29 2018-03-29 2018-03-29          0          0          0          0
     8: 13 2018-03-29 2018-03-29 2018-03-29 2018-03-29 2018-03-29          0          0          0          0
     9: 13 2018-03-29 2018-03-29 2018-03-29 2018-03-29 2018-03-29          0          0          0          0
    10: 13 2018-03-29 2018-03-29 2018-03-29 2018-03-29 2018-03-29          0          0          0          0
    11: 13 2018-03-29 2018-03-29 2018-03-29 2018-03-29 2018-03-29          0          0          0          0
    12: 13 2018-03-29 2018-03-29 2018-03-29 2018-03-29 2018-03-29          0          0          0          0
    13: 13 2018-03-29 2018-03-29 2018-03-29 2018-03-29 2018-03-29          0          0          0          0
    14: 13 2018-03-29 2018-03-29 2018-03-29 2018-03-29 2019-05-12          0          0          0        409
    15: 13 2018-03-29 2018-03-29 2018-03-29 2019-05-12 2019-06-07          0          0        409         26
    16: 13 2018-03-29 2018-03-29 2019-05-12 2019-06-07 2019-10-08        409          0         26        123
    17: 13 2018-03-29 2019-05-12 2019-06-07 2019-10-08       <NA>         26        409        123         NA
    18: 13 2019-05-12 2019-06-07 2019-10-08       <NA>       <NA>        123         26         NA         NA
    19: 14       <NA>       <NA> 2016-10-20 2016-10-20 2016-10-20         NA         NA          0          0
    20: 14       <NA> 2016-10-20 2016-10-20 2016-10-20 2016-10-20          0         NA          0          0
    

    【讨论】:

      【解决方案2】:

      我认为您已经很好地获得了sample data.table 设置的结构。您在示例中错过了library(data.table)setDT(sample),但我认为它必须是data.table,否则shift 函数将不起作用。我会保持差异的设置简单并使用循环。我会使用data.table::set 函数,因为它使这些循环操作变得非常简单,例如

      N <- 20
      # define the column you want to set and the x & y such that the difference
      # is x - y
      ncols <- paste0("diff2prev", 1:N)
      x_cols <- c("date", paste0("prev_date", 1:(N-1)))
      y_cols <- paste0("prev_date", 1:N)
      
      #eg
      ncols[4]
      x_cols[4]
      y_cols[4]
      
      # loop and use data.table set
      for(i in 1:N){
        set(sample,
            j = ncols[i],
            value = sample[[x_cols[i]]] - sample[[y_cols[i]]])
      }
      

      【讨论】:

        【解决方案3】:

        这是一个矢量化操作,因此您可以创建要减去的变量,然后减去两个相同大小的数据帧。

        library(data.table)
        #Create data.table
        sample <- data.table(id, date)
        #Create next and previous dates
        sample[,paste0('prev_date', 1:5) := shift(date, 1:5, type="lag"), by = id][]
        sample[,paste0('next_date', 1:5) := shift(date, 1:5, type="lag"), by = id][]
        
        #Create vectors of next and previous column names along with "date" column
        p1 <- c('date', grep('prev_date', names(sample), value = TRUE))
        n1 <- c('date', grep('next_date', names(sample), value = TRUE))
        
        #Create new columns for the dataframe
        new_p1 <- paste0('new_prev', seq_along(p1[-1]))
        new_n1 <- paste0('new_next', seq_along(n1[-1]))
        
        #Convert to dataframe
        setDF(sample)
        
        #Perform subtract of the columns. 
        sample[new_p1] <- sample[p1[-length(p1)]] - sample[p1[-1]]
        sample[new_n1] <- sample[n1[-1]] - sample[n1[-length(n1)]]
        sample
        

        【讨论】:

          猜你喜欢
          • 2021-12-03
          • 1970-01-01
          • 1970-01-01
          • 2022-08-15
          • 2012-08-23
          • 2012-10-24
          • 1970-01-01
          • 1970-01-01
          • 2015-02-03
          相关资源
          最近更新 更多