【Question title】:Complex reshaping of a data frame, tracking record edits
【Posted】:2019-11-21 16:13:21
【Question】:

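I have a data frame that tracks when a project record was edited, which field was edited, and the old and new values. The OrderDate, Probability, and Total columns show the values of those fields as of today: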

df.raw <- data.frame(project=rep(c('A','B'), each=4),
               createDate=as.Date(rep(c('2015-01-01','2017-05-01'), each=4)),
               editDate=as.Date(c('2018-06-01','2019-04-01','2019-05-01','2019-06-01', '2018-10-01','2018-11-01','2018-11-15','2019-01-01')), 
               editField=c('OrderDate', 'OrderDate','Probability','Probability', 'Total','Total', 'Probability','Total'),
               oldValue=c('2018-06-01','2019-05-01',20,30,500,550,70,400),
               newValue=c('2019-05-01','2019-06-01',30,50,550,400,30,450),
               OrderDate=as.Date(rep(c('2019-06-01','2019-01-01'), each=4)),
               Probability=rep(c(50,70), each=4),
               Total=rep(c(10,450), each=4))

  project createDate   editDate   editField   oldValue   newValue  OrderDate Probability Total
1       A 2015-01-01 2018-06-01   OrderDate 2018-06-01 2019-05-01 2019-06-01          50    10
2       A 2015-01-01 2019-04-01   OrderDate 2019-05-01 2019-06-01 2019-06-01          50    10
3       A 2015-01-01 2019-05-01 Probability         20         30 2019-06-01          50    10
4       A 2015-01-01 2019-06-01 Probability         30         50 2019-06-01          50    10
5       B 2017-05-01 2018-10-01       Total        500        550 2019-01-01          70   450
6       B 2017-05-01 2018-11-01       Total        550        400 2019-01-01          70   450
7       B 2017-05-01 2018-11-15 Probability         70         30 2019-01-01          70   450
8       B 2017-05-01 2019-01-01       Total        400        450 2019-01-01          70   450

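I would like to transform this data frame so that:

  • There is an additional row for the creation of each project.
  • Each row shows the project's OrderDate, Probability, and Total values as of that creation or edit.
  • If a field was never edited, it always equals the project's final OrderDate, Probability, or Total value.

The final result would look like this: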

df.reshaped <- data.frame(project=rep(c('A','B'), each=5),
           editDate=as.Date(c('2015-01-01','2018-06-01','2019-04-01','2019-05-01','2019-06-01', '2017-05-01', '2018-10-01','2018-11-01','2018-11-15','2019-01-01')),
           editField=c('Created','OrderDate', 'OrderDate','Probability','Probability','Created', 'Total','Total', 'Probability','Total'),
           OrderDateAtEdit=as.Date(c('2018-06-01','2019-05-01','2019-06-01','2019-06-01','2019-06-01',rep('2019-01-01', 5))),
           ProbabilityAtEdit=c(20,20,20,30,50,70,70,70,30,30),
           TotalAtEdit=c(10,10,10,10,10,500,550,400,400,450))
   project   editDate   editField OrderDateAtEdit ProbabilityAtEdit TotalAtEdit
1        A 2015-01-01     Created      2018-06-01                20          10
2        A 2018-06-01   OrderDate      2019-05-01                20          10
3        A 2019-04-01   OrderDate      2019-06-01                20          10
4        A 2019-05-01 Probability      2019-06-01                30          10
5        A 2019-06-01 Probability      2019-06-01                50          10
6        B 2017-05-01     Created      2019-01-01                70         500
7        B 2018-10-01       Total      2019-01-01                70         550
8        B 2018-11-01       Total      2019-01-01                70         400
9        B 2018-11-15 Probability      2019-01-01                30         400
10       B 2019-01-01       Total      2019-01-01                30         450

I don't know where to start, so any help would be greatly appreciated! Thanks.

【Discussion】:

    Tags: r data.table reshape2


    【Solution 1】:

    I think the data have already been merged together, and you need to split them back into an events table and an edits table:

    library(data.table)  # fifelse() requires data.table >= 1.12.4
    setDT(df.raw)
    
    #create the events table first, filled with the current (final) values
    cols <- c("OrderDate", "Probability", "Total")
    events <- df.raw[, setnames(rbindlist(.(.(createDate[1L], "Created"), 
        .(editDate, editField))), c("editDate","editField")), project]
    events[unique(df.raw, by=c("project", "Probability", "Total")), on=.(project), 
        paste0(cols, "AtEdit") := lapply(mget(cols), as.character)]
    
    #historical edits in another table: one row per interval during which a value held
    #as.character() guards against oldValue/newValue being factors (the default in R < 4.0)
    edits <- df.raw[, .(startDate=c(createDate[1L], editDate), 
        endDate=c(editDate, as.Date("9999-12-31")),
        value=c(as.character(oldValue), as.character(newValue[.N]))), .(project, editField)]
    
    #perform non-equi joins to update events table
    for (x in cols) {
        cn <- paste0(x, "AtEdit")
        #value in force at each event date; NA if the field was never edited
        v <- edits[editField==x][events, on=.(project, startDate<=editDate, endDate>editDate), value] 
        events[, (cn) := fifelse(is.na(v), get(cn), as.character(v))]  
    }
    

    Output:

        project   editDate   editField OrderDateAtEdit ProbabilityAtEdit TotalAtEdit
     1:       A 2015-01-01     Created      2018-06-01                20          10
     2:       A 2018-06-01   OrderDate      2019-05-01                20          10
     3:       A 2019-04-01   OrderDate      2019-06-01                20          10
     4:       A 2019-05-01 Probability      2019-06-01                30          10
     5:       A 2019-06-01 Probability      2019-06-01                50          10
     6:       B 2017-05-01     Created      2019-01-01                70         500
     7:       B 2018-10-01       Total      2019-01-01                70         550
     8:       B 2018-11-01       Total      2019-01-01                70         400
     9:       B 2018-11-15 Probability      2019-01-01                30         400
    10:       B 2019-01-01       Total      2019-01-01                30         450
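
The interval lookup in the loop above can be hard to picture, so here is a minimal sketch of the same non-equi join on toy data. The tables `history` and `lookups` and the variable `valueAtEdit` are hypothetical names introduced only for this illustration. The rows mirror project A's Probability history from the question, and the sketch assumes each lookup date falls into exactly one interval.

```r
library(data.table)

# Hypothetical toy history for one field: each row says which value was
# in force from startDate (inclusive) until endDate (exclusive).
history <- data.table(
  project   = "A",
  startDate = as.Date(c("2015-01-01", "2019-05-01", "2019-06-01")),
  endDate   = as.Date(c("2019-05-01", "2019-06-01", "9999-12-31")),
  value     = c("20", "30", "50")
)

# Dates at which we want to know the value that was in force.
lookups <- data.table(
  project  = "A",
  editDate = as.Date(c("2015-01-01", "2018-06-01", "2019-05-01", "2019-06-01"))
)

# Non-equi join: for each row of lookups, pick the history row whose
# interval [startDate, endDate) contains editDate, and return its value.
valueAtEdit <- history[lookups,
                       on = .(project, startDate <= editDate, endDate > editDate),
                       value]

lookups[, ProbabilityAtEdit := valueAtEdit]
print(lookups)
```

Because the end of each interval is exclusive, an edit dated 2019-05-01 already sees the new value rather than the old one, which is the behaviour the desired output in the question asks for.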
    

    【Comments】:

    • At the step "#perform non-equi joins to update events table" I get this error: Error in fifelse(is.na(v), get(cn), v) : 'yes' is of type character but 'no' is of type integer. Please make sure that both arguments have the same type. Otherwise, great.
    • @EricFrey I have added a fix.
    • It looks like I am now getting factor levels instead of dates. OrderDateAtEdit 1: 2 2: 3 3: 2 4: 2 5: 2
    • Maybe try converting the columns of your df.raw to character first?
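
The last suggestion can be sketched as follows. This is only an illustration, assuming the level codes come from `data.frame()` turning strings into factors, which was the default before R 4.0. The helper `factors_to_character` and the `toy` data frame are hypothetical and not part of the original answer. It would be applied to `df.raw` before calling `setDT()`.

```r
# Hypothetical helper: convert every factor column of a data frame to
# character and leave all other columns (dates, numerics) untouched.
factors_to_character <- function(df) {
  df[] <- lapply(df, function(col) {
    if (is.factor(col)) as.character(col) else col
  })
  df
}

# Toy data mimicking df.raw under R < 4.0, where data.frame() defaulted to
# stringsAsFactors = TRUE.
toy <- data.frame(
  editField = c("OrderDate", "Probability"),
  oldValue  = c("2018-06-01", "20"),
  stringsAsFactors = TRUE
)

codes <- as.integer(toy$oldValue)  # integer level codes, not the stored values
toy <- factors_to_character(toy)
str(toy)                           # both columns are now character
```

Alternatively, building the data frame with `stringsAsFactors = FALSE` in the first place avoids the factors altogether.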