Table-driven evaluations using R data.table: answers

【Question title】: Table-driven evaluations using R data.table
【Posted】: 2016-07-04 01:40:23
【Question】:

What is the best way to build a table of conditions and evaluate each of them against a dataset?

For example, suppose I want to identify invalid rows in a dataset like this one:

library("data.table")

# notional example -- some observations are wrong, some missing
set.seed(1)
n = 100 # Number of customers.
        # Also included are "non-customers" where values except cust_id should be NA.
cust <- data.table( cust_id = sample.int(n+1),
                    first_purch_dt =
                      c(sample(as.Date(c(1:n, NA), origin="2000-01-01"), n), NA),
                    last_purch_dt = 
                      c(sample(as.Date(c(1:n, NA), origin="2000-04-01"), n), NA),
                    largest_purch_amt = 
                      c(sample(c(50:100, NA), n, replace=TRUE), NA),
                    last_purch_amt = 
                      c(sample(c(1:65,NA), n, replace=TRUE), NA)
                    )
setkey(cust, cust_id)

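The errors I want to check for in each observation are any occurrence of last_purch_dt < first_purch_dt or largest_purch_amt < last_purch_amt, plus any pattern of missing values other than all or none. (All values missing is fine for non-purchasers.)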

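Rather than writing a series of hard-coded expressions (which gets very long and is hard to document and maintain), I would like to store the expressions as strings in a table of conditions: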

checks <- data.table( cond_id = c(1L:3L),
                      cond_txt = c("last_purch_dt < first_purch_dt",
                                  "largest_purch_amt < last_purch_amt",
                                  paste("( is.na(first_purch_dt) + is.na(last_purch_dt) +",
                                          "is.na(largest_purch_amt) +",
                                          "is.na(last_purch_amt) ) %% 4 != 0") # hacky XOR  
                                  ),
                      cond_msg = c("Error: last purchase prior to first purchase.",
                                   "Error: largest purchase less than last purchase.",
                                   "Error: partial transaction record.")
                     )
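
The third condition counts the missing fields in a row and flags the row unless that count is 0 or 4. A minimal sketch of that arithmetic on its own, using a hypothetical vector na_count that stands in for the per-row sum of the four is.na() calls:

```r
# na_count is a hypothetical stand-in for
# is.na(first_purch_dt) + is.na(last_purch_dt) +
# is.na(largest_purch_amt) + is.na(last_purch_amt)
na_count <- 0:4

# The modulo form used in checks: only 0 and 4 leave remainder 0
flagged <- na_count %% 4 != 0

# A more explicit equivalent when exactly four fields are counted
flagged_alt <- na_count > 0 & na_count < 4

print(data.frame(na_count, flagged, flagged_alt))
```

The modulo form only works because exactly four fields are counted. If a fifth field were added, the divisor would have to change with it, so the explicit range comparison may be easier to maintain.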

I know that I can loop over the rows of the conditions table and rbindlist the resulting subsets, for example:

err_obs <- 
  rbindlist(
    lapply(1:nrow(checks), function(i) {
      err_set <- cust[eval( parse(text= checks[i,cond_txt]) ) ,  ]
      cbind(err_set, 
            checks[i, .(err_id = rep.int(cond_id, times = nrow(err_set)),
                        err_msg = rep.int(cond_msg, times = nrow(err_set))
                        )]
            )                
    } )
  )
print(err_obs) # returns desired result

This seems to handle NAs correctly during evaluation.
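
The NA behaviour relied on here can be checked in isolation. A small sketch with a hypothetical three-row table (demo, first and last are illustration names, not part of the data above): a comparison involving NA yields NA, and data.table treats NA in a logical i as FALSE, so such rows are not returned.

```r
library(data.table)

# Hypothetical toy data: row 2 has a missing value
demo <- data.table(id    = 1:3,
                   first = c(1, NA, 5),
                   last  = c(0, 2, 9))

# The raw comparison is NA wherever an operand is NA
flag <- demo[, last < first]

# Used as i, the NA is treated as FALSE, so only id 1 is returned
hits <- demo[last < first]
print(hits)
```

One consequence is that a partially missing row is never reported by the two comparison checks, which is why the separate missing-value condition is needed to catch it.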

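When I say "what is the best way", I am asking:

Is this the best approach, or is there a more efficient or more idiomatic alternative to rbindlist(lapply(...))?
Are there any flaws in my current approach?
Can this be written as a merge or a join, e.g. cust inner join checks on eval(checks.condition(cust.values)) == TRUE?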

【Question comments】:

Tags: r data.table

【Solution 1】:

Here is how I would do it:

checks[, cust[eval(parse(text = cond_txt), .SD)][, err_msg := cond_msg], by = cond_id]

The only important part above is the presence of .SD; see this question for an explanation.

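【Comments】:

This works, thank you very much. It just took me a few minutes to understand why. If you agree, I may add a short explanation.
Sure, feel free.
If you don't mind a quick clarification, why is [, err_msg := cond_msg] needed? In other words, why is cond_msg dropped entirely unless we take this step, instead of simply being kept without renaming? Is it because the environment at that point is still cust, so we have to return it from checks explicitly?
Recall that checks[, j, by = cond_id] simply evaluates j for each cond_id. And j = cust[...] evaluates to a data.table that does not have that error/condition message column, so you need to add it. The above is one way to do that. Another would be by = .(cond_id, cond_msg).
Thanks, this answer really shows the power of data.table. I am not at all sure I could have done this with dplyr::inner_join or some other approach. I think I would have had to build a Cartesian product and then filter on eval().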
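
The comment about by = .(cond_id, cond_msg) can be illustrated with a stripped-down sketch. The tables checks_demo and vals below are hypothetical, and the filter is hard-coded instead of parsed from a string, so this only shows which columns come back, not the eval/.SD mechanics of the answer:

```r
library(data.table)

# Hypothetical stand-ins for checks and cust
checks_demo <- data.table(cond_id  = 1:2,
                          cond_msg = c("first message", "second message"))
vals <- data.table(val = 1:3)

# j is a data.table built from vals; only the by column is prepended
res_id <- checks_demo[, vals[val > 1], by = cond_id]

# Grouping by both columns carries the message along as well
res_both <- checks_demo[, vals[val > 1], by = .(cond_id, cond_msg)]

print(names(res_id))
print(names(res_both))
```

Grouping by both columns assumes cond_msg is constant within each cond_id, which holds when the conditions table has one row per condition.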