Applying a custom function on data.table instead of using plyr and ddply

【Question title】: Applying a custom function on data.table instead of using plyr and ddply
【Posted】: 2023-04-04 15:35:01
【Question description】:

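I'm working with a data.table called orderFlow and computing potentialWelfare.tmp as the output. So far the plyr-based approach below has been my solution, but since the input orderFlow has millions of rows, I would prefer a solution that takes advantage of data.table's performance in R.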

    # solution so far, poor performance on huge orderFlow input data.table
    require(plyr)
    potentialWelfare.tmp = ddply(orderFlow, 
                       .variables = c("simulationrun_id", "db"), 
                       .fun = calcPotentialWelfare, 
                       .progress = "text", 
                       .parallel=TRUE)

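Edit1: In short, the custom function counts the asks and bids in df, sorts the bids by valuation in descending order, and sums the valuations of the top NbAsks bids (or of all bids, if there are fewer bids than asks). The point is to pick the most valuable bids and add up their valuations. The code is legacy and probably not very efficient, but it works with plyr and plain data.frames.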

    calcPotentialWelfare <- function(df){
       NbAsks = dim(df[df$type=="ask",])[1]
    #   print(NbAsks)
      Bids = df[df$type == "bid",]
    #         dd[with(dd, order(-z, b)), ]
      Bids = Bids[with(Bids,order(valuation,decreasing = TRUE)),]
      NbBids = dim(df[df$type == "bid",])[1]
    #   print(Bids)
      if (NbAsks > 0){
        Bids = Bids[1:min(NbAsks,NbBids),]
        potentialWelfare = sum(Bids$valuation)
        return(potentialWelfare)
      }
      else{
        potentialWelfare = 0
        return(potentialWelfare)
      }
    }

Unfortunately, I couldn't find a working way to do this with data.table. Going by ?data.table and the corresponding FAQ, this is what I have so far:

    #   trying to use data.table, but it doesn't work so far.
    potentialWelfare.tmp = orderFlow[, lapply(.SD, calcPotentialWelfare), by = list(simulationrun_id, db),.SDcols=c("simulationrun_id", "db")]

What I get is

    Error in `[.data.frame`(orderFlow, , lapply(.SD, calcPotentialWelfare),  : unused arguments (by = list(simulationrun_id, db), .SDcols = c("simulationrun_id", "db"))

This is the input:

    > head(orderFlow)
      type  valuation price               dateCreation                    dateDue                dateMatched id
    1  ask 0.30000000   0.3 2012-01-01 00:00:00.000000 2012-01-01 00:30:00.000000 2012-01-01 00:01:01.098307  1
    2  bid 0.39687633   0.0 2012-01-01 00:01:01.098307 2012-01-01 00:10:40.024807 2012-01-01 00:01:01.098307  2
    3  bid 0.96803384    NA 2012-01-01 00:03:05.660811 2012-01-01 00:06:26.368941                       <NA>  3
    4  bid 0.06163186    NA 2012-01-01 00:05:25.413959 2012-01-01 00:09:06.189893                       <NA>  4
    5  bid 0.57017143    NA 2012-01-01 00:10:10.344876 2012-01-01 00:57:58.998516                       <NA>  5
    6  bid 0.37188442    NA 2012-01-01 00:11:25.761372 2012-01-01 00:43:24.274176                       <NA>  6
              created_at updated_at simulationrun_id db
    1 2013-12-10 14:37:29.065634         NA             7004  1
    2 2013-12-10 14:37:29.065674         NA             7004  1
    3 2013-12-10 14:37:29.065701         NA             7004  1
    4 2013-12-10 14:37:29.065726         NA             7004  1
    5 2013-12-10 14:37:29.065750         NA             7004  1
    6 2013-12-10 14:37:29.065775         NA             7004  1

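I'm expecting output like the following, i.e. the function calcPotentialWelfare aggregates the data in the "valuation" column of the data.table orderFlow in its own particular way.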

    > head(potentialWelfare.tmp)
      simulationrun_id db potentialWelfare
    1                1  1         16.86684
    2                2  1         18.44314
    3                4  1         16.86684
    4                5  1         18.44314
    5                7  1         16.86684
    6                8  1         18.44314

It would be great to see this solved. Thanks for reading!

Edit2:

    > dput(head(orderFlow))
    structure(list(type = c("ask", "bid", "bid", "bid", "bid", "bid"
    ), valuation = c(0.3, 0.39687632952068, 0.968033835246625, 0.0616318564942726, 
    0.570171430446081, 0.371884415116724), price = c(0.3, 0, NA, 
    NA, NA, NA), dateCreation = c("2012-01-01 00:00:00.000000", "2012-01-01 00:01:01.098307", 
    "2012-01-01 00:03:05.660811", "2012-01-01 00:05:25.413959", "2012-01-01 00:10:10.344876", 
    "2012-01-01 00:11:25.761372"), dateDue = c("2012-01-01 00:30:00.000000", 
    "2012-01-01 00:10:40.024807", "2012-01-01 00:06:26.368941", "2012-01-01 00:09:06.189893", 
    "2012-01-01 00:57:58.998516", "2012-01-01 00:43:24.274176"), 
        dateMatched = c("2012-01-01 00:01:01.098307", "2012-01-01 00:01:01.098307", 
        NA, NA, NA, NA), id = 1:6, created_at = c("2013-12-10 14:37:29.065634", 
        "2013-12-10 14:37:29.065674", "2013-12-10 14:37:29.065701", 
        "2013-12-10 14:37:29.065726", "2013-12-10 14:37:29.065750", 
        "2013-12-10 14:37:29.065775"), updated_at = c(NA_real_, NA_real_, 
        NA_real_, NA_real_, NA_real_, NA_real_), simulationrun_id = c(7004L, 
        7004L, 7004L, 7004L, 7004L, 7004L), db = c(1L, 1L, 1L, 1L, 
        1L, 1L)), .Names = c("type", "valuation", "price", "dateCreation", 
    "dateDue", "dateMatched", "id", "created_at", "updated_at", "simulationrun_id", 
    "db"), row.names = c(NA, 6L), class = "data.frame")

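【Question discussion】:

.SDcols names the columns the function is applied to. You are applying it to the columns you used as id variables in the ddply equivalent. I think you just need to change it to the columns you want the function applied to. Without seeing what your function does, it's hard to say.
Could you explain what your function is trying to do? Right now it seems to do a lot of things that aren't the data.table way (e.g. you're doing vector scans). If you explain what you're doing, your function could probably be improved quite a bit in both speed and compactness.
@PeterLustig Have you tried the dplyr package? It's faster than plyr and more SQL-like, so it's more intuitive.
@MartinBel: No, I haven't looked into it. Thanks for the tip, I'll consider it, but part of the fun is getting more comfortable with data.table.
@MartínBel, I see, thanks for the link. I'll have a look. Part of the issue is that dplyr is designed for a specific set of operations, so it's (relatively) easier to point out what it can do. data.table inherits from data.frame, and because it does much more, documenting every combination of usage of [..] would be exhausting. The documentation does get you started, though, and example(data.table) gives you a good idea of the magic that can be done. If you have any questions we can help with, let us know.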

Tags: r data.table plyr

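【Solution 1】:

I think this should be faster. There are a few mistakes in the way you're using data.table. I'd suggest reading through the introduction, going through the examples and reading the FAQ.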

    library(data.table)

    calcPotentialWelfare <- function(dt){
      NbAsks = nrow(dt["ask", nomatch=0L]) # binary search based subset/join - very fast
      Bids   = dt["bid", nomatch=0L] # binary search based subset/join - very fast
      NbBids = nrow(Bids)
      # for each 'type', the 'valuation' will always be sorted, 
      # but in ascending order - but you need descending order
      # so you can just use the function 'tail' to fetch the last 'n' items... as follows.
      if (NbAsks > 0) return(sum(tail(Bids, min(NbAsks, NbBids))$valuation))
      else return(0)
    }

    # the error in the question comes from `[.data.frame`, i.e. orderFlow is
    # still a plain data.frame - convert it to a data.table by reference first
    setDT(orderFlow)

    # setkey on 'type' column to use binary search based subset/join in the function
    # also on valuation so that we don't have to 'order' for every group 
    # inside the function - we can use 'tail'
    setkey(orderFlow, type, valuation) 
    # naming the result gives the 'potentialWelfare' column from the expected output
    potentialWelfare.tmp =
      orderFlow[, .(potentialWelfare = calcPotentialWelfare(.SD)), 
                by=.(simulationrun_id, db),
                .SDcols=c("type", "valuation")]

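.SD is a special variable: for each group it holds a data.table containing all the columns not mentioned in by= (when .SDcols is not specified). If .SDcols is specified, .SD is built for each group with only those columns, holding the rows that belong to that group.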
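To make the description of .SD concrete, here is a minimal sketch on a made-up table (the table DT and its columns g, type, value and note are invented for illustration and are not the asker's data). It simply reports which column names .SD carries in each group, with and without .SDcols.

```r
library(data.table)

# Toy table (made up for illustration): one grouping column, three payload columns.
DT <- data.table(
  g     = c("a", "a", "b"),
  type  = c("ask", "bid", "bid"),
  value = c(0.3, 0.9, 0.5),
  note  = c("x", "y", "z")
)

# Without .SDcols, .SD holds every column not named in by=.
all_cols <- DT[, .(sd_cols = paste(names(.SD), collapse = ",")), by = g]

# With .SDcols, .SD holds only the listed columns.
some_cols <- DT[, .(sd_cols = paste(names(.SD), collapse = ",")), by = g,
                .SDcols = c("type", "value")]

print(all_cols)
print(some_cols)
```

In the first result every group should list type, value and note; in the second only type and value.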

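Using lapply(.SD, ...) feeds the function one column at a time, which is not what you need. You need to pass the whole group's data to the function. However, since the function only needs the 'type' and 'valuation' columns, you can speed things up by supplying .SDcols=c('type', 'valuation'). Leaving out the other columns saves a lot of time.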
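The difference between the two call styles can be seen on a small example. The following is only a sketch under stated assumptions: the table DT, its numbers and the helper welfare() are made up for illustration, and welfare() is a key-free variant that sorts inside each group instead of using a keyed subset plus tail.

```r
library(data.table)

# Toy order book (made-up numbers): group 1 has 2 asks and 3 bids, group 2 has no asks.
DT <- data.table(
  grp       = c(1L, 1L, 1L, 1L, 1L, 2L, 2L),
  type      = c("ask", "ask", "bid", "bid", "bid", "bid", "bid"),
  valuation = c(0.3, 0.4, 0.2, 0.9, 0.5, 0.7, 0.1)
)

# Hypothetical helper: receives the whole per-group table, not a single column.
welfare <- function(sd) {
  n_asks <- sum(sd$type == "ask")
  bids   <- sort(sd$valuation[sd$type == "bid"], decreasing = TRUE)
  # Sum the n_asks most valuable bids (head() copes with fewer bids than asks).
  if (n_asks > 0) sum(head(bids, n_asks)) else 0
}

# f(.SD): one call per group, with both columns available to the function.
whole <- DT[, .(potentialWelfare = welfare(.SD)), by = grp,
            .SDcols = c("type", "valuation")]

# lapply(.SD, f): one call per column, so f only ever sees a single vector.
per_col <- DT[, lapply(.SD, function(col) class(col)), by = grp,
              .SDcols = c("type", "valuation")]

print(whole)
print(per_col)
```

Group 1 sums its two highest bids (0.9 and 0.5), while group 2 has no asks and gets 0. The lapply version can only report something about each column separately, which is why it cannot express a computation that needs 'type' and 'valuation' together.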

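【Discussion】:

Thanks! The key point was that I didn't understand what .SDcols really does. That's clear now. If I'm not mistaken, the whole .SDcols argument could be left out, at the cost of slower computation, right?
That's right. .SDcols is not strictly required. It's there to speed up the operation. Imagine a DT with 100 columns where the operation only needs 2 of them. We don't have to construct .SD for all 100.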
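As a quick check of that last point, here is a small sketch showing that dropping .SDcols leaves the result unchanged when the function only touches the columns it needs. The table DT, the column unused and the helper bid_total() are all made up for illustration.

```r
library(data.table)

# Toy data (made up): one grouping column, two columns the function needs, one it does not.
DT <- data.table(
  grp       = rep(1:2, each = 3),
  type      = c("ask", "bid", "bid", "bid", "bid", "ask"),
  valuation = c(0.3, 0.8, 0.6, 0.4, 0.7, 0.2),
  unused    = letters[1:6]
)

# Hypothetical stand-in for the per-group function: sum of the bid valuations.
bid_total <- function(sd) sum(sd$valuation[sd$type == "bid"])

with_cols    <- DT[, .(total = bid_total(.SD)), by = grp,
                   .SDcols = c("type", "valuation")]
without_cols <- DT[, .(total = bid_total(.SD)), by = grp]

# Same answer either way; .SDcols only narrows what .SD has to carry per group.
print(all.equal(with_cols, without_cols))
```

On a table this small there is nothing to gain in speed; the saving only shows up when .SD would otherwise have to be built from many unused columns.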