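【Posted】: 2023-04-04 15:35:01
【Question】:
I am working with a data.table called orderFlow and computing potentialWelfare.tmp as the output. So far, the plyr-based approach below has been my solution, but since the input orderFlow has millions of rows, I would prefer a solution that takes advantage of data.table's performance in R.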
# solution so far, poor performance on huge orderFlow input data.table
require(plyr)
potentialWelfare.tmp = ddply(orderFlow,
                             .variables = c("simulationrun_id", "db"),
                             .fun = calcPotentialWelfare,
                             .progress = "text",
                             .parallel = TRUE)
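Edit1: In short, the custom function checks whether df contains more bids or asks, and sums the valuations of the top NbAsks bids (sorted by valuation). This is done to select the most valuable bids and sum up their valuations. The code is legacy code and probably not very efficient, but it works in combination with plyr and plain data.frames.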
calcPotentialWelfare <- function(df){
  NbAsks = dim(df[df$type=="ask",])[1]
  # print(NbAsks)
  Bids = df[df$type == "bid",]
  # dd[with(dd, order(-z, b)), ]
  Bids = Bids[with(Bids,order(valuation,decreasing = TRUE)),]
  NbBids = dim(df[df$type == "bid",])[1]
  # print(Bids)
  if (NbAsks > 0){
    Bids = Bids[1:min(NbAsks,NbBids),]
    potentialWelfare = sum(Bids$valuation)
    return(potentialWelfare)
  }
  else{
    potentialWelfare = 0
    return(potentialWelfare)
  }
}
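To make the description above concrete, here is a small hand trace on the six sample rows from the question (one ask, five bids). It is only a sketch: the `toy` data frame and the variable names are made up for illustration, and the steps are written with plain vectors rather than by calling the legacy function.

```r
# Toy group copied from the question's sample rows (type and valuation only);
# the name `toy` is hypothetical.
toy <- data.frame(
  type      = c("ask", "bid", "bid", "bid", "bid", "bid"),
  valuation = c(0.30000000, 0.39687633, 0.96803384,
                0.06163186, 0.57017143, 0.37188442),
  stringsAsFactors = FALSE
)

# Step 1: count the asks in the group.
n_asks <- sum(toy$type == "ask")

# Step 2: sort the bid valuations from highest to lowest.
bids_sorted <- sort(toy$valuation[toy$type == "bid"], decreasing = TRUE)

# Step 3: keep at most n_asks of the best bids and sum their valuations;
# a group without any ask contributes 0.
welfare <- if (n_asks > 0) sum(head(bids_sorted, n_asks)) else 0

print(n_asks)   # one ask in this group
print(welfare)  # 0.96803384, the single highest bid
```

With one ask, only the best bid is counted, so the result for this group is just the highest bid valuation.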
Unfortunately, I couldn't find a working way to implement this with data.table. This is what I have so far, based on ?data.table and the corresponding FAQ:
# trying to use data.table, but it doesn't work so far.
potentialWelfare.tmp = orderFlow[, lapply(.SD, calcPotentialWelfare), by = list(simulationrun_id, db),.SDcols=c("simulationrun_id", "db")]
What I get is:
Error in `[.data.frame`(orderFlow, , lapply(.SD, calcPotentialWelfare), : unused arguments (by = list(simulationrun_id, db), .SDcols = c("simulationrun_id", "db"))
Here is the input:
> head(orderFlow)
type valuation price dateCreation dateDue dateMatched id
1 ask 0.30000000 0.3 2012-01-01 00:00:00.000000 2012-01-01 00:30:00.000000 2012-01-01 00:01:01.098307 1
2 bid 0.39687633 0.0 2012-01-01 00:01:01.098307 2012-01-01 00:10:40.024807 2012-01-01 00:01:01.098307 2
3 bid 0.96803384 NA 2012-01-01 00:03:05.660811 2012-01-01 00:06:26.368941 <NA> 3
4 bid 0.06163186 NA 2012-01-01 00:05:25.413959 2012-01-01 00:09:06.189893 <NA> 4
5 bid 0.57017143 NA 2012-01-01 00:10:10.344876 2012-01-01 00:57:58.998516 <NA> 5
6 bid 0.37188442 NA 2012-01-01 00:11:25.761372 2012-01-01 00:43:24.274176 <NA> 6
created_at updated_at simulationrun_id db
1 2013-12-10 14:37:29.065634 NA 7004 1
2 2013-12-10 14:37:29.065674 NA 7004 1
3 2013-12-10 14:37:29.065701 NA 7004 1
4 2013-12-10 14:37:29.065726 NA 7004 1
5 2013-12-10 14:37:29.065750 NA 7004 1
6 2013-12-10 14:37:29.065775 NA 7004 1
I expect output like the following, i.e. the function calcPotentialWelfare aggregates the data from the "valuation" column of the data.table orderFlow in a particular way.
> head(potentialWelfare.tmp)
simulationrun_id db potentialWelfare
1 1 1 16.86684
2 2 1 18.44314
3 4 1 16.86684
4 5 1 18.44314
5 7 1 16.86684
6 8 1 18.44314
I would be glad to see this solved. Thanks for reading!
Edit2:
> dput(head(orderFlow))
structure(list(type = c("ask", "bid", "bid", "bid", "bid", "bid"
), valuation = c(0.3, 0.39687632952068, 0.968033835246625, 0.0616318564942726,
0.570171430446081, 0.371884415116724), price = c(0.3, 0, NA,
NA, NA, NA), dateCreation = c("2012-01-01 00:00:00.000000", "2012-01-01 00:01:01.098307",
"2012-01-01 00:03:05.660811", "2012-01-01 00:05:25.413959", "2012-01-01 00:10:10.344876",
"2012-01-01 00:11:25.761372"), dateDue = c("2012-01-01 00:30:00.000000",
"2012-01-01 00:10:40.024807", "2012-01-01 00:06:26.368941", "2012-01-01 00:09:06.189893",
"2012-01-01 00:57:58.998516", "2012-01-01 00:43:24.274176"),
dateMatched = c("2012-01-01 00:01:01.098307", "2012-01-01 00:01:01.098307",
NA, NA, NA, NA), id = 1:6, created_at = c("2013-12-10 14:37:29.065634",
"2013-12-10 14:37:29.065674", "2013-12-10 14:37:29.065701",
"2013-12-10 14:37:29.065726", "2013-12-10 14:37:29.065750",
"2013-12-10 14:37:29.065775"), updated_at = c(NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_), simulationrun_id = c(7004L,
7004L, 7004L, 7004L, 7004L, 7004L), db = c(1L, 1L, 1L, 1L,
1L, 1L)), .Names = c("type", "valuation", "price", "dateCreation",
"dateDue", "dateMatched", "id", "created_at", "updated_at", "simulationrun_id",
"db"), row.names = c(NA, 6L), class = "data.frame")
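One thing worth checking: the `dput` output above ends with `class = "data.frame"`, and the error is raised by `[.data.frame`, so the object may not be a data.table at the point of the call. A minimal sketch of converting it and then calling a per-group function on `.SD` follows. The small `orderFlow` stand-in and the helper `countBids` are hypothetical and only illustrate the shape of the call.

```r
library(data.table)

# Hypothetical stand-in for orderFlow: a plain data.frame with two groups.
orderFlow <- data.frame(
  simulationrun_id = c(7004L, 7004L, 7004L, 7005L, 7005L),
  db               = c(1L, 1L, 1L, 1L, 1L),
  type             = c("ask", "bid", "bid", "bid", "ask"),
  valuation        = c(0.3, 0.4, 0.9, 0.5, 0.6),
  stringsAsFactors = FALSE
)

is.data.table(orderFlow)  # FALSE: `[` dispatches to `[.data.frame` here
setDT(orderFlow)          # convert to a data.table by reference
is.data.table(orderFlow)  # TRUE: `by =` is now understood

# Hypothetical helper written in the ddply style: it takes one group's rows.
countBids <- function(df) sum(df$type == "bid")

# .SD is the subset of rows for the current group (grouping columns excluded),
# so the helper is called once per (simulationrun_id, db) group.
bidsPerGroup <- orderFlow[, .(nBids = countBids(.SD)),
                          by = .(simulationrun_id, db)]
print(bidsPerGroup)
```

Note that `lapply(.SD, f)` calls `f` once per column of `.SD`, whereas `f(.SD)` passes the whole group subset, which is what a function written for `ddply` expects.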
【Comments】:
-
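.SDcols are the columns the function is applied to. You are applying it to the columns that you use as id variables in the ddply equivalent. I think you just need to change it to the columns you want the function applied to. Without seeing what your function does, it's hard to say.
Could you explain what your function is trying to do? At the moment it seems to do a lot of things that are not the data.table way (e.g. you're doing vector scans). If you explain what you're doing, the function could probably be improved considerably in both speed and code compactness.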
-
@PeterLustig Have you tried the dplyr package? It is faster than plyr and more SQL-like, and therefore more intuitive.
-
@MartinBel: No, I haven't looked into it. Thanks for the hint, I'll consider it, but part of the fun is getting a bit more comfortable with data.table.
-
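@MartínBel, I see, thanks for the link. I'll take a look. The issue is also that dplyr is designed for a specific set of operations, so it is (relatively) easier to point out what it does. data.table inherits from data.frame, and because it does more, documenting every combination of uses of [..] would be an exhaustive task. However, the documentation can get you started, and example(data.table) gives you a good idea of the magic that can be done. If you have any questions we can help with, please let us know.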
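Picking up the comments about `.SDcols` and vector scans, here is a minimal sketch of how the same aggregation could be written directly in `j`, so that only the `type` and `valuation` columns are touched per group. The toy `orderFlow` below is made up for illustration, and the expression is one possible restatement of the logic described in Edit1, not a drop-in replacement for the legacy function.

```r
library(data.table)

# Hypothetical toy input with three groups:
#   run 1: two asks, three bids  -> sum of the two best bids
#   run 2: no asks, two bids     -> 0
#   run 3: one ask, no bids      -> 0 (sum over an empty set)
orderFlow <- data.table(
  simulationrun_id = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 3L),
  db               = 1L,
  type             = c("ask", "ask", "bid", "bid", "bid", "bid", "bid", "ask"),
  valuation        = c(0.3, 0.4, 0.2, 0.9, 0.5, 0.7, 0.1, 0.3)
)

potentialWelfare.tmp <- orderFlow[
  , {
    n_asks <- sum(type == "ask")
    # sort() drops NA valuations by default.
    bids <- sort(valuation[type == "bid"], decreasing = TRUE)
    list(potentialWelfare = if (n_asks > 0) sum(head(bids, n_asks)) else 0)
  },
  by = .(simulationrun_id, db)
]
print(potentialWelfare.tmp)
```

Two behaviours differ from the legacy code and should be checked against the intended semantics: `NA` valuations are dropped by `sort()`, and a group with asks but no bids yields 0 here, whereas in the legacy function `1:min(NbAsks, NbBids)` becomes `1:0` in that case. The sketch also assumes `valuation` is a double column, so that every group returns the same type.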
Tags: r data.table plyr