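【Posted】: 2020-11-03 21:22:47
【Question】:
Update: I cut the code down to its key elements to make it shorter.
The function impact_calc is very slow (26 seconds on a data frame with 100,000 records). I think the main cause is the for loop (maybe apply or map would help?). Below I simulate the data, define the impact_calc function, and record the run time.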
library(dplyr)
library(data.table)
library(reshape2)
###########################################################
# Start Simulate Data
###########################################################
BuySell <- function(m = 40, s = 4) {
  S <- pmax(round(rnorm(10, m, s), 2), 0)
  S.sorted <- sort(S)
  data.frame(buy = rev(head(S.sorted, 5)), sell = tail(S.sorted, 5))
}
number_sates <- 10000
lst <- replicate(number_sates, BuySell(), simplify = FALSE)
# assemble prices data frame
prices <- as.data.frame(data.table::rbindlist(lst))
prices$state_id <- rep(1:number_sates, each = 5)
prices$level <- rep(1:5, times = number_sates)
prices$quantities <- round(runif(number_sates * 5, 100000, 1000000), 0)
# reshape to long format
prices_long <- reshape2::melt(prices,
  id.vars = c("state_id", "quantities", "level"),
  value.name = "price"
) %>%
  rename("side" = "variable") %>%
  setDT()
###########################################################
# End Simulate Data
###########################################################
Here is the very slow function, impact_calc.
##########################################################
# function to optimize
impact_calc <- function(data, required_quantity) {
  # get best buy and sell (first level-1 price on each side)
  best_buy <- data[side == "buy" & level == 1, price][1]
  best_sell <- data[side == "sell" & level == 1, price][1]
  # calculate mid
  mid <- 0.5 * (best_buy + best_sell)
  # buys
  remaining_qty <- required_quantity
  impact <- 0
  data_buy <- data[side == "buy"]
  levels <- data_buy$level
  # I think this for loop is slow!
  # Note: level doubles as a row index, so rows must be ordered by level (1, 2, ...)
  for (level in levels) {
    price_difference <- mid - data_buy$price[level]
    if (data_buy$quantities[level] >= remaining_qty) {
      # this level can fill everything that is still needed
      impact <- impact + remaining_qty * price_difference
      remaining_qty <- 0
      break
    } else {
      # consume the whole level and move on to the next one
      impact <- impact + data_buy$quantities[level] * price_difference
      remaining_qty <- remaining_qty - data_buy$quantities[level]
    }
  }
  rel_impact <- impact / required_quantity / mid
  # return a one-element list so data.table creates a relative_impact column
  list(relative_impact = rel_impact)
}
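The level-by-level fill in the loop above can also be written without a loop, using a cumulative sum of the quantities. The following is only a minimal sketch, assuming the buy levels are passed in best-first order; the helper name fill_impact and the two-level example numbers are made up for illustration and are not part of the original code.

```r
# Hypothetical helper (not from the original post): vectorized version of the
# level-by-level fill performed by the for loop in impact_calc.
fill_impact <- function(price, quantities, mid, required_quantity) {
  # quantity already consumed before each level is reached
  taken_before <- cumsum(quantities) - quantities
  # quantity filled at each level: what is still needed, capped by the level's depth
  filled <- pmin(quantities, pmax(required_quantity - taken_before, 0))
  impact <- sum(filled * (mid - price))
  impact / required_quantity / mid
}

# Made-up example: 100 units at 40, then 200 units at 39, with mid = 40.5.
# Filling 250 units takes 100 from the first level and 150 from the second.
example_impact <- fill_impact(
  price = c(40, 39),
  quantities = c(100, 200),
  mid = 40.5,
  required_quantity = 250
)
```

Like the loop, this leaves any quantity that exceeds the total depth unfilled and still divides by required_quantity.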
Timing results:
start_time <- Sys.time()
impact_buys <- prices_long[, impact_calc(.SD, 600000), by = .(state_id)]
end_time <- Sys.time()
end_time - start_time
# for the 100,000-row data frame this takes:
# Time difference of 26.54057 secs
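It has not been measured here whether the for loop or the repeated data.table subsetting inside every group accounts for most of those 26 seconds. One way to sidestep both is to compute the mid price and the fills for all states in a few grouped operations instead of calling a function on .SD per state. The following is a sketch under assumptions: the small order book `book` is invented for illustration, and each state is assumed to have exactly one level-1 row per side.

```r
library(data.table)

# Invented two-state order book with the same columns as prices_long
book <- data.table(
  state_id   = rep(1:2, each = 4),
  side       = rep(c("buy", "buy", "sell", "sell"), times = 2),
  level      = rep(c(1, 2, 1, 2), times = 2),
  price      = c(40, 39, 41, 42, 38, 37, 40, 41),
  quantities = c(100, 200, 150, 150, 300, 100, 100, 100)
)
required_quantity <- 250

# mid price per state from the level-1 buy and sell prices
best <- book[level == 1,
  .(mid = 0.5 * (price[side == "buy"][1] + price[side == "sell"][1])),
  by = state_id
]

# attach mid to the buy side and make sure levels are in best-first order
buys <- book[side == "buy"][best, on = "state_id"]
setorder(buys, state_id, level)

# quantity filled at each level: what is still needed, capped by the level's depth
buys[, filled := pmin(
  quantities,
  pmax(required_quantity - (cumsum(quantities) - quantities), 0)
), by = state_id]

# relative impact per state, same formula as in impact_calc
result <- buys[,
  .(relative_impact = sum(filled * (mid - price)) / required_quantity / mid[1]),
  by = state_id
]
```

The same Sys.time() bracket used above can be placed around these steps on prices_long to compare run times.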
Thanks for your help!
【Comments】:
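-
TL;DR. One tip, though: use the equivalent functions from data.table instead of reshape2 or dplyr.
-
@sindri_baldur - Thanks for your comment! I trimmed my code to make it shorter. I use reshape2 only to simulate the data.frame so that my question is reproducible. The data simulation part does not need optimizing, only the function impact_calc. And I think the for loop is the bottleneck.
-
@sindri_baldur - Also, I changed best_buy and best_sell from dplyr filtering to data.table, which cut the run time by 1/3! But it still takes 12 seconds on this small data set.
-
I also changed all remaining dplyr elements in impact_calc to data.table, which helped. I have now increased the data.frame to 100K records, and it takes 26 seconds (the for loop is slow).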
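For reference, the tip in the first comment about data.table equivalents could look roughly like this for the reshaping step. This is a sketch on a made-up two-row table, not the original simulation: data.table's own melt method takes a variable.name argument, so the separate dplyr rename and the setDT call are not needed.

```r
library(data.table)

# Made-up wide table with the same columns as prices
prices <- data.table(
  state_id   = c(1L, 1L),
  level      = 1:2,
  quantities = c(500, 700),
  buy        = c(40, 39),
  sell       = c(41, 42)
)

# data.table's melt names the key column directly, replacing melt + rename + setDT
prices_long <- melt(prices,
  id.vars = c("state_id", "quantities", "level"),
  variable.name = "side",
  value.name = "price"
)
```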
Tags: r for-loop data.table