【Posted】: 2020-03-29 19:39:54
【Question】:
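I have a large data frame that I want to aggregate by two different ids. Different columns follow different aggregation rules, and I would like to write compact code for the aggregation (the dataset also contains many useless variables that I don't need in the result). I made a toy example that aggregates my data with dplyr::group_by:
library(dplyr)
n <- 10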
df <- data.frame(
  id1 = sample(c("a", "b"), n, TRUE), id2 = sample(c("c", "d"), n, TRUE), # variables with IDs
  var_sum1 = rnorm(n, 0, 1), var_sum2 = rnorm(n, 5, 1),                   # variables to sum
  var_mean1 = rnorm(n, 10, 1), var_mean2 = rnorm(n, 15, 1),               # variables to average
  var_weighted_mean = rnorm(n, 0, 1),                                     # variables to weight-average
  weight = sample(c(1, 2), n, TRUE),                                      # weight
  var_useless_1 = 1, var_useless_n = 1                                    # useless variables to throw away
)
final_dplyr <- df %>%
  group_by(id1, id2) %>%
  summarise(var_sum1 = sum(var_sum1),
            var_sum2 = sum(var_sum2),
            var_mean1 = mean(var_mean1),
            var_mean2 = mean(var_mean2),
            var_weighted_mean = weighted.mean(var_weighted_mean, weight))
Now I would like to define, in vectors, which variables follow each rule:
ids <- c("id1", "id2")
summing <- c("var_sum1", "var_sum2")
averaging <- c("var_mean1", "var_mean2")
wght_average <- c("var_weighted_mean")
Each vector will contain roughly 20 variable names, so aggregating them "by hand" as I did in the dplyr toy example would be tedious.
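For reference, a minimal sketch of how the name vectors could drive the dplyr version directly. It assumes dplyr 1.0.0 or later (for `across()`), and the `set.seed()` call and the name `final_across` are added only for this illustration:

```r
library(dplyr)

set.seed(1)  # assumption: seed added only to make the sketch reproducible
n <- 10
df <- data.frame(
  id1 = sample(c("a", "b"), n, TRUE), id2 = sample(c("c", "d"), n, TRUE),
  var_sum1 = rnorm(n, 0, 1), var_sum2 = rnorm(n, 5, 1),
  var_mean1 = rnorm(n, 10, 1), var_mean2 = rnorm(n, 15, 1),
  var_weighted_mean = rnorm(n, 0, 1),
  weight = sample(c(1, 2), n, TRUE),
  var_useless_1 = 1, var_useless_n = 1
)

ids <- c("id1", "id2")
summing <- c("var_sum1", "var_sum2")
averaging <- c("var_mean1", "var_mean2")
wght_average <- c("var_weighted_mean")

# One across() call per rule; columns not named in any vector are dropped.
final_across <- df %>%
  group_by(across(all_of(ids))) %>%
  summarise(
    across(all_of(summing), sum),
    across(all_of(averaging), mean),
    across(all_of(wght_average), ~ weighted.mean(.x, weight)),
    .groups = "drop"
  )
```

The `weight` column is looked up within each group, so it should not appear in any of the vectors being summarised.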
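Can I do this with the data.table package? Other solutions are welcome too, but since I am currently learning this package, I would really appreciate a data.table solution.
I was thinking of something like this (but since I am new to data.table, it may be completely wrong):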
library(data.table)
dt <- as.data.table(df)
# the following call does not work
dt[ , .(summing, averaging, wght_average) := list(lapply(.SD[, .(summing)], sum),
                                                  lapply(.SD[, .(averaging)], mean),
                                                  lapply(.SD[, .(wght_average)], function(x) weighted.mean(x, weight))),
    by = .(ids),
    .SDcols = .(summing, averaging, wght_average)]
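For comparison, here is a minimal sketch of one way the same idea might be expressed in data.table. It is only one possible approach, not necessarily the most efficient one: it returns a new aggregated table instead of assigning with `:=`, passes the character vectors directly to `by` and `.SDcols`, and combines the per-rule results with `c()`. The `set.seed()` call and the name `final_dt` are assumptions added for illustration.

```r
library(data.table)

set.seed(1)  # assumption: seed added only to make the sketch reproducible
n <- 10
dt <- data.table(
  id1 = sample(c("a", "b"), n, TRUE), id2 = sample(c("c", "d"), n, TRUE),
  var_sum1 = rnorm(n, 0, 1), var_sum2 = rnorm(n, 5, 1),
  var_mean1 = rnorm(n, 10, 1), var_mean2 = rnorm(n, 15, 1),
  var_weighted_mean = rnorm(n, 0, 1),
  weight = sample(c(1, 2), n, TRUE),
  var_useless_1 = 1, var_useless_n = 1
)

ids <- c("id1", "id2")
summing <- c("var_sum1", "var_sum2")
averaging <- c("var_mean1", "var_mean2")
wght_average <- c("var_weighted_mean")

# j builds one named list per rule and concatenates them with c();
# `weight` is visible in j as the current group's column even though
# it is not listed in .SDcols.
final_dt <- dt[, c(
  lapply(.SD[, summing, with = FALSE], sum),
  lapply(.SD[, averaging, with = FALSE], mean),
  lapply(.SD[, wght_average, with = FALSE], weighted.mean, w = weight)
), by = ids, .SDcols = c(summing, averaging, wght_average)]
```

Because `:=` adds or updates columns in place and keeps every row, it does not fit an aggregation that should return one row per group. A plain `j` expression with `by` does that, and the useless columns drop out because they are never referenced.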
Thanks for your help!
【Comments】:
Tags: r data.table aggregate