【Question title】: R multidplyr: is there a workaround for summarise_at?
【Posted】: 2020-07-25 13:12:30
【Question description】:

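I want to use multidplyr, which does not support summarise_at yet. I have hundreds or even thousands of variables to summarise, so I need summarise_at, but unfortunately it is not available in multidplyr.

I am looking for an alternative way to get around this.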

library('tidyverse')
df <- tibble(ID = c('a','a','b','c','c','e','e','f','g','g'),
              var1 = floor(runif(10, min=0, max=100)),
              var2 = floor(runif(10, min=0, max=100)),
              var3 = floor(runif(10, min=0, max=100)),
              var4 = floor(runif(10, min=0, max=100))
              )

library('multidplyr')
cluster <- new_cluster(5)

#works
df %>% 
  group_by(ID) %>% 
  #partition(cluster) %>% 
  summarise_at(.vars = vars(starts_with('var')),sum) 
  #collect()

#works
df %>% 
  group_by(ID) %>% 
  partition(cluster) %>% 
  summarise(var1 = sum(var1),
            var2 = sum(var2),
            var3 = sum(var3)) %>% 
  collect()

#does not work
df %>% 
  group_by(ID) %>% 
  partition(cluster) %>%
  summarise_at(.vars = vars(starts_with('var')),sum)  %>% 
  collect()

I even tried building the summarise call from a string:

#Define character string vector to replace command line
sum_var <- select(df,starts_with('var')) %>% names()
sum_var_str <- paste0(sum_var," = sum(",sum_var,")")
sum_var_str <- str_c(sum_var_str, collapse = ", ")
> sum_var
[1] "var1" "var2" "var3" "var4"
> sum_var_str
[1] "var1 = sum(var1), var2 = sum(var2), var3 = sum(var3), var4 = sum(var4)"

#works
df %>% 
  group_by(ID) %>% 
  { eval(parse(text = sprintf("summarise(., %s, .groups = 'drop')", sum_var_str))) }

#does not work
df %>% 
  group_by(ID) %>% 
  partition(cluster) %>%
  { eval(parse(text = sprintf("summarise(., %s, .groups = 'drop')", sum_var_str))) } %>%
  collect()

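【Comments】:

  • How is this question different from your previous one? stackoverflow.com/questions/63088146/… You want something that works with multidplyr, right?
  • The workaround I thought was viable is not. That solution works without multidplyr, but not in the multidplyr setting I need for big data.
  • @JimmyR: Have you tried tidytable? github.com/markfairbanks/tidytable
  • @tung Nice, thanks. I will check it out. I saw the benchmarks, and it is much faster than the tidyverse. I am still keen on parallel multi-core processing if possible.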

Tags: r dplyr multidplyr


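【Solution 1】:

I found a solution: collect the column names in a vector, assign that vector to the cluster, load dplyr on the workers, and use summarise(across(...)) in place of summarise_at.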

library('dplyr')
library('multidplyr')
library('parallel')
cluster <- new_cluster(detectCores())

df <- tibble(ID = c('a','a','b','c','c','e','e','f','g','g'),
             var1 = floor(runif(10, min=0, max=100)),
             var2 = floor(runif(10, min=0, max=100)),
             var3 = floor(runif(10, min=0, max=100)),
             var4 = floor(runif(10, min=0, max=100))
)

sum_var <- select(df,starts_with('var')) %>% names()

#assign vector to cluster
cluster_assign(cluster, sum_var = sum_var)
cluster_library(cluster, 'dplyr')

df %>% 
  group_by(ID) %>% 
  partition(cluster) %>% 
  summarise(across(all_of(sum_var), sum)) %>% 
  collect()

# A tibble: 6 x 5
  ID     var1  var2  var3  var4
  <chr> <dbl> <dbl> <dbl> <dbl>
1 a        57    72    85   118
2 b        46    50    80    33
3 c        82   156    96   154
4 e       122   107    93   120
5 f        33     7    49    36
6 g        99    79    83    56
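
The approach above can be checked against the serial summarise_at result from the question. The following is a minimal sketch, assuming dplyr >= 1.0 and multidplyr are installed. The fixed seed, the two-worker cluster, and the names serial and parallel_res are illustrative assumptions, not part of the original answer. arrange(ID) is added on both sides because rows collected from the workers may not come back in the original group order.

```r
library(dplyr)
library(multidplyr)

set.seed(1)  # assumption: fixed seed so the example data is reproducible
df <- tibble(ID = c('a','a','b','c','c','e','e','f','g','g'),
             var1 = floor(runif(10, min = 0, max = 100)),
             var2 = floor(runif(10, min = 0, max = 100)),
             var3 = floor(runif(10, min = 0, max = 100)),
             var4 = floor(runif(10, min = 0, max = 100)))

# Column names to summarise, resolved once in the main session
sum_var <- df %>% select(starts_with('var')) %>% names()

# Serial reference: the scoped verb from the question, without multidplyr
serial <- df %>%
  group_by(ID) %>%
  summarise_at(vars(starts_with('var')), sum) %>%
  arrange(ID)

# Parallel version: each worker needs dplyr attached and its own copy of sum_var
cluster <- new_cluster(2)  # assumption: two workers are enough for a demo
cluster_library(cluster, 'dplyr')
cluster_assign(cluster, sum_var = sum_var)

parallel_res <- df %>%
  group_by(ID) %>%
  partition(cluster) %>%
  summarise(across(all_of(sum_var), sum)) %>%
  collect() %>%
  arrange(ID)

# Compare as plain data frames so tibble-specific attributes are ignored
all.equal(as.data.frame(serial), as.data.frame(parallel_res))
```

Passing the names through all_of(sum_var) means the column selection is decided in the main session and the workers only receive a plain character vector, which is why cluster_assign is needed before the summarise call.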
