【问题标题】:R summarise_at dynamically by condition : mean for some columns, sum for othersR summarise_at 动态地按条件:某些列的平均值,其他列的总和
【发布时间】:2020-02-19 18:20:46
【问题描述】:

我愿意,但条件是summarise_at()

编辑#1:我在标题中动态添加了这个词:当我在summarise_at() 中使用vars(c()) 时,它是为了快速清晰的示例,但实际上它是为了使用@ 987654324@、starts_with()matches(,, perl=TRUE),因为我有 50 列,其中有很多 sum() 和一些 mean()

目标是使用tbl()..%>% group_by() ... %>% summarise_at()...%>% collect() 生成动态SQL。

编辑 #2:我添加了第二个示例中生成的 SQL 示例

library(tidyverse)
(mtcars 
  %>% group_by(carb)
  %>% summarise_at(vars(c("mpg","cyl","disp")), list (~mean(.),~sum(.)))
  # I don't want this line below, I would like a conditional in summarise_at() because I have 50 columns in my real case
  %>% select(carb,cyl_mean,disp_mean,mpg_sum)
)
#> # A tibble: 6 x 4
#>    carb cyl_mean disp_mean mpg_sum
#>   <dbl>    <dbl>     <dbl>   <dbl>
#> 1     1     4.57      134.   177. 
#> 2     2     5.6       208.   224  
#> 3     3     8         276.    48.9
#> 4     4     7.2       309.   158. 
#> 5     6     6         145     19.7
#> 6     8     8         301     15

Created on 2020-02-19 by the reprex package (v0.3.0)

这行得通,但我只想为 mpg 求和,只为 cyl 和 disp 表示:

library(RSQLite)
library(dbplyr)
library(tidyverse)
library(DBI)

db <- dbConnect(SQLite(),":memory:")

dbCreateTable(db, "mtcars_table", mtcars)

(tbl( db, build_sql( con=db,"select * from mtcars_table" ))
  %>% group_by(carb)
  %>% summarise_at(vars(c("mpg","cyl","disp")), list (~mean(.),~sum(.)))
  %>% select(carb,cyl_mean,disp_mean,mpg_sum)                   
  %>% show_query()
)
#> <SQL>
#> Warning: Missing values are always removed in SQL.[...]  to silence this warning
#> SELECT `carb`, `cyl_mean`, `disp_mean`, `mpg_sum`
#> FROM (SELECT `carb`, AVG(`mpg`) AS `mpg_mean`, AVG(`cyl`) AS `cyl_mean`, AVG(`disp`) AS `disp_mean`, SUM(`mpg`) AS `mpg_sum`, SUM(`cyl`) AS `cyl_sum`, SUM(`disp`) AS `disp_sum`
#> FROM (select * from mtcars_table)
#> GROUP BY `carb`)
#> # Source:   lazy query [?? x 4]
#> # Database: sqlite 3.30.1 [:memory:]
#> # … with 4 variables: carb <dbl>, cyl_mean <lgl>, disp_mean <lgl>,
#> #   mpg_sum <lgl>

我尝试了所有类似的可能性,但它不起作用或产生错误。

(mtcars %>% group_by(carb)%>% summarise_at(vars(c("mpg","cyl","disp")),ifelse(vars(contains(names(.),"mpg")),list(sum(.)),list(mean(.)))) )

不好,列太多

library(tidyverse)
(mtcars %>% group_by(carb)%>% summarise_at(vars(c("mpg","cyl","disp")),ifelse ((names(.)=="mpg"), list(~sum(.)) , list(~mean(.)))))
#> # A tibble: 6 x 34
#>    carb mpg_sum cyl_sum disp_sum mpg_mean..2 cyl_mean..2 disp_mean..2
#>   <dbl>   <dbl>   <dbl>    <dbl>       <dbl>       <dbl>        <dbl>
#> 1     1   177.       32     940.        25.3        4.57         134.
#> 2     2   224        56    2082.        22.4        5.6          208.
#> 3     3    48.9      24     827.        16.3        8            276.
#> 4     4   158.       72    3088.        15.8        7.2          309.
#> 5     6    19.7       6     145         19.7        6            145 
#> 6     8    15         8     301         15          8            301 
#> # … with 27 more variables: mpg_mean..3 <dbl>, cyl_mean..3 <dbl>,
#> #   disp_mean..3 <dbl>, mpg_mean..4 <dbl>, cyl_mean..4 <dbl>,
#> #   disp_mean..4 <dbl>, mpg_mean..5 <dbl>, cyl_mean..5 <dbl>,
#> #   disp_mean..5 <dbl>, mpg_mean..6 <dbl>, cyl_mean..6 <dbl>,
#> #   disp_mean..6 <dbl>, mpg_mean..7 <dbl>, cyl_mean..7 <dbl>,
#> #   disp_mean..7 <dbl>, mpg_mean..8 <dbl>, cyl_mean..8 <dbl>,
#> #   disp_mean..8 <dbl>, mpg_mean..9 <dbl>, cyl_mean..9 <dbl>,
#> #   disp_mean..9 <dbl>, mpg_mean..10 <dbl>, cyl_mean..10 <dbl>,
#> #   disp_mean..10 <dbl>, mpg_mean..11 <dbl>, cyl_mean..11 <dbl>,
#> #   disp_mean..11 <dbl>

其他一些尝试和评论:我想要有条件的sum(.)mean(.),具体取决于summarise() 中的列名。

如果它不仅接受原始函数,那就太好了。

最后由tbl()..%&gt;% group_by() ... %&gt;% summarise_at()...%&gt;% collect() 使用AVG()SUM() 生成条件SQL。

~(convert(varchar()) 这样的T-SQL 函数适用于mutate_at() 和类似的~AVG() 适用于summarise_at() 但我到达了同一点:有条件的summarise_at() 不起作用,具体取决于列的名称。

:)

【问题讨论】:

    标签: sql r sql-server dplyr dbplyr


    【解决方案1】:

    一个选项是 group_by 'carb',然后创建 'mpg' 的 sum 作为另一个分组变量,然后将 summarise_at 与所需的其余变量一起使用

    library(dplyr)
    mtcars %>%
        group_by(carb) %>%
        group_by(mpg_sum = sum(mpg), .add = TRUE) %>%
        summarise_at(vars(cyl, disp), list(mean = mean))
    # A tibble: 6 x 4
    # Groups:   carb [6]
    #   carb mpg_sum cyl_mean disp_mean
    #  <dbl>   <dbl>    <dbl>     <dbl>
    #1     1   177.      4.57      134.
    #2     2   224       5.6       208.
    #3     3    48.9     8         276.
    #4     4   158.      7.2       309.
    #5     6    19.7     6         145 
    #6     8    15       8         301 
    

    或者使用dplyrdevel 版本,这可以在单个summarise 中完成,方法是将across 中的列块和单个列单独包装并在其上应用不同的功能

    mtcars %>%
      group_by(carb) %>% 
      summarise(across(one_of(c("cyl", "disp")), list(mean = mean)), 
                mpg_sum = sum(mpg))
    # A tibble: 6 x 4
    #   carb cyl_mean disp_mean mpg_sum
    #  <dbl>    <dbl>     <dbl>   <dbl>
    #1     1     4.57      134.   177. 
    #2     2     5.6       208.   224  
    #3     3     8         276.    48.9
    #4     4     7.2       309.   158. 
    #5     6     6         145     19.7
    #6     8     8         301     15  
    

    注意:summarise_at/summarise_if/mutate_at/mutate_if/... 等将在即将发布的版本中被具有默认功能 (summarise/mutate/filter/...) 的 across 动词取代

    【讨论】:

    • 明天我将在我的真实案例中尝试这两种方法,包括使用 Mssql 生成 :)
    • 我还没有尝试过,但您的解决方案 1 对我来说有欠缺:):第一个分组函数 sum(mpg) 不是动态的。当我使用vars(c()) 时,它是用于快速清晰的示例,但实际上它是用于使用contains()Starts_With()matches(,, perl=TRUE),因为我有50 列,其中有很多sum() 和一些mean()。如果我不能使用这些动态函数,我会用paste0() 编写SQL。 :)
    • @phili_b 和summarise_at,您想要获取输出的方式存在限制。我认为一旦开发版本与 sql 集成,cross 就可以工作了
    • @phili_b 那么最好在 dplyr github 上请求,因为他们决定了
    【解决方案2】:

    使用正则表达式等待across() 的解决方法

    library(RSQLite)
    library(dbplyr)
    library(tidyverse)
    library(DBI)
    
    db <- dbConnect(SQLite())
    
    mtcars_table <- mtcars %>% rename(mpg_sum=mpg,cyl_mean=cyl,disp_mean=disp )
    
    RSQLite::dbWriteTable(db, "mtcars_table", mtcars_table)
    
    req<-as.character((tbl( db, build_sql( con=db,"select * from mtcars_table" ))
                       %>% group_by(carb)
                       %>% summarise_at(vars(c(ends_with("mean"), ends_with("sum")) ), ~sum(.))
    
    ) %>% sql_render())
    #> Warning: Missing values are always removed in SQL.
    #> Use `SUM(x, na.rm = TRUE)` to silence this warning
    #> This warning is displayed only once per session.
    
    req<-gsub("(SUM)(\\(.{1,30}mean.{1,10}\\))", "AVG\\2", req, perl=TRUE)
    print(req)
    #> [1] "SELECT `carb`, AVG(`cyl_mean`) AS `cyl_mean`, AVG(`disp_mean`) AS `disp_mean`, 
    # SUM(`mpg_sum`) AS `mpg_sum`\nFROM (select * from mtcars_table)\n
    # GROUP BY `carb`"
    
    dbGetQuery(db, req)
    #>   carb cyl_mean disp_mean mpg_sum
    #> 1    1 4.571429  134.2714   177.4
    #> 2    2 5.600000  208.1600   224.0
    #> 3    3 8.000000  275.8000    48.9
    #> 4    4 7.200000  308.8200   157.9
    #> 5    6 6.000000  145.0000    19.7
    #> 6    8 8.000000  301.0000    15.0
    

    sessionInfo()

    R version 3.6.1 (2019-07-05)
    Platform: x86_64-pc-linux-gnu (64-bit)
    Running under: Ubuntu 16.04.6 LTS
    
    Matrix products: default
    BLAS:   /usr/lib/libblas/libblas.so.3.6.0
    LAPACK: /usr/lib/lapack/liblapack.so.3.6.0
    
    locale:
     [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8       
     [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
     [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
    [10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
    
    attached base packages:
    [1] stats     graphics  grDevices utils     datasets  methods   base     
    
    other attached packages:
     [1] DBI_1.1.0       forcats_0.4.0   stringr_1.4.0   dplyr_0.8.4     purrr_0.3.3    
     [6] readr_1.3.1     tidyr_1.0.2     tibble_2.1.3    ggplot2_3.2.1   tidyverse_1.3.0
    [11] dbplyr_1.4.2    RSQLite_2.2.0  
    
    loaded via a namespace (and not attached):
     [1] xfun_0.10        tidyselect_1.0.0 haven_2.2.0      lattice_0.20-38  colorspace_1.4-1
     [6] vctrs_0.2.2      generics_0.0.2   htmltools_0.4.0  blob_1.2.1       rlang_0.4.4     
    [11] pillar_1.4.3     glue_1.3.1       withr_2.1.2      bit64_0.9-7      modelr_0.1.5    
    [16] readxl_1.3.1     lifecycle_0.1.0  munsell_0.5.0    gtable_0.3.0     cellranger_1.1.0
    [21] rvest_0.3.5      memoise_1.1.0    evaluate_0.14    knitr_1.25       callr_3.3.2     
    [26] ps_1.3.0         fansi_0.4.1      broom_0.5.2      Rcpp_1.0.3       clipr_0.7.0     
    [31] scales_1.1.0     backports_1.1.5  jsonlite_1.6.1   fs_1.3.1         bit_1.1-15.1    
    [36] hms_0.5.3        digest_0.6.23    stringi_1.4.5    processx_3.4.1   grid_3.6.1      
    [41] cli_2.0.1        tools_3.6.1      magrittr_1.5     lazyeval_0.2.2   whisker_0.4     
    [46] crayon_1.3.4     pkgconfig_2.0.3  xml2_1.2.2       reprex_0.3.0     lubridate_1.7.4 
    [51] assertthat_0.2.1 rmarkdown_1.16   httr_1.4.1       rstudioapi_0.10  R6_2.4.1        
    [56] nlme_3.1-141     compiler_3.6.1  
    

    【讨论】:

      猜你喜欢
      • 1970-01-01
      • 1970-01-01
      • 2020-03-11
      • 2022-11-15
      • 1970-01-01
      • 1970-01-01
      • 2019-09-24
      • 2018-05-26
      • 1970-01-01
      相关资源
      最近更新 更多