【Question title】: R: using group_by and summarise for 7 variables, but only getting one result?
【Posted】: 2020-09-13 21:45:28
【Question】:

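I have a large dataset. I group it by year, select 7 variables, and then use summarise to try to get statistics for each variable within each group. But I only get one set of statistics per group, not one set per variable. How should I interpret this result, and how can I get results for each variable?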

v<-colnames(Cashflow)[c(2,4:ncol(Cashflow))]
Cstats<-Cashflow%>%
  group_by(Y)%>%
  summarise(mean = mean(get(v),na.rm = TRUE),
            observation = n(),
            sd = sd(get(v),na.rm = TRUE),
            min = min(get(v),na.rm = TRUE),
            q25 = quantile(get(v),probs = c(0.25),na.rm = TRUE),
            median = median(get(v),na.rm = TRUE),
            q75 = quantile(get(v),probs = c(0.75),na.rm = TRUE),
            max = max(get(v),na.rm = TRUE))

And my result looks like this:

year mean sd min
1997 1    2   3
1998 2    3   4

Once I add a for loop:

    for (name in v){
      Cashflow%>%
      group_by(Y)%>%
      summarise(mean = mean(get(name),na.rm = TRUE),
                observation = n(),
                sd = sd(get(name),na.rm = TRUE),

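I get this error:

`summarise()` ungrouping output (override with `.groups` argument)

`summarise()` ungrouping output (override with `.groups` argument)

`summarise()` ungrouping output (override with `.groups` argument)

Can anyone give me some advice?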

【Comments】:

  • This is not an error, just a friendly warning. You can remove it with .groups = 'drop' or one of the other .groups options in summarise.
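
To illustrate the comment above, here is a minimal sketch of the loop from the question with `.groups = "drop"` set explicitly. It assumes `mtcars` as a stand-in for `Cashflow` and `gear` as a stand-in for the grouping column `Y`; the `results` list is a hypothetical name introduced only for this example. Note that a `for` loop does not print or keep the value of each iteration, so the summaries have to be stored (or printed) explicitly.

```r
library(dplyr)

# Assumption: mtcars stands in for Cashflow, and gear for the grouping column Y.
v <- c("mpg", "disp", "hp")

results <- list()  # hypothetical container, one summary table per variable
for (name in v) {
  results[[name]] <- mtcars %>%
    group_by(gear) %>%
    summarise(mean = mean(.data[[name]], na.rm = TRUE),
              sd = sd(.data[[name]], na.rm = TRUE),
              observation = n(),
              .groups = "drop")  # ungroup explicitly, so no message is printed
}

results$mpg  # summary for the first variable
```

Here `.data[[name]]` looks up the column whose name is stored in `name`, which avoids relying on `get()` inside `summarise()`.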

Tags: r


【Solution 1】:

If we want to do this for multiple columns, use across instead of get (get only returns the values of the first column named in v).

library(dplyr)

# all_of() selects the columns whose names are stored in the character vector v
Cashflow %>%
  group_by(Y) %>%
  summarise(across(all_of(v),
                   list(mean   = ~ mean(.x, na.rm = TRUE),
                        sd     = ~ sd(.x, na.rm = TRUE),
                        min    = ~ min(.x, na.rm = TRUE),
                        median = ~ median(.x, na.rm = TRUE),
                        q25    = ~ quantile(.x, probs = 0.25, na.rm = TRUE),
                        q75    = ~ quantile(.x, probs = 0.75, na.rm = TRUE))),
            observation = n(),  # row count per group, computed once
            .groups = 'drop')

Using a reproducible example:

data(mtcars)
v <- names(mtcars)[c(1, 3:7)]
mtcars %>%
  group_by(gear) %>%
  summarise(across(all_of(v),
                   list(mean   = ~ mean(.x, na.rm = TRUE),
                        sd     = ~ sd(.x, na.rm = TRUE),
                        min    = ~ min(.x, na.rm = TRUE),
                        median = ~ median(.x, na.rm = TRUE),
                        q25    = ~ quantile(.x, probs = 0.25, na.rm = TRUE),
                        q75    = ~ quantile(.x, probs = 0.75, na.rm = TRUE))),
            observation = n(),
            .groups = 'drop')
# A tibble: 3 x 39
#  gear mpg_mean mpg_sd mpg_min mpg_median mpg_q25 mpg_q75 disp_mean disp_sd disp_min disp_median disp_q25 disp_q75 hp_mean hp_sd
#  <dbl>    <dbl>  <dbl>   <dbl>      <dbl>   <dbl>   <dbl>     <dbl>   <dbl>    <dbl>       <dbl>    <dbl>    <dbl>   <dbl> <dbl>
#1     3     16.1   3.37    10.4       15.5    14.5    18.4      326.    94.9    120.         318     276.       380   176.   47.7
#2     4     24.5   5.28    17.8       22.8    21      28.1      123.    38.9     71.1        131.     78.9      160    89.5  25.9
#3     5     21.4   6.66    15         19.7    15.8    26        202.   115.      95.1        145     120.       301   196.  103. 
# … with 24 more variables: hp_min <dbl>, hp_median <dbl>, hp_q25 <dbl>, hp_q75 <dbl>, drat_mean <dbl>, drat_sd <dbl>,
#   drat_min <dbl>, drat_median <dbl>, drat_q25 <dbl>, drat_q75 <dbl>, wt_mean <dbl>, wt_sd <dbl>, wt_min <dbl>, wt_median <dbl>,
#   wt_q25 <dbl>, wt_q75 <dbl>, qsec_mean <dbl>, qsec_sd <dbl>, qsec_min <dbl>, qsec_median <dbl>, qsec_q25 <dbl>, qsec_q75 <dbl>,
#   observation <int>
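
The wide result above has one column per variable and statistic. If one row per group and variable is easier to read, the output can be reshaped. The following is only a sketch of one possible approach, not part of the original answer: it assumes the tidyr package is available and relies on the default `variable_statistic` column names that `across` produces, so it would need adjusting if a variable name itself contained an underscore. The names `wide` and `long` are hypothetical.

```r
library(dplyr)
library(tidyr)  # assumption: tidyr is installed; the original answer does not use it

v <- c("mpg", "disp", "hp")

# Same across() pattern as above, with fewer statistics to keep the sketch short
wide <- mtcars %>%
  group_by(gear) %>%
  summarise(across(all_of(v),
                   list(mean = ~ mean(.x, na.rm = TRUE),
                        sd = ~ sd(.x, na.rm = TRUE))),
            observation = n(),
            .groups = "drop")

# Split names such as "mpg_mean" into a variable column and one column per statistic
long <- wide %>%
  pivot_longer(cols = -c(gear, observation),
               names_to = c("variable", ".value"),
               names_sep = "_")

long
```

Each row of `long` then holds the mean and sd of one variable within one `gear` group.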

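【Comments】:

  • Could you explain why we use the symbol "~" here? And why should the function n() be outside across()? Thanks
  • @ling If we used n() inside, it would create an n column for each of those 7 columns, all containing the same output. That is why it is kept outside. The ~ is shorthand for an anonymous function, i.e. function(x).
  • Understood. Thanks for the explanation!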
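
Both points from this exchange can be checked with a small sketch. It assumes `mtcars` as example data with two columns, and the object names `a`, `b`, and `inside` are hypothetical. The first comparison writes the same summary with the `~` shorthand and with an explicit `function(x)`; the second puts `n()` inside `across()` to show the redundant count columns it produces.

```r
library(dplyr)

v <- c("mpg", "disp")

# Formula shorthand: .x refers to the current column
a <- mtcars %>%
  group_by(gear) %>%
  summarise(across(all_of(v), ~ mean(.x, na.rm = TRUE)), .groups = "drop")

# The same summary written with an explicit anonymous function
b <- mtcars %>%
  group_by(gear) %>%
  summarise(across(all_of(v), function(x) mean(x, na.rm = TRUE)), .groups = "drop")

all.equal(a, b)  # compares the two results

# n() inside across(): one count column per variable, all holding the same values
inside <- mtcars %>%
  group_by(gear) %>%
  summarise(across(all_of(v),
                   list(mean = ~ mean(.x, na.rm = TRUE), n = ~ n())),
            .groups = "drop")

names(inside)
identical(inside$mpg_n, inside$disp_n)  # the count columns duplicate each other
```

Keeping `n()` outside `across()` therefore gives a single `observation` column instead of one copy per variable.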