【Question title】: How to select columns to work with median function in dplyr? [duplicate]
【Posted】: 2020-06-25 10:04:30
【Question】:

I have the following dataset:

library(dplyr)   # %>%, mutate(), group_by(), summarise*()
library(tibble)  # tribble(), as_tibble()

df <- tribble(
 ~id,  ~name,  ~day_1,   ~day_2,  ~day_3,  ~day_4,  ~rank,
 "101",  "a",     5,          2,      1,       8,     '1',
 "202",  "b",     8,          4,      5,       5,     '2',
 "303",  "c",    10,          6,      9,       6,     '3',
 "404",  "d",    12,          8,      5,       7,     '4',
 "505",  "e",    14,         10,      7,       9,     '5',
 "607",  "f",     5,          2,      1,       8,     '6',
 "707",  "g",     8,          4,      5,       5,     '7',    
 "808",  "h",    10,          6,      9,       6,     '8',
 "909",  "k",    12,          8,      5,       7,     '9',
"1009",  "l",    14,         10,      7,       9,    '10',
)

Thanks to @Edward, after creating the top variable and grouping the data by top, I took the median of every column whose name starts with day. The code is:

df %>%
 mutate(top = ifelse(rank <= 1, 1,
                     ifelse(rank <= 3, 3,
                            ifelse(rank <= 5, 5,
                                   ifelse(rank <= 7, 7,
                                          ifelse(rank <= 8, 8, 10)))))) %>%
 group_by(top) %>%
 summarize(day_1 = median(as.numeric(day_1), na.rm = TRUE),
           day_2 = median(as.numeric(day_2), na.rm = TRUE),
           day_3 = median(as.numeric(day_3), na.rm = TRUE),
           day_4 = median(as.numeric(day_4), na.rm = TRUE)) 

The result is:

# A tibble: 6 x 5
   top day_1 day_2 day_3 day_4
 <dbl> <dbl> <dbl> <dbl> <dbl>
1     1   5       2     1   8  
2     3  10       6     7   6  
3     5  13       9     6   8  
4     7   6.5     3     3   6.5
5     8  10       6     9   6  
6    10  12       8     5   7

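However, my real dataset has nearly 40 columns that start with day, so I would like to do this more efficiently with a function instead of writing out every column name as in summarize(day_1 = median(as.numeric(day_1), na.rm = TRUE)).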

Any ideas?

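【Question comments】:

  • Check out summarise_at, or across in the development version of dplyr; also see stackoverflow.com/questions/9723208/…
  • Thanks for the suggestion. I added this: ``` summarise_at(vars(starts_with('day')), median) ``` but it gives the following error: expecting a one-sided formula, a function, or a function name. @arg0naut91

Tags: r dataframe dplyr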


【Solution 1】:

This works for your test data:

 df %>%
  mutate(top = ifelse(rank <= 1, 1,
                      ifelse(rank <= 3, 3,
                             ifelse(rank <= 5, 5,
                                    ifelse(rank <= 7, 7,
                                           ifelse(rank <= 8, 8, 10)))))) %>%
  group_by(top) %>%
  summarise_at(vars(starts_with("day")), ~median(as.numeric(.x), na.rm = TRUE))


# A tibble: 6 x 5
    top day_1 day_2 day_3 day_4
  <dbl> <dbl> <dbl> <dbl> <dbl>
1     1   5       2     1   8  
2     3  10       6     7   6  
3     5  13       9     6   8  
4     7   6.5     3     3   6.5
5     8  10       6     9   6  
6    10  12       8     5   7  
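
The question comments mention across, which later dplyr releases offer as a replacement for the scoped summarise_at. Below is a minimal sketch of the same idea written with across, assuming dplyr 1.0.0 or newer is installed. The toy data frame and the name res_across are made up for illustration and are not the asker's data.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data (not the question's df); rank is stored as a number here.
toy <- tribble(
  ~id, ~day_1, ~day_2, ~rank,
  "a",      5,      2,     1,
  "b",      8,      4,     2,
  "c",     10,      6,     3
)

res_across <- toy %>%
  mutate(top = ifelse(rank <= 1, 1, 3)) %>%
  group_by(top) %>%
  # across() applies the lambda to every column whose name starts with "day"
  summarise(across(starts_with("day"),
                   ~ median(as.numeric(.x), na.rm = TRUE)))

res_across
```

Because the columns are selected by name prefix, the same call should cover 4 or 40 day columns without changes.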

【Comments】:

  • We have to set df$rank = as.numeric(df$rank) before the mutate. Without it, rows with rank >= 10 (two-digit values) are assigned top = 3 instead of 10. @jay.sf's result is correct.
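
The effect described in this comment can be reproduced in isolation. The sketch below uses a hypothetical vector rank_chr rather than the asker's df, and a simplified two-level ifelse, to show how comparing a character rank with a number falls back to string comparison.

```r
# Hypothetical ranks stored as character, as in the question's tribble.
rank_chr <- c("2", "9", "10")

# The number 3 is coerced to "3", so the comparison is done on strings:
# "10" sorts before "3" and therefore counts as <= 3.
chr_cmp <- rank_chr <= 3              # TRUE FALSE TRUE
num_cmp <- as.numeric(rank_chr) <= 3  # TRUE FALSE FALSE

# Simplified version of the nested ifelse() from the answers.
top_chr <- ifelse(rank_chr <= 3, 3, 10)              # 3 10 3   (rank "10" lands in 3)
top_num <- ifelse(as.numeric(rank_chr) <= 3, 3, 10)  # 3 10 10
```

Converting rank once, before the mutate, avoids repeating as.numeric() in every comparison.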
【Solution 2】:

In base R you can use aggregate. I start from your original data frame and use the cut approach to create the top column.

res <- aggregate(cbind(day_1, day_2, day_3, day_4) ~ top, 
                 transform(df, top=cut(as.numeric(df$rank), c(0, 1, 3, 5, 7, 8, 10),
                                       c(1, 3, 5, 7, 8, 10))), 
                 FUN=function(x) median(as.numeric(x)))
res
#   top day_1 day_2 day_3 day_4
# 1   1   5.0     2     1   8.0
# 2   3   9.0     5     7   5.5
# 3   5  13.0     9     6   8.0
# 4   7   6.5     3     3   6.5
# 5   8  10.0     6     9   6.0
# 6  10  13.0     9     6   8.0

as_tibble(res)
# # A tibble: 6 x 5
#   top   day_1 day_2 day_3 day_4
#   <fct> <dbl> <dbl> <dbl> <dbl>
# 1 1       5       2     1   8  
# 2 3       9       5     7   5.5
# 3 5      13       9     6   8  
# 4 7       6.5     3     3   6.5
# 5 8      10       6     9   6  
# 6 10     13       9     6   8  
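
To make the binning step easier to follow, here is a small sketch of what the cut call does on its own, using a hypothetical vector rank_num of the ranks 1 to 10. By default cut builds right-closed intervals, so the breaks give (0,1], (1,3], (3,5], (5,7], (7,8] and (8,10].

```r
# Hypothetical numeric ranks 1..10, binned with the same breaks and labels
# as in the aggregate() call above.
rank_num <- 1:10
top <- cut(rank_num,
           breaks = c(0, 1, 3, 5, 7, 8, 10),
           labels = c(1, 3, 5, 7, 8, 10))

data.frame(rank = rank_num, top = top)

# cut() returns a factor; go through as.character() to get the numbers back.
top_num <- as.numeric(as.character(top))
```

This is why top shows up as <fct> in the tibble output; the conversion on the last line is only needed if a numeric top column is wanted.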


【Solution 3】:

 df %>%
  mutate(top = ifelse(rank <= 1, 1,
                      ifelse(rank <= 3, 3,
                             ifelse(rank <= 5, 5,
                                    ifelse(rank <= 7, 7,
                                           ifelse(rank <= 8, 8, 10)))))) %>%
  group_by(top) %>%
  # columns 3 to 6 are day_1..day_4; widen the range if you have more day columns, e.g. 3:40
  summarise_at(3:6, median, na.rm = TRUE)

# A tibble: 6 x 5
    top day_1 day_2 day_3 day_4
  <dbl> <dbl> <dbl> <dbl> <dbl>
1     1   5       2     1   8  
2     3  10       6     7   6  
3     5  13       9     6   8  
4     7   6.5     3     3   6.5
5     8  10       6     9   6  
6    10  12       8     5   7  
    
