【Question title】: Aggregate with multiple duplicates and calculate their mean
【Posted】: 2018-03-30 22:02:53
【Question】:

Suppose we have a DF in which each UserID appears more than once under different names, and the names themselves can of course also repeat.

DF <- data.frame(ID=c(101,101,101,101,101,102,102,102,102),
Name=c("Ed","Ed","Hank","Hank","Hank","Sandy","Sandy","Jessica","Jessica"),
Class=c("Junior","Junior","Junior","Junior", "Junior","High","High","Mid","Mid"),
Scoring=c(11,15,18,18,12,20,22,25,26), Other_Scores=c(15,9,34,23,43,23,34,23,23))

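The goal is to aggregate by UserID and Name and to compute the mean and standard deviation for each combination. An example of the desired output: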

UserID  Name     Class    Scoring_mean  Scoring_std
101     Ed       Junior   12.5          3
101     Hank     Junior   24.67         11.62
102     Sandy    High     24.75         6.29
102     Jessica  High     24.25         1.5
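
The figures in this table appear to pool Scoring and Other_Scores for each person rather than use Scoring alone. A quick check of the Ed row, as a sketch under that assumption (the variable name ed_scores is only illustrative):

```r
# Ed's two Scoring values (11, 15) pooled with his two Other_Scores values (15, 9)
ed_scores <- c(11, 15, 15, 9)

mean(ed_scores)  # 12.5
sd(ed_scores)    # 3
```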

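So my question is:

  • What options are there for aggregating by UserID and Name without losing information (for example, Hank being coerced into Ed, as happens with summarise() or mutate())?

As I see it, R has to check which Name belongs to which UserID and, where they match, aggregate and compute the mean and standard deviation, but I could not get this to work in R with dplyr.

I also could not find any other post that really addresses this problem. The closest ones were: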

  1. How to calculate the mean of specific rows in R?
  2. Subtract pairs of columns based on matching column
  3. Calculating mean when 2 conditions need met in R
  4. average between duplicated rows in R

【Comments】:

  • Group by paste(UserID, Name)?
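
A minimal sketch of what this comment suggests, assuming dplyr is installed and using the column names from the question's data (ID rather than UserID). Passing both columns to group_by() has the same effect as grouping on a pasted key, and it keeps the two columns separate in the result. The name by_user is only illustrative.

```r
library(dplyr)

DF <- data.frame(
  ID = c(101, 101, 101, 101, 101, 102, 102, 102, 102),
  Name = c("Ed", "Ed", "Hank", "Hank", "Hank", "Sandy", "Sandy", "Jessica", "Jessica"),
  Class = c("Junior", "Junior", "Junior", "Junior", "Junior", "High", "High", "Mid", "Mid"),
  Scoring = c(11, 15, 18, 18, 12, 20, 22, 25, 26),
  Other_Scores = c(15, 9, 34, 23, 43, 23, 34, 23, 23)
)

# Grouping on ID and Name together keeps Ed and Hank apart,
# even though both share ID 101
by_user <- DF %>%
  group_by(ID, Name) %>%
  summarise(Scoring_mean = mean(Scoring),
            Scoring_sd = sd(Scoring)) %>%
  ungroup()

by_user
```

This version summarises the Scoring column only, so its numbers differ from the desired output in the question, which seems to pool both score columns.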

Tags: r dplyr aggregate


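【Solution 1】:

How about computing your summary statistics and then joining the result back onto your initial data frame? Something like this: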

DF <- data.frame(ID=c(101,101,101,101,101,102,102,102,102),
                 Name=c("Ed","Ed","Hank","Hank","Hank","Sandy","Sandy","Jessica","Jessica"),
                 Class=c("Junior","Junior","Junior","Junior", "Junior","High","High","Mid","Mid"),
                 Scoring=c(11,15,18,18,12,20,22,25,26), Other_Scores=c(15,9,34,23,43,23,34,23,23))


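library(dplyr)  # provides %>%, group_by(), summarise() and left_join()

# Summarise Scoring per Name, then join ID and Class back on
DF2 <- DF %>% group_by(Name) %>%
  summarise(scoring_mean=mean(Scoring), scoring_sd = sd(Scoring)) %>%
  left_join(DF[,c(1,2,3)], by="Name")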

which gives:

# A tibble: 9 x 5
  Name    scoring_mean scoring_sd    ID Class 
  <fct>          <dbl>      <dbl> <dbl> <fct> 
1 Ed              13.0      2.83   101. Junior
2 Ed              13.0      2.83   101. Junior
3 Hank            16.0      3.46   101. Junior
4 Hank            16.0      3.46   101. Junior
5 Hank            16.0      3.46   101. Junior
6 Jessica         25.5      0.707  102. Mid   
7 Jessica         25.5      0.707  102. Mid   
8 Sandy           21.0      1.41   102. High  
9 Sandy           21.0      1.41   102. High 

【Comments】:

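  • This answer does not fit my case, because it does not collapse the observations, which is what I need in the first place. Perhaps my question was not clear in the post, so I apologize.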
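
One way the approach in Solution 1 could be made to collapse the observations, sketched here as an assumption rather than as the answerer's own code: join the summary against the distinct ID/Name/Class combinations instead of every original row. The names stats and keys are illustrative, and grouping on Name alone only works if a name never occurs under two different IDs.

```r
library(dplyr)

DF <- data.frame(
  ID = c(101, 101, 101, 101, 101, 102, 102, 102, 102),
  Name = c("Ed", "Ed", "Hank", "Hank", "Hank", "Sandy", "Sandy", "Jessica", "Jessica"),
  Class = c("Junior", "Junior", "Junior", "Junior", "Junior", "High", "High", "Mid", "Mid"),
  Scoring = c(11, 15, 18, 18, 12, 20, 22, 25, 26),
  Other_Scores = c(15, 9, 34, 23, 43, 23, 34, 23, 23)
)

# One row of summary statistics per Name
stats <- DF %>%
  group_by(Name) %>%
  summarise(scoring_mean = mean(Scoring), scoring_sd = sd(Scoring))

# One row per unique ID/Name/Class combination
keys <- distinct(DF, ID, Name, Class)

# Joining against the de-duplicated keys yields one row per person
DF2 <- left_join(stats, keys, by = "Name")
DF2
```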
【Solution 2】:

Here is a tidyverse option that uses some reshaping to put all scores into a single column, then groups to get the summary statistics:

DF <- data.frame(
ID=c(101,101,101,101,101,102,102,102,102),
Name=c("Ed","Ed","Hank","Hank","Hank","Sandy","Sandy","Jessica","Jessica"),
Class=c("Junior","Junior","Junior","Junior", "Junior","High","High","Mid","Mid"),
Scoring=c(11,15,18,18,12,20,22,25,26), 
Other_Scores=c(15,9,34,23,43,23,34,23,23)
)

library(tidyverse)

DF %>%
  gather(score_type, score, Scoring, Other_Scores) %>%  # reshape score columns
  group_by(ID, Name, Class) %>%                         # group by combinations
  summarise(scoring_mean = mean(score),                 # get summary stats
            scoring_sd = sd(score)) %>%
  ungroup()                                             # forget the grouping

# # A tibble: 4 x 5
#       ID Name    Class  scoring_mean scoring_sd
#    <dbl> <fct>   <fct>         <dbl>      <dbl>
# 1  101. Ed      Junior         12.5       3.00
# 2  101. Hank    Junior         24.7      11.6 
# 3  102. Jessica Mid            24.2       1.50
# 4  102. Sandy   High           24.8       6.29
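
In current tidyr, gather() is superseded by pivot_longer(). The same reshaping could be written as follows, a sketch that assumes tidyr 1.0.0 or later and reuses the column names chosen in the answer (score_type, score). The name result is only illustrative.

```r
library(dplyr)
library(tidyr)

DF <- data.frame(
  ID = c(101, 101, 101, 101, 101, 102, 102, 102, 102),
  Name = c("Ed", "Ed", "Hank", "Hank", "Hank", "Sandy", "Sandy", "Jessica", "Jessica"),
  Class = c("Junior", "Junior", "Junior", "Junior", "Junior", "High", "High", "Mid", "Mid"),
  Scoring = c(11, 15, 18, 18, 12, 20, 22, 25, 26),
  Other_Scores = c(15, 9, 34, 23, 43, 23, 34, 23, 23)
)

result <- DF %>%
  # stack both score columns into one long column of values
  pivot_longer(cols = c(Scoring, Other_Scores),
               names_to = "score_type",
               values_to = "score") %>%
  # one group per ID/Name/Class combination
  group_by(ID, Name, Class) %>%
  summarise(scoring_mean = mean(score),
            scoring_sd = sd(score)) %>%
  ungroup()

result
```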

