Using the apply functions to calculate the mean of a column: answers

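【Question title】: Using apply function to calculate the mean of a column
【Posted】: 2020-10-31 22:39:05
【Question】:

After splitting a data frame into one data frame per country, I want to calculate the mean of a column in each of the per-country data frames. tapply works, but when I try sapply(), every country strangely gets the mean of the first country. I don't know why. I was asked to use sapply as an exercise, so I'd like to know how to fix my code. Any pointers would be much appreciated. (It is probably a silly mistake.)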

Input / my code:

strikes.df = read.csv("http://www.stat.cmu.edu/~pfreeman/strikes.csv")
strikes.by.country=split(strikes.df,strikes.df$country)
my.fun=function(x=strikes.by.country){
  l=length(strikes.by.country)
  for (i in 1:l){
    return(strikes.by.country[[i]]$centralization %>% mean)
  }
}

sapply(strikes.by.country, my.fun)

#using tapply()
tapply(strikes.df[,"centralization",],INDEX=strikes.df[,"country",],FUN=mean)

Output:

  Australia     Austria     Belgium      Canada     Denmark 
   0.374644    0.374644    0.374644    0.374644    0.374644 
    Finland      France     Germany     Ireland       Italy 
   0.374644    0.374644    0.374644    0.374644    0.374644 
      Japan Netherlands New.Zealand      Norway      Sweden 
   0.374644    0.374644    0.374644    0.374644    0.374644 
Switzerland          UK         USA 
   0.374644    0.374644    0.374644

 
  Australia     Austria     Belgium      Canada     Denmark 
0.374644022 0.997670495 0.749485177 0.002244134 0.499958552 
    Finland      France     Germany     Ireland       Italy 
0.750374065 0.002729909 0.249968231 0.499711882 0.250699502 
      Japan Netherlands New.Zealand      Norway      Sweden 
0.124675342 0.749602699 0.375940378 0.875341821 0.875253817 
Switzerland          UK         USA 
0.499990005 0.375946785 0.002390639

I was instructed to use sapply after using split; that is why the only thing I could think of was a for loop.

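【Question comments】:

Could you use dput() to include some of the data in the question? Also, you define the argument x in your function but never use it in the function body; you keep referring to the split data frame by name instead.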
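
The comment above points at the cause of the repeated value. sapply calls my.fun once per country and passes that country's data frame as x, but the body ignores x and loops over the full list, and return() exits during the first iteration, so every call yields the mean for the first country. A minimal sketch of the problem and one possible fix follows, using a small made-up data frame (toy.df, its values, and the names broken.fun and fixed.fun are assumptions for illustration, not the strikes data):

```r
# Toy stand-in for the strikes data (hypothetical values)
toy.df <- data.frame(
  country = c("A", "A", "B", "B", "C", "C"),
  centralization = c(1, 3, 10, 20, 100, 300)
)
toy.by.country <- split(toy.df, toy.df$country)

# Same shape as the function in the question: x is never used, and
# return() leaves the function during the first loop iteration
broken.fun <- function(x = toy.by.country) {
  for (i in seq_along(toy.by.country)) {
    return(mean(toy.by.country[[i]]$centralization))
  }
}
broken <- sapply(toy.by.country, broken.fun)

# Possible fix: work on the argument that sapply passes in
fixed.fun <- function(x) mean(x$centralization)
fixed <- sapply(toy.by.country, fixed.fun)

print(broken)  # every entry is the mean for country "A"
print(fixed)   # one mean per country
```

No loop is needed inside the function, because sapply already iterates over the list elements.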

Tags: r function dataframe matrix apply

【Solution 1】:

It is simpler to use sapply over the unique country names. There is actually no need to split anything.

sapply(unique(strikes.df$country), function(x) 
  mean(strikes.df[strikes.df$country == x, "centralization"]))
#   Australia     Austria     Belgium      Canada     Denmark     Finland      France 
# 0.374644022 0.997670495 0.749485177 0.002244134 0.499958552 0.750374065 0.002729909 
#     Germany     Ireland       Italy       Japan Netherlands New.Zealand      Norway 
# 0.249968231 0.499711882 0.250699502 0.124675342 0.749602699 0.375940378 0.875341821 
#      Sweden Switzerland          UK         USA 
# 0.875253817 0.499990005 0.375946785 0.002390639

But if you are required to use split, you can do this:

sapply(split(strikes.df$centralization, strikes.df$country), mean)
#   Australia     Austria     Belgium      Canada     Denmark     Finland      France 
# 0.374644022 0.997670495 0.749485177 0.002244134 0.499958552 0.750374065 0.002729909 
#     Germany     Ireland       Italy       Japan Netherlands New.Zealand      Norway 
# 0.249968231 0.499711882 0.250699502 0.124675342 0.749602699 0.375940378 0.875341821 
#      Sweden Switzerland          UK         USA 
# 0.875253817 0.499990005 0.375946785 0.002390639

Or, written as two lines:

s <- split(strikes.df$centralization, strikes.df$country)
sapply(s, mean)

Edit

If you need to split the whole data frame, do

s <- split(strikes.df, strikes.df$country)
sapply(s, function(x) mean(x[, "centralization"]))

or

foo <- function(x) mean(x[, "centralization"])
sapply(s, foo)

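【Comments】:

Thanks, but I was instructed to use sapply after using split; that is why the only thing that occurred to me was a for loop. Could you give me some pointers on my code?
@Isabella My split line is evaluated from the inside out, so split is applied first and sapply(., mean) second. See the edit!
Thanks :) That is really easy to follow. Is it possible to use split(strikes.df, strikes.df$country) and then use sapply with a custom function, like I did? (The instructions still say I should only split the whole data frame by country.)
Thanks, that worked! So is there any way to make the for loop work?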
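
On the last question above: a for loop can work if it stores each result instead of returning from inside the loop. The sketch below is one way to do that, again on a made-up data frame (toy.df and its values are assumptions for illustration; with the real data, strikes.by.country would take the place of toy.by.country):

```r
# Toy stand-in for the strikes data (hypothetical values)
toy.df <- data.frame(
  country = c("A", "A", "B", "B", "C", "C"),
  centralization = c(1, 3, 10, 20, 100, 300)
)
toy.by.country <- split(toy.df, toy.df$country)

# Preallocate one slot per country and carry over the country names
means <- numeric(length(toy.by.country))
names(means) <- names(toy.by.country)

# Store each mean in its slot rather than calling return() in the loop
for (i in seq_along(toy.by.country)) {
  means[i] <- mean(toy.by.country[[i]]$centralization)
}

print(means)
```

This loop replaces the sapply call rather than living inside the function given to it, since sapply already does the iteration.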

【Solution 2】:

Using the gapminder::gapminder dataset as example data, this can be done as follows:

The example code computes the mean life expectancy (lifeExp) by continent.

# sapply: simplifies. returns a vector
sapply(split(gapminder::gapminder, gapminder::gapminder$continent), function(x) mean(x$lifeExp, na.rm = TRUE))
#>   Africa Americas     Asia   Europe  Oceania 
#> 48.86533 64.65874 60.06490 71.90369 74.32621
# lapply: returns a list
lapply(split(gapminder::gapminder, gapminder::gapminder$continent), function(x) mean(x$lifeExp, na.rm = TRUE))
#> $Africa
#> [1] 48.86533
#> 
#> $Americas
#> [1] 64.65874
#> 
#> $Asia
#> [1] 60.0649
#> 
#> $Europe
#> [1] 71.90369
#> 
#> $Oceania
#> [1] 74.32621
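
Because sapply picks its return shape from the results, vapply is sometimes preferred when the result type is known in advance: it takes a template (here numeric(1)) and raises an error if any result does not match it. A small sketch with made-up data (toy.df and its values are assumptions, used so the example runs without the gapminder package):

```r
# Toy stand-in for the gapminder data (hypothetical values)
toy.df <- data.frame(
  continent = c("X", "X", "Y", "Y"),
  lifeExp = c(50, 60, 70, NA)
)
toy.by.continent <- split(toy.df, toy.df$continent)

# lapply always returns a list, one element per continent
list.result <- lapply(toy.by.continent,
                      function(x) mean(x$lifeExp, na.rm = TRUE))

# vapply checks that each result is a single numeric value
checked <- vapply(toy.by.continent,
                  function(x) mean(x$lifeExp, na.rm = TRUE),
                  FUN.VALUE = numeric(1))

print(unlist(list.result))  # flatten the list to a named vector
print(checked)
```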
