Summary Statistics in R: Answers

【Question title】: Summary Statistics in R
【Posted】: 2017-02-20 20:51:02
【Question description】:

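How can I generate summary statistics (mean, standard deviation, range, sample size) for several categories (the different measurements in row 1) across different species (column 1) at the same time, and write them to a data file with write.csv()? I can do this easily one species at a time, but I want to keep the data for all species in a single .csv file and generate the summary statistics in one pass.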

【Question comments】:

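Welcome to StackOverflow! Please take a quick look at how to ask and at how to make a reproducible example. You can then come back and edit your question, adding an example and some code that shows what you have tried, along with anything else that helps clarify the question.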

Tags: r csv statistics summary

【Solution 1】:

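I know what you mean. Say you want the mean, standard deviation, range, and sample size. The range gave me trouble, because R's range() function does not return a single number; it returns the smallest and the largest value in the data. The magic is in tapply(). I just used the transpose t() and as.matrix() to make the results easier to put into a data frame.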

Anyway, take a look at the built-in iris dataset.

data(iris)

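I will give you the mean, standard deviation, and sample size for Sepal Length only, use rbind() to write all the values into the rows of a data frame, and finally use rownames() to name the rows.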

Do this:

mean_sepal_length = t(as.matrix(tapply(iris$Sepal.Length, iris$Species, mean)))
mean_sepal_length

sd_sepal_length = t(as.matrix(tapply(iris$Sepal.Length, iris$Species, FUN = sd)))
sd_sepal_length


sample_size_sepal_length = t(as.matrix(tapply(iris$Sepal.Length, iris$Species, FUN = length)))
sample_size_sepal_length


df_sepal_length <- data.frame(mean_sepal_length)
df_sepal_length

View(df_sepal_length)

df_sepal_length = rbind(df_sepal_length, sd_sepal_length)

df_sepal_length = rbind(df_sepal_length, sample_size_sepal_length)

rownames(df_sepal_length) <- c("Mean_sepal_length", "sd_sepal_length", "size_sepal_length")

write.csv(df_sepal_length, "C:/Users/me/Documents/tapply_miracle.csv")
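The answer mentions the range but does not compute it. As a minimal sketch in the same tapply() style, the minimum and maximum can be taken separately, so that each statistic is a single number per species. The variable names below are illustrative and not part of the original answer.

```r
# Minimum and maximum sepal length per species, computed separately
# because range() returns two values per group.
min_sepal_length <- tapply(iris$Sepal.Length, iris$Species, min)
max_sepal_length <- tapply(iris$Sepal.Length, iris$Species, max)

# Width of the range (max - min) as a single number per species.
width_sepal_length <- max_sepal_length - min_sepal_length

# Stack the three statistics as rows; the columns are the species.
range_rows <- rbind(
  min_sepal_length   = min_sepal_length,
  max_sepal_length   = max_sepal_length,
  width_sepal_length = width_sepal_length
)
range_rows
```

These rows could then be appended to the data frame with rbind() in the same way as the standard deviation and sample size.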

【Comments】:

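Thank you very much. I can already get all of this for each species separately. But when one data matrix (.csv file) contains several species, I want to do it all in one go, rather than cutting the matrix into single-species matrices and running each one separately. Is there a script for that?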

【Solution 2】:

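I was thinking about the answer I gave the other day and realized it could be better, because tapply() accepts a list for its INDEX argument. In my earlier example I only knew that tapply() could group by one factor, but we can specify several. The trick is to use melt() to reshape the iris data frame from wide to long format, which makes it easier to read, and then to call tapply() with a list argument: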

> install.packages("reshape2")
> library(reshape2)

# I used melt() to reshape the iris data frame from wide to long, turning the many columns into rows with fewer columns, and I coerced the result back to a data frame.

> iris_melt <- data.frame(melt(data = iris, id.vars = "Species", variable.name = "iris_factors", value.name = "iris_dimensions_cm"))

> head(iris_melt)
  Species iris_factors iris_dimensions_cm
1  setosa Sepal.Length                5.1
2  setosa Sepal.Length                4.9
3  setosa Sepal.Length                4.7
4  setosa Sepal.Length                4.6
5  setosa Sepal.Length                5.0
6  setosa Sepal.Length                5.4

Here we get the mean flower dimension for every iris measurement (sepal length, sepal width, petal length, and petal width) across all species (setosa, versicolor, virginica).

> tapply(X = iris_melt$iris_dimensions_cm, INDEX = list(iris_melt$Species, iris_melt$iris_factors), FUN = mean)
           Sepal.Length Sepal.Width Petal.Length Petal.Width
setosa            5.006       3.428        1.462       0.246
versicolor        5.936       2.770        4.260       1.326
virginica         6.588       2.974        5.552       2.026
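For comparison, the same species-by-measurement table of means can be sketched in base R without reshape2, by applying tapply() to each numeric column in turn. This is an alternative technique, not the one used in the answer, and the name means_by_species is illustrative.

```r
# Apply tapply() to each of the four measurement columns of iris.
# sapply() collects the per-species means into a matrix with
# species as rows and measurements as columns.
means_by_species <- sapply(
  iris[, 1:4],
  function(column) tapply(column, iris$Species, mean)
)
means_by_species
```

Swapping mean for sd or length gives the other statistics in the same layout.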

If we change the order of the factors in the INDEX list, we get the same information in a slightly different layout, with rows and columns swapped:

> tapply(X = iris_melt$iris_dimensions_cm, INDEX = list(iris_melt$iris_factors, iris_melt$Species), FUN = mean)
             setosa versicolor virginica
Sepal.Length  5.006      5.936     6.588
Sepal.Width   3.428      2.770     2.974
Petal.Length  1.462      4.260     5.552
Petal.Width   0.246      1.326     2.026

Getting the standard deviation is easy. Just change the FUN argument:

> tapply(X = iris_melt$iris_dimensions_cm, INDEX = list(iris_melt$iris_factors, iris_melt$Species), FUN = sd)
                setosa versicolor virginica
Sepal.Length 0.3524897  0.5161711 0.6358796
Sepal.Width  0.3790644  0.3137983 0.3224966
Petal.Length 0.1736640  0.4699110 0.5518947
Petal.Width  0.1053856  0.1977527 0.2746501

Now I hardly need rbind() at all.
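The original question asks for all statistics for all species in a single .csv file. One possible way to get there, sketched with base R only, is aggregate() on long-format data. This swaps tapply() for aggregate() and melt() for stack(); the column names, the object names, and the output file name are assumptions made for the illustration.

```r
# Reshape iris to long format with base R only (no reshape2 needed).
long <- stack(iris[, 1:4])                     # columns: values, ind
names(long) <- c("value", "measurement")
long$Species <- rep(iris$Species, times = 4)   # stack() appends columns in order

# One row per species/measurement pair, several statistics per row.
stats <- aggregate(
  value ~ Species + measurement,
  data = long,
  FUN = function(x) c(mean = mean(x), sd = sd(x),
                      min = min(x), max = max(x), n = length(x))
)

# aggregate() stores the statistics as a matrix column; flatten it
# into ordinary columns named value.mean, value.sd, and so on.
stats <- do.call(data.frame, stats)

# Write to a temporary file; replace with your own path.
out_file <- file.path(tempdir(), "iris_summary.csv")
write.csv(stats, out_file, row.names = FALSE)
```

The result has one row for each of the 12 species/measurement combinations, which keeps the file easy to filter or pivot later.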

【Comments】: