How to calculate the mean and sd of FPKM gene counts by group and combine them into a data frame? (Answers)

【Question title】: How to calculate the mean and sd of FPKM gene counts by group and combine the mean and sd as a data frame?
【Posted】: 2021-01-27 10:33:31
【Question description】:

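Fortunately, the first step, computing the mean and sd by group, is already done, so I now have the mean and sd results as two separate data frames. What I want to know is how to combine them. The combining method itself can be simple or complicated, but the combined data frame should be simple.

Below I show how I compute the statistics and the only combining method I know. I am looking for a different way to combine them. My sample data and code are as follows: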

library(dplyr)   # %>%, group_by, summarise, bind_rows
library(tidyr)   # pivot_longer, pivot_wider

data <- data.frame(matrix(sample(1:1000, 500), 20, 25))
names(data) <- paste0("Gene_", 1:25)
rownames(data) <- NULL
data$Name <- rep(paste0("Group_", 1:10), each = 2)

unique(data$Name)
## 1. group_by: only gives the result for one column at a time
mm <- data %>% 
  group_by(Name) %>% 
  summarise(mean = mean(Gene_1))
mm

## 2. tapply: gives the group means of a column, but only one column at a time
mm <- data.frame(mean_Gene_1 = tapply(data[, "Gene_1"], data$Name, mean))
mm

## 3. aggregate: a powerful function that gives the results for all columns by group
mm <- aggregate(. ~ Name, data, mean)
mm

## get the mean and sd data frames
## (named mean_df / sd_df so they do not mask the mean() and sd() functions)
mean_df <- aggregate(. ~ Name, data, mean)
sd_df <- aggregate(. ~ Name, data, sd)

## now combine the two data frames using the shared keys "Name" and "Gene"
## I learned this one method from somebody on Stack Overflow.
## stack the two data frames, labelling each row with its statistic
combined <- bind_rows(list(mean = mean_df, sd = sd_df), .id = "stat")

## reshape to one row per (Name, Gene) with a mean column and an sd column
data_mean_sd <- combined %>% 
  pivot_longer(-c(Name, stat), names_to = "Gene", values_to = "value") %>%
  pivot_wider(names_from = "stat", values_from = "value")

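The result is correct. But the resulting file is large, even though this is only a small example, and it contains a lot of repeated data. I hope someone can suggest a better way to simplify my result.

I would appreciate your help.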

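【Question comments】:

Tags: r mean

【Solution 1】:

I'm not sure whether the following approach works for you. The last part is basically the same, using pivot_longer and pivot_wider, but for the summarising step I use dplyr::across.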

library(dplyr)
library(tidyr)

data<-data.frame(matrix(sample(1:1000,500),20,25))
names(data) <- c(paste0("Gene_", 1:25))
rownames(data)<-NULL
data$Name<-c(rep(paste0("Group_",1:10),each=2))


data %>% 
  group_by(Name) %>% 
  summarise(across(everything(),
                   list(mean = ~ mean(.x),
                        sd = ~ sd(.x)),
                   .names = "{col}__{fn}")) %>% 
  pivot_longer(-c(Name), names_to = "Gene", values_to = "value") %>% 
  separate(., Gene, into = c("Gene", "Stats"), sep = "__") %>% 
  pivot_wider(names_from = Stats, values_from = "value")

#> `summarise()` ungrouping output (override with `.groups` argument)
#> # A tibble: 250 x 4
#>    Name    Gene     mean     sd
#>    <chr>   <chr>   <dbl>  <dbl>
#>  1 Group_1 Gene_1   534. 556.  
#>  2 Group_1 Gene_2   294.  51.6 
#>  3 Group_1 Gene_3   262. 350.  
#>  4 Group_1 Gene_4   615  338.  
#>  5 Group_1 Gene_5    89   43.8 
#>  6 Group_1 Gene_6   322  263.  
#>  7 Group_1 Gene_7   696. 391.  
#>  8 Group_1 Gene_8   182. 101.  
#>  9 Group_1 Gene_9   582  139.  
#> 10 Group_1 Gene_10  184    2.83
#> # ... with 240 more rows

Created on 2021-01-27 by the reprex package (v0.3.0)
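
A possible variation on the pipeline above, offered only as a sketch and not part of the original answer: reshape to long format first, then group by both Name and Gene, so the separate() and pivot_wider() steps are no longer needed. It assumes dplyr 1.0.0 or later (for the `.groups` argument); the `set.seed(1)` call is an addition made here purely for reproducibility.

```r
library(dplyr)
library(tidyr)

set.seed(1)  # assumption: added here only to make the sketch reproducible
data <- data.frame(matrix(sample(1:1000, 500), 20, 25))
names(data) <- paste0("Gene_", 1:25)
data$Name <- rep(paste0("Group_", 1:10), each = 2)

data_mean_sd <- data %>%
  # one row per (sample, gene) pair
  pivot_longer(-Name, names_to = "Gene", values_to = "value") %>%
  # summarise each (group, gene) pair in a single step
  group_by(Name, Gene) %>%
  summarise(mean = mean(value), sd = sd(value), .groups = "drop")

data_mean_sd
```

The result has the same four columns (Name, Gene, mean, sd) and one row per group and gene. Rows come out sorted by Name and Gene as character strings, so Gene_10 sorts before Gene_2.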

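【Comments】:

Thanks. You just used a different method; the result is the same as mine.
I tried to simplify the method, but I'm not sure how the result itself could be simplified. For each group and gene you should have one mean and one sd, which gives you 500 values. Could you explain how you would like the result to be simplified?
Sorry, I have thought it over, and there may be no way to do that. I have an FPKM-normalized gene count file; I computed the mean and sd and combined them. The resulting file has 2262224 rows and 4 columns. I think I should look for some other approach. Many thanks.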
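
One option not raised in the thread, sketched here under the assumption that a wide layout is acceptable downstream: keep one row per group and put the mean and sd of each gene in side-by-side columns. This avoids repeating the Name and Gene labels on every row, which is where most of the duplication in the long format comes from. It uses only base R; the `set.seed(1)` call and the name `wide_mean_sd` are illustrative additions.

```r
set.seed(1)  # assumption: added here only to make the sketch reproducible
data <- data.frame(matrix(sample(1:1000, 500), 20, 25))
names(data) <- paste0("Gene_", 1:25)
data$Name <- rep(paste0("Group_", 1:10), each = 2)

# One aggregate() call: each gene column becomes a two-column matrix (mean, sd)
agg <- aggregate(. ~ Name, data,
                 function(x) c(mean = mean(x), sd = sd(x)))

# Flatten the matrix columns into ordinary columns: Gene_1.mean, Gene_1.sd, ...
wide_mean_sd <- do.call(data.frame, agg)

dim(wide_mean_sd)
names(wide_mean_sd)[1:5]
```

The table then has one row per group and two columns per gene, holding the same numbers as the long format. Whether that is actually "simpler" depends on what the next analysis step expects.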