Answers to: apply a function to the factors of each column for all columns of a data frame

【Question title】: Apply a function to factors of each column for all columns of a data frame
【Posted】: 2020-10-01 11:38:14
【Question】:

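I have a data frame with 6 columns. Each of the first 4 columns is a factor with 2 levels. I want to write a function (or a for loop) that, for each of these columns, runs a test (e.g. wilcox.test) comparing the values of the pc1 and pc2 columns between the two levels.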

If I were to do it by hand:

wilcox.test(df[df$g1=="bm",5],df[df$g1!="bm",5])
wilcox.test(df[df$g1=="bm",6],df[df$g1!="bm",6])

How can I get the p.value of each test stored in a data frame whose rows correspond to the first 4 columns of df and whose columns are pc1 and pc2?

I tried the following, but it is not correct:

mutate_if(df[,head(colnames(df),-2)], is.character, as.factor) %>% #check whether 4 first columns are as factor
  lapply(.,
  function(x) {
    df = data.frame(row.names = head(colnames(df),-2))
         names(df) = c("pc1", "pc2")
         df$pc1 = wilcox.test(df[df$g1=="bm",5],df[df$g1!="bm",5])
         df$pc2 = wilcox.test(df[df$g1=="bm",6],df[df$g1!="bm",6])
         return(df)
       }
)

My data frame:

> dput(df)
structure(list(g1 = structure(c(1L, 1L, 2L, 2L, 2L, 2L, 1L, 2L, 
1L, 1L), .Label = c("bm", "cm"), class = "factor"), g2 = structure(c(1L, 
1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 1L), .Label = c("ct", "ft"), class = "factor"), 
    g3 = structure(c(2L, 2L, 2L, 1L, 1L, 1L, 2L, 1L, 1L, 2L), .Label = c("bn", 
    "un"), class = "factor"), g4 = structure(c(2L, 2L, 1L, 1L, 
    1L, 1L, 1L, 2L, 2L, 2L), .Label = c("ls", "vp"), class = "factor"), 
    pc1 = c(0.86, 0.54, 0.06, 0.88, 0.62, 0.14, 0.94, 0.8, 0.34, 
    0.04), pc2 = c(0.04, 0.9, 0.68, 0.54, 0.92, 0.36, 0.3, 0.62, 
    0.84, 0.96)), class = "data.frame", row.names = c(NA, -10L
))

【Comments】:

Tags: r dplyr tidyverse

【Solution 1】:

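The following may give you some ideas for how to approach this.

(I have not generalised it to all tests, because I am not sure whether every test stores its p.value in the same place.)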

library(dplyr)
library(tidyr)

lapply(which(sapply(df, is.factor)),
       function(i) df[, c(i, 5, 6)] %>%

         # set column names & extract group values into a separate label
         # so that the subsequent code can be used for all four columns
         # (the label's wording can be changed as desired)
         setNames(c("group", "pc1", "pc2")) %>%
         filter(!is.na(group)) %>% # filter out NA rows
         mutate(label = paste0("Column ", i, ": ",
                               paste0(unique(as.character(group)),
                                      collapse = " vs "))) %>%
         mutate(group = paste0("group", as.integer(group))) %>%

         # pivot data such that each group of pc1 / pc2 values is in its own column
         group_by(group) %>% 
         mutate(id = seq(1, n())) %>% 
         pivot_wider(id_cols = c(label, id), 
                     names_from = group, 
                     values_from = c(pc1, pc2)) %>%

         # perform separate tests on pc1 & pc2, and extract p-value in each case
         summarise(label = unique(label),
                   pc1 = wilcox.test(pc1_group1, pc1_group2)$p.value,
                   pc2 = wilcox.test(pc2_group1, pc2_group2)$p.value)) %>%

  # combine results from each group
  data.table::rbindlist()

# result:
                label       pc1       pc2
1: Column 1: bm vs cm 1.0000000 1.0000000
2: Column 2: ct vs ft 0.6904762 0.8412698
3: Column 3: un vs bn 0.8412698 1.0000000
4: Column 4: vp vs ls 0.6904762 0.5476190
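
For the layout requested in the question (one row per grouping column, one column each for pc1 and pc2), a base R alternative can be sketched as follows. This is not the method used in the answer above. The names `group_cols`, `value_cols` and `p_for` are introduced here purely for illustration, and the sketch assumes that every grouping column has exactly two levels once NA rows are dropped.

```r
# Example data rebuilt from the dput() output in the question
df <- data.frame(
  g1 = factor(c("bm", "bm", "cm", "cm", "cm", "cm", "bm", "cm", "bm", "bm")),
  g2 = factor(c("ct", "ct", "ct", "ct", "ft", "ft", "ft", "ft", "ft", "ct")),
  g3 = factor(c("un", "un", "un", "bn", "bn", "bn", "un", "bn", "bn", "un")),
  g4 = factor(c("vp", "vp", "ls", "ls", "ls", "ls", "ls", "vp", "vp", "vp")),
  pc1 = c(0.86, 0.54, 0.06, 0.88, 0.62, 0.14, 0.94, 0.8, 0.34, 0.04),
  pc2 = c(0.04, 0.9, 0.68, 0.54, 0.92, 0.36, 0.3, 0.62, 0.84, 0.96)
)

group_cols <- c("g1", "g2", "g3", "g4")  # hypothetical name: grouping columns
value_cols <- c("pc1", "pc2")            # hypothetical name: columns to test

# Hypothetical helper: p-value of a two-sample Wilcoxon test of column `v`
# between the two levels of column `g`
p_for <- function(g, v) {
  # split() ignores rows where the grouping factor is NA
  parts <- split(df[[v]], droplevels(df[[g]]))
  stopifnot(length(parts) == 2)  # assumption: exactly two groups remain
  wilcox.test(parts[[1]], parts[[2]])$p.value
}

# Rows are the grouping columns, columns are pc1 and pc2
p_values <- as.data.frame(
  sapply(value_cols, function(v) sapply(group_cols, p_for, v = v))
)
p_values
```

Because the test is called inside a small helper, swapping in another test only requires changing that one line, provided the replacement also exposes a `p.value` element.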

【Discussion】:

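Thanks for the help, @Z.Lin. In my real data frame I have many columns, and I would like to define which columns are included in the tests. However, when I try lapply(which(sapply(df[,c(10,15,20,22,65,55,40)], is.factor)) I get an error. Is there a way around this? Also, could a line be added so that when a factor column contains NAs, they are ignored and there are always 2 factor levels?
The which(sapply(df, is.factor)) line is meant to return the column indices of the columns you want to use as grouping variables (assuming they are factors). If you prefer to specify them directly, you can simply replace that part with c(10,15,20,22,65,55,40).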
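
One possible source of trouble with the subset version, sketched below on made-up data: which() applied to a subset returns positions within that subset, not within the full data frame, so they no longer point at the intended columns when used as df[, c(i, 5, 6)]. Whether this is what caused the reported error is an assumption. The data frame `wide` and all of its column names are hypothetical.

```r
# Hypothetical data frame whose grouping factors sit at columns 2 and 4
wide <- data.frame(
  id    = 1:4,
  grp_a = factor(c("x", "y", "x", "y")),
  score = c(0.1, 0.2, 0.3, 0.4),
  grp_b = factor(c("u", "u", "v", "v"))
)

wanted <- c(2, 4)  # columns we intend to use as grouping variables

# Positions relative to the two-column subset, not to `wide`
idx_in_subset <- which(sapply(wide[, wanted], is.factor))

# Positions in the full data frame, keeping only those that are factors
idx_in_full <- wanted[sapply(wide[wanted], is.factor)]

idx_in_subset
idx_in_full
```

Passing the positions in the full data frame (or simply the vector of wanted column numbers) to lapply() keeps the indexing inside the function consistent.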
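To clarify: if a factor column contains NA values, do you want the corresponding rows of PC1/PC2 dropped from the comparison?
Yes, the corresponding rows of PC1/PC2 can be dropped.
@symo I have edited my answer to filter out NA rows during data wrangling. Note that if the number of removed rows differs between the first and second factor values, NAs in PC1 / PC2 will be passed on to the test function. For wilcox.test these are filtered out automatically (I checked the underlying stats:::wilcox.test.default code), but I am not sure whether that holds for other possible tests.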
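
The NA behaviour described in the last comment can be checked with a small sketch. The vectors are made up for illustration, and `drop_na_then` is a hypothetical helper, not part of any package; it removes NAs explicitly so that the result does not depend on how a particular test function treats missing values.

```r
# Made-up vectors: x is padded with NA, as happens after pivoting
# groups of unequal size into side-by-side columns
x <- c(0.86, 0.54, 0.94, NA, NA)
y <- c(0.06, 0.88, 0.62, 0.14, 0.80)

# wilcox.test() drops the NA values itself ...
p_with_na <- wilcox.test(x, y)$p.value

# ... so the result matches a call on the NA-free vector
p_clean <- wilcox.test(x[!is.na(x)], y)$p.value

# Hypothetical helper: strip NAs before calling any two-sample test
drop_na_then <- function(test_fun, x, y, ...) {
  test_fun(x[!is.na(x)], y[!is.na(y)], ...)
}
p_explicit <- drop_na_then(wilcox.test, x, y)$p.value

all.equal(p_with_na, p_clean)
all.equal(p_explicit, p_clean)
```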