Multiple tests with pairwise combinations using dplyr/tidyverse: answers

【Question title】: Multiple tests with pairwise combinations using dplyr/tidyverse
【Posted】: 2019-05-29 13:25:09
【Question description】:

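My question is related to this one, but with a more complex example: I want to statistically compare multiple columns in all combinations, and each column has a different number of samples.

Consider the raw data: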

# A tibble: 51 x 3
   trial person score
   <chr> <chr>  <dbl>
 1 foo   a      0.266
 2 bar   b      0.372
 3 foo   c      0.573
 4 bar   a      0.908
 5 foo   b      0.202
 6 bar   c      0.898
 7 foo   a      0.945
 8 bar   b      0.661
 9 foo   c      0.629
10 foo   b      0.206

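For each trial type, I want to run a statistical test that compares the scores of each person. So I need the following test results:

Trial foo: compare all score samples for persons A-B, B-C, C-A
Trial bar: compare all score samples for persons A-B, B-C, C-A

Of course, there are more than two trials and more than three persons.

Therefore, the solution given in the other question, which uses group_split, does not work, because it always tests against the first person (in my case) rather than all pairwise combinations.

So, in the code below, I am stuck at two points: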

library(tidyverse)
#> Registered S3 methods overwritten by 'ggplot2':
#>   method         from 
#>   [.quosures     rlang
#>   c.quosures     rlang
#>   print.quosures rlang
library(broom)

set.seed(1)

df = tibble::tibble(
    trial = rep(c("foo", "bar"), 30),
    person = rep(c("a", "b", "c"), 20),
    score = runif(60)
  ) %>% 
  filter(score > 0.2)

df %>% 
  group_by(person, trial) %>% 
  summarize(scores = list(score)) %>% 
  spread(person, scores) %>%
  group_split(trial) %>% 
  map_df(function(data) {
    data %>% 
      summarize_at(vars(b:c), function(x) {
        wilcox.test(.$a, x, paired = FALSE) %>% broom::tidy
      })
  })
#> Error in wilcox.test.default(.$a, x, paired = FALSE): 'x' must be numeric

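Created on 2019-05-29 by the reprex package (v0.3.0)

The value of x is evidently not just the actual list of scores but a column vector holding the scores for the single trial. But I don't know how else to handle the fact that each person has a different number of samples.

Also, I still have to specify the column names manually, which becomes a combinatorial nightmare once there are four or more persons.

I can get the combinations in some way like this: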

df %>% 
  group_split(trial) %>% 
  map_df(function(data) {
    combinations = expand(tibble(x = unique(data$person), y = unique(data$person)), x, y) %>% filter(x != y)
  })

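…but this does not really help with creating the columns for the comparison.

What can I do to make this work?

【Question discussion】:

Before anyone mentions it: running many pairwise statistical comparisons requires an alpha correction.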
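As a rough illustration of the alpha correction mentioned in this comment, here is a minimal sketch using base R's p.adjust(). The three p-values are the ones reported for trial bar in the results table of this thread; treating the comparisons within one trial as a single family and choosing the Holm method are assumptions made for the example, not something the thread prescribes.

```r
# p-values for trial "bar", copied from the results table in this thread
p_raw <- c(a_b = 0.878, a_c = 1, b_c = 0.959)

# Holm correction across the three comparisons of this trial
# (the choice of method and of the "family" is an assumption)
p_holm <- p.adjust(p_raw, method = "holm")

print(p_holm)
```

In a tidyverse pipeline the same call could be applied per trial, for example inside group_by(trial) followed by mutate().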

Tags: r dplyr

【Solution 1】:

This lets you specify the combinations programmatically and resolves the error you were getting in wilcox.test().

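# Build every unordered pair of persons, named like "a_b"
combos <- unique(df$person) %>%
  combn(2, simplify = FALSE) %>%
  set_names(map_chr(., ~ paste(., collapse = "_")))

# One data frame per trial, named after its trial
df %>% 
  group_split(trial) %>%
  set_names(map_chr(., ~ unique(.$trial))) %>% 
  map_df(function(x) {
    # For each pair, keep only those two persons and test score by person
    map_df(combos, function(y) {
      filter(x, person %in% y) %>% 
        wilcox.test(score ~ person, data = .) %>% 
        broom::tidy()
    }, .id = "contrast")
  }, .id = "trial")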

# A tibble: 6 x 6
  trial contrast statistic p.value method                 alternative
  <chr> <chr>        <dbl>   <dbl> <chr>                  <chr>      
1 bar   a_b             34   0.878 Wilcoxon rank sum test two.sided  
2 bar   a_c             32   1     Wilcoxon rank sum test two.sided  
3 bar   b_c             31   0.959 Wilcoxon rank sum test two.sided  
4 foo   a_b             41   1     Wilcoxon rank sum test two.sided  
5 foo   a_c             41   1     Wilcoxon rank sum test two.sided  
6 foo   b_c             43   0.863 Wilcoxon rank sum test two.sided

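Since this is quite different from the pattern you started with, I am not sure whether it will work for your actual case, but it works here, so I thought I would share it.

【Discussion】:

Interesting approach, thanks! As you mention, I will have to adapt it to my real case. There I may need more than one grouping variable, so it could be group_split(trial, region). How would I then set the names of the split data frames?
Hm, I see there is a workaround here.
I would use unite() to build a single unique grouping variable before going into group_split(), then use separate() after the tests are done to split it back into the original variables.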
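The unite()/separate() suggestion in the last comment could look roughly like the sketch below. It reuses the pattern from Solution 1. The data frame df2, the region column and its values, the helper column name group, and the "__" separator are all hypothetical and only serve the illustration.

```r
library(tidyverse)
library(broom)

set.seed(1)

# Hypothetical data: `region` is an assumed second grouping variable
df2 <- tibble(
  trial  = rep(c("foo", "bar"), 60),
  region = rep(c("north", "south"), each = 60),
  person = rep(c("a", "b", "c"), 40),
  score  = runif(120)
)

# Every unordered pair of persons, named like "a_b"
combos <- unique(df2$person) %>%
  combn(2, simplify = FALSE) %>%
  set_names(map_chr(., ~ paste(., collapse = "_")))

results <- df2 %>%
  # Collapse both grouping variables into one key, e.g. "foo__north"
  unite("group", trial, region, sep = "__") %>%
  group_split(group) %>%
  set_names(map_chr(., ~ unique(.$group))) %>%
  map_df(function(x) {
    map_df(combos, function(y) {
      filter(x, person %in% y) %>%
        wilcox.test(score ~ person, data = .) %>%
        broom::tidy()
    }, .id = "contrast")
  }, .id = "group") %>%
  # Split the key back into the original variables
  separate(group, into = c("trial", "region"), sep = "__")

results
```

The separator has to be a string that cannot occur inside the grouping values themselves, otherwise separate() would split in the wrong place.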

【Solution 2】:

Here is an alternative solution that uses nesting to handle groups (persons) with different numbers of measurements.

library("broom")
library("tidyverse")

set.seed(1)

df <-
  tibble(
    trial = rep(c("foo", "bar"), 30),
    person = rep(c("a", "b", "c"), 20),
    score = runif(60)
  ) %>%
  filter(score > 0.2)

comparisons <- df %>%
  expand(
    trial,
    group1 = person,
    group2 = person
  ) %>%
  filter(
    group1 < group2
  )
comparisons
#> # A tibble: 6 × 3
#>   trial group1 group2
#>   <chr> <chr>  <chr> 
#> 1 bar   a      b     
#> 2 bar   a      c     
#> 3 bar   b      c     
#> 4 foo   a      b     
#> 5 foo   a      c     
#> 6 foo   b      c

df <- df %>% nest_by(trial, person)
df
#> # A tibble: 6 × 3
#> # Rowwise:  trial, person
#>   trial person               data
#>   <chr> <chr>  <list<tibble[,1]>>
#> 1 bar   a                 [8 × 1]
#> 2 bar   b                 [8 × 1]
#> 3 bar   c                 [8 × 1]
#> 4 foo   a                 [9 × 1]
#> 5 foo   b                 [9 × 1]
#> 6 foo   c                 [9 × 1]

comparisons %>%
  inner_join(
    df, by = c("trial", "group1" = "person")
  ) %>%
  inner_join(
    df, by = c("trial", "group2" = "person")
  ) %>%
  mutate(
    p.value = map2_dbl(
      data.x, data.y, ~ wilcox.test(.x$score, .y$score)$p.value
    )
  )
#> # A tibble: 6 × 6
#>   trial group1 group2             data.x             data.y p.value
#>   <chr> <chr>  <chr>  <list<tibble[,1]>> <list<tibble[,1]>>   <dbl>
#> 1 bar   a      b                 [8 × 1]            [8 × 1]   0.878
#> 2 bar   a      c                 [8 × 1]            [8 × 1]   1    
#> 3 bar   b      c                 [8 × 1]            [8 × 1]   0.959
#> 4 foo   a      b                 [9 × 1]            [9 × 1]   1    
#> 5 foo   a      c                 [9 × 1]            [9 × 1]   1    
#> 6 foo   b      c                 [9 × 1]            [9 × 1]   0.863

Created on 2022-03-17 by the reprex package (v2.0.1)

【Discussion】: