【Question title】: Doing chisq.test on data frame for multiple pairwise comparisons
【Posted】: 2017-09-21 10:47:40
【Question】:

I have the following data frame:

species <- c("a","a","a","b","b","b","c","c","c","d","d","d","e","e","e","f","f","f","g","h","h","h","i","i","i")
category <- c("h","l","m","h","l","m","h","l","m","h","l","m","h","l","m","h","l","m","l","h","l","m","h","l","m")
minus <- c(31,14,260,100,70,200,91,152,842,16,25,75,60,97,300,125,80,701,104,70,7,124,24,47,251)
plus <- c(2,0,5,0,1,1,4,4,30,1,0,0,2,0,5,0,0,3,0,0,0,0,0,0,4)
df <- cbind(species, category, minus, plus)
df<-as.data.frame(df)

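I want to run a chisq.test for every pair of categories within each species, like this:

Species a, categories h and l: p-value

Species a, categories h and m: p-value

Species a, categories l and m: p-value

Species b, ... and so on

using the following chisq.test (dummy code):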

chisq.test(c(minus(cat1, cat2),plus(cat1, cat2)))$p.value

I would like to end up with a table showing the chisq.test p-value for each comparison, like this:

Species   Category1  Category2   p-value
a         h          l           0.05
a         h          m           0.2
a         l          m           0.1
b...

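where Category1 and Category2 are the categories being compared in the chisq.test.

Can this be done with dplyr? I have tried adapting the approaches mentioned here and here, but as far as I can tell they do not really apply to this problem.

Edit: I would also like to see how this can be done for the following dataset: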

species <- c(1:11)
minus <- c(132,78,254,12,45,76,89,90,100,42,120)
plus <- c(1,2,0,0,0,3,2,5,6,4,0)

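I want to run a chisq.test comparing each species in the table with every other species (pairwise comparisons between all species). I would like a result like this: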

species1  species2  p-value
1         2         0.5
1         3         0.7
1         4         0.2
...
11        10        0.02

I tried changing the code from the answer to the following:

species_chisq %>%
do(data_frame(species1 = first(.$species),
            species2 = last(.$species),
            data = list(matrix(c(.$minus, .$plus), ncol = 2)))) %>%
mutate(chi_test = map(data, chisq.test, correct = FALSE)) %>%
mutate(p.value = map_dbl(chi_test, "p.value")) %>%
ungroup() %>%
select(species1, species2, p.value)

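However, this only produces a table in which each species is compared with itself, not with the other species. I don't quite understand where the original code given by @ycw specifies which comparisons are made.

Edit 2:

I managed to do this with code I found here.

【Discussion】: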

    Tags: r dataframe chi-squared


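    【Solution 1】:

    A solution using dplyr and purrr. Note that I am not familiar with the chi-squared test, but I followed the form you specified under @Vincent Bonhomme's post: chisq.test(test, correct = FALSE)

    Also, to create the example data frame there is no need for cbind; data.frame alone is enough. stringsAsFactors = FALSE is important to keep the columns from becoming factors (it has been the default since R 4.0.0).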

    # Create example data frame
    species <- c("a","a","a","b","b","b","c","c","c","d","d","d","e","e","e","f","f","f","g","h","h","h","i","i","i")
    category <- c("h","l","m","h","l","m","h","l","m","h","l","m","h","l","m","h","l","m","l","h","l","m","h","l","m")
    minus <- c(31,14,260,100,70,200,91,152,842,16,25,75,60,97,300,125,80,701,104,70,7,124,24,47,251)
    plus <- c(2,0,5,0,1,1,4,4,30,1,0,0,2,0,5,0,0,3,0,0,0,0,0,0,4)
    df <- data.frame(species, category, minus, plus, stringsAsFactors = FALSE)
    
    # Load packages
    library(dplyr)
    library(purrr)
    
    # Process the data
    df2 <- df %>%
      group_by(species) %>%
      slice(c(1, 2, 1, 3, 2, 3)) %>%
      mutate(test = rep(1:(n()/2), each = 2)) %>%
      group_by(species, test) %>%
      do(data_frame(species = first(.$species),
                    test = first(.$test[1]),
                    category1 = first(.$category),
                    category2 = last(.$category),
                    data = list(matrix(c(.$minus, .$plus), ncol = 2)))) %>%
      mutate(chi_test = map(data, chisq.test, correct = FALSE)) %>%
      mutate(p.value = map_dbl(chi_test, "p.value")) %>%
      ungroup() %>%
      select(species, category1, category2, p.value)
    
    df2
    # A tibble: 25 x 4
       species category1 category2   p.value
         <chr>     <chr>     <chr>     <dbl>
     1       a         h         l 0.3465104
     2       a         h         m 0.1354680
     3       a         l         m 0.6040227
     4       b         h         l 0.2339414
     5       b         h         m 0.4798647
     6       b         l         m 0.4399181
     7       c         h         l 0.4714005
     8       c         h         m 0.6987413
     9       c         l         m 0.5729834
    10       d         h         l 0.2196806
    # ... with 15 more rows
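The index vector in slice(c(1, 2, 1, 3, 2, 3)) is what decides which rows get compared. As a minimal sketch (base R only, not part of the original answer), the same indices can be generated with combn(), which makes clear that they are simply all unordered pairs among three rows, laid out two rows per test:

```r
# All unordered pairs among three rows; each column of the matrix is one pair.
pairs <- combn(3, 2)        # columns: (1,2), (1,3), (2,3)

# Flatten column by column to get the row order used by slice().
idx <- as.vector(pairs)
print(idx)

# Each consecutive pair of rows shares one test id, as in the mutate() step.
test_id <- rep(seq_len(length(idx) / 2), each = 2)
print(data.frame(row = idx, test = test_id))
```

Note that species g has only one row in the example data. slice() silently ignores indices beyond the number of rows in a group, so that species appears to end up paired with itself, which would explain why the result has 25 rows rather than 24.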
    

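    【Discussion】:

    • Could you explain what the slice(c(1,2,1,3,2,3)) command does? I don't understand the ?slice help page.
    • I am also trying to make this work for a data frame that only has species, where I do pairwise comparisons between all species (no categories this time). One column is species, one is minus and one is plus (similar to the above). If I want to compare every species with every other and build a table of values like the one above, how would you change the code? Since I am not sure how slice() works, I find it hard to change!
    • slice selects rows by index number. So slice(c(1, 2, 1, 3, 2, 3)) means: take rows one, two, one, three, two, three.
    • Thanks! I tried changing it to cover just what I asked in my comment above, but I did not succeed. Do you have a solution for that?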
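For the species-only data from the question, one possible approach (a sketch in base R, not the code the asker eventually found) is to enumerate the species pairs with combn() and run chisq.test on the 2x2 table of each pair. The names sp, pairs and result are illustrative only.

```r
# Species-only data from the question.
species <- 1:11
minus <- c(132, 78, 254, 12, 45, 76, 89, 90, 100, 42, 120)
plus  <- c(1, 2, 0, 0, 0, 3, 2, 5, 6, 4, 0)
sp <- data.frame(species, minus, plus)

# One row per unordered pair of row indices: (1,2), (1,3), ..., (10,11).
pairs <- t(combn(nrow(sp), 2))

# For each pair, build a 2x2 table (rows = species, columns = minus/plus)
# and extract the p-value. Warnings about small expected counts are suppressed.
p_values <- apply(pairs, 1, function(ij) {
  tab <- as.matrix(sp[ij, c("minus", "plus")])
  suppressWarnings(chisq.test(tab, correct = FALSE)$p.value)
})

result <- data.frame(species1 = sp$species[pairs[, 1]],
                     species2 = sp$species[pairs[, 2]],
                     p.value  = p_values)
print(head(result))
```

This lists each pair once (1 vs 2, but not 2 vs 1). Pairs in which neither species has any plus counts have zero expected counts in that column, so their p-value may come out as NaN.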
    【Solution 2】:

    First, you should create your data frame with data.frame directly; otherwise the minus and plus columns will become factors.

    species <- c("a","a","a","b","b","b","c","c","c","d","d","d","e","e","e","f","f","f","g","h","h","h","i","i","i")
    category <- c("h","l","m","h","l","m","h","l","m","h","l","m","h","l","m","h","l","m","l","h","l","m","h","l","m")
    minus <- c(31,14,260,100,70,200,91,152,842,16,25,75,60,97,300,125,80,701,104,70,7,124,24,47,251)
    plus <- c(2,0,5,0,1,1,4,4,30,1,0,0,2,0,5,0,0,3,0,0,0,0,0,0,4)
    df <- data.frame(species=species, category=category, minus=minus, plus=plus)
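The reason cbind causes trouble can be checked directly. A minimal sketch with three rows of the example data: cbind on a mix of character and numeric vectors produces a character matrix, so the counts are no longer numeric after as.data.frame (factors in older R versions, character strings since R 4.0.0).

```r
species <- c("a", "a", "a")
minus <- c(31, 14, 260)
plus <- c(2, 0, 5)

# cbind() builds a matrix, and a matrix holds a single type: here, character.
m <- cbind(species, minus, plus)
print(is.character(m))           # TRUE

# The counts are therefore not numeric after conversion.
bad <- as.data.frame(m)
print(is.numeric(bad$minus))     # FALSE

# data.frame() keeps each column's own type.
good <- data.frame(species, minus, plus)
print(is.numeric(good$minus))    # TRUE
```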
    

    Then, I am not sure there is a pure dplyr way to do this (I would be glad to be shown otherwise), but I think this is a partly dplyr way of doing it:

    df_combinations <-
      # create a df with all interactions
      expand.grid(df$species, df$category, df$category) %>% 
      # rename columns
      `colnames<-`(c("species", "category1", "category2")) %>% 
      # 3 lines below:
      # manage to only retain within a species, category(1 and 2) columns
      # with different values
      unique %>% 
      group_by(species) %>% 
      filter(category1 != category2) %>% 
      # cosmetics
      arrange(species, category1, category2) %>%
      ungroup() %>% 
      # prepare an empty column
      mutate(p.value=NA)
    
    # now we loop to fill your result data.frame
    for (i in 1:nrow(df_combinations)){
      # filter appropriate lines
      cat1 <- filter(df,
                     species==df_combinations$species[i],
                     category==df_combinations$category1[i])
      cat2 <- filter(df,
                     species==df_combinations$species[i],
                     category==df_combinations$category2[i])
      # calculate the chisq.test and assign its p-value to the right line
      df_combinations$p.value[i] <- chisq.test(c(cat1$minus, cat2$minus,
                                                 cat1$plus, cat2$plus))$p.value  
    
    }
    

    Let's look at the resulting data.frame:

    head(df_combinations)
    # A tibble: 6 x 4
    # Groups:   species [1]
    species category1 category2       p.value
    <fctr>    <fctr>    <fctr>         <dbl>
    1       a         h         l  3.290167e-11
    2       a         h         m 1.225872e-134
    3       a         l         h  3.290167e-11
    4       a         l         m 5.824842e-150
    5       a         m         h 1.225872e-134
    6       a         m         l 5.824842e-150
    

    Checking the first row: chisq.test(c(31, 14, 2, 0))$p.value returns [1] 3.290167e-11

    Is this what you want?
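The p-values here differ from those in Solution 1 because chisq.test behaves differently for a vector and for a matrix. Given a plain vector it runs a goodness-of-fit test against equal probabilities; given a matrix it tests independence in the contingency table. A short sketch for species a, categories h and l, using the numbers shown in the two answers:

```r
# minus(h), minus(l), plus(h), plus(l) for species a
counts <- c(31, 14, 2, 0)

# Vector input: goodness-of-fit test (are all four counts equally likely?).
gof <- suppressWarnings(chisq.test(counts))
print(gof$p.value)   # the 3.290167e-11 shown in this answer

# Matrix input: test of independence between category and minus/plus.
tab <- matrix(counts, ncol = 2,
              dimnames = list(category = c("h", "l"),
                              count = c("minus", "plus")))
ind <- suppressWarnings(chisq.test(tab, correct = FALSE))
print(ind$p.value)   # the 0.3465104 shown in Solution 1
```

If the goal is to compare the minus/plus proportions between two categories, the matrix form is probably the one that matches the question.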

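    【Discussion】:

    • Good suggestion! This is indeed what I was looking for. However, when I run a single chisq.test on each category pair separately, I don't seem to get the same p-values, and almost every combination here is significant! Can you think of a reason?
    • Can you give a (real) example of the chisq.test you want?
    • Don't worry, got it. cbind followed by as.data.frame converts the numeric values to factors. I will edit my answer.
    • This is how I usually do it: test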