【Title】: How to extract column names based on a value in an output column and obtain counts
【Posted】: 2020-07-09 08:24:03
【Question】:

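I have a question about data frame manipulation in R: I want to extract the names of the columns that hold a given value into a comma-separated output column, and obtain the counts.

I have an input file with genes in column A and literature IDs in the remaining columns (an example input is shown below). For each gene, I want to collect all literature IDs that have value = 1 into an Output column and put the number of those IDs into a Counts column (an example output is shown below). After that, I will merge this output data frame with my list of genes of interest using the merge function. Please help me with this.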

Input_data <- read.csv(file = "./Input.csv", stringsAsFactors = FALSE, check.names = FALSE)
Output_data <- read.csv(file = "./Output.csv", stringsAsFactors = FALSE, check.names = FALSE)
Genes <- read.csv(file = "./Genes.csv", stringsAsFactors = FALSE, check.names = FALSE)

Merge_data <- merge(Output_data, Genes, by = "Genes")


Input_data

dput(Input_data)
structure(list(Genes = c("Gene_A", "Gene_B", "Gene_C", "Gene_D", 
"Gene_E", "Gene_F", "Gene_G", "Gene_H", "Gene_I", "Gene_J", "Gene_K", 
"Gene_L", "Gene_M"), `20706538` = c(0L, 1L, 1L, 1L, 0L, 1L, 1L, 
1L, 0L, 0L, 0L, 0L, 0L), `14557386` = c(0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L), `22999554` = c(0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), `21906313` = c(1L, 1L, 1L, 1L, 
0L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 1L), `25229268` = c(1L, 1L, 1L, 
0L, 0L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 0L), `22633082` = c(0L, 1L, 
1L, 1L, 0L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 0L), `19228761` = c(1L, 
1L, 1L, 1L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L), `19543402` = c(0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), `26955776` = c(1L, 
1L, 1L, 1L, 0L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 0L), `21126355` = c(1L, 
1L, 1L, 1L, 0L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 1L)), class = "data.frame", row.names = c(NA, 
-13L))


Output_data

dput(Output_data)
structure(list(Genes = c("Gene_A", "Gene_B", "Gene_C", "Gene_D", 
"Gene_E", "Gene_F", "Gene_G", "Gene_H", "Gene_I", "Gene_J", "Gene_K", 
"Gene_L", "Gene_M"), Output = c("21906313, 25229268, 19228761, 26955776, 21126355", 
"20706538, 21906313, 25229268, 22633082, 19228761, 26955776, 21126355", 
"20706538, 21906313, 25229268, 22633082, 19228761, 26955776, 21126355", 
"20706538, 21906313, 22633082, 19228761, 26955776, 21126355", 
"", "20706538, 21906313, 25229268, 22633082, 26955776, 21126355", 
"20706538, 21906313, 25229268, 22633082, 19228761, 26955776, 21126355", 
"20706538, 21906313, 25229268, 22633082, 26955776, 21126355", 
"", "", "", "", "21906313, 21126355"), Counts = c(5L, 7L, 7L, 
6L, 0L, 6L, 7L, 6L, 0L, 0L, 0L, 0L, 2L)), class = "data.frame", row.names = c(NA, 
-13L))

Genes
dput(Genes)
structure(list(Genes = c("Gene_A", "Gene_B", "Gene_C", "Gene_D", 
"Gene_E", "Gene_F", "Gene_G", "Gene_H", "Gene_I", "Gene_J", "Gene_K", 
"Gene_L", "Gene_M", "Gene_N", "Gene_O", "Gene_P", "Gene_Q", "Gene_R", 
"Gene_S", "Gene_T", "Gene_U", "Gene_V", "Gene_W")), class = "data.frame", row.names = c(NA, 
-23L))

【Comments】:

    Tags: r dataframe merge dplyr tidyr


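    【Solution 1】:

    Your data are in wide format, meaning that one row/observation holds several values. They are easier to work with in long format, where each row holds a single value. Have a look at tidy data.

    My solution is very similar to @Ric S's, but instead of mutate I use summarise, which is designed for the case where you want exactly one entry per level of the grouping variable: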
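To make the wide/long distinction concrete, here is a minimal sketch on a made-up two-gene data frame (the IDs `111` and `222` are hypothetical, not taken from the question). It only illustrates what pivot_longer does to the shape of the data.

```r
library(tidyr)

# Hypothetical toy data in wide format: one column per literature ID
wide <- data.frame(
  Genes = c("Gene_A", "Gene_B"),
  `111` = c(1L, 0L),
  `222` = c(1L, 1L),
  check.names = FALSE
)

# Long format: one row per gene/literature-ID pair
long <- pivot_longer(
  wide,
  -Genes,
  names_to = "literature_id",
  values_to = "is_contained"
)

long
```

Each gene now occupies one row per literature ID, so filtering on `is_contained == 1` and grouping by `Genes` become ordinary row operations.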

    Input_data <- structure(list(
      Genes = c("Gene_A", "Gene_B", "Gene_C", "Gene_D", "Gene_E", "Gene_F",
                "Gene_G", "Gene_H", "Gene_I", "Gene_J", "Gene_K", "Gene_L",
                "Gene_M"),
      `20706538` = c(0L, 1L, 1L, 1L, 0L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 0L),
      `14557386` = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L),
      `22999554` = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L),
      `21906313` = c(1L, 1L, 1L, 1L, 0L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 1L),
      `25229268` = c(1L, 1L, 1L, 0L, 0L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 0L),
      `22633082` = c(0L, 1L, 1L, 1L, 0L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 0L),
      `19228761` = c(1L, 1L, 1L, 1L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L),
      `19543402` = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L),
      `26955776` = c(1L, 1L, 1L, 1L, 0L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 0L),
      `21126355` = c(1L, 1L, 1L, 1L, 0L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 1L)
    ), class = "data.frame", row.names = c(NA, -13L))
    
    Genes <- structure(list(
      Genes = c("Gene_A", "Gene_B", "Gene_C", "Gene_D", "Gene_E", "Gene_F",
                "Gene_G", "Gene_H", "Gene_I", "Gene_J", "Gene_K", "Gene_L",
                "Gene_M", "Gene_N", "Gene_O", "Gene_P", "Gene_Q", "Gene_R",
                "Gene_S", "Gene_T", "Gene_U", "Gene_V", "Gene_W")
    ), class = "data.frame", row.names = c(NA, -23L))
    
    library(dplyr)
    library(tidyr)
    
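    summary_data <- Input_data %>% 
      # wide -> long: one row per gene/literature-ID pair
      pivot_longer(-Genes, values_to = "is_contained", names_to = "literature_id") %>% 
      group_by(Genes) %>% 
      # keep only the IDs that are flagged with 1
      filter(is_contained == 1) %>% 
      # one row per gene: comma-separated IDs and their number
      summarise(Output = paste0(literature_id, collapse = ", "),
                Counts = n()) %>% 
      # bring back genes without any hit (they get NA here)
      right_join(Genes, by = "Genes") %>% 
      # replace the NAs with an empty string / zero
      mutate(Output = if_else(is.na(Output),
                              "",
                              Output),
             Counts = if_else(is.na(Counts),
                              0L,
                              Counts))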
    
    summary_data
    # A tibble: 23 x 3
       Genes  Output                                                                 Counts
       <chr>  <chr>                                                                   <int>
     1 Gene_A "21906313, 25229268, 19228761, 26955776, 21126355"                          5
     2 Gene_B "20706538, 21906313, 25229268, 22633082, 19228761, 26955776, 21126355"      7
     3 Gene_C "20706538, 21906313, 25229268, 22633082, 19228761, 26955776, 21126355"      7
     4 Gene_D "20706538, 21906313, 22633082, 19228761, 26955776, 21126355"                6
     5 Gene_E ""                                                                          0
     6 Gene_F "20706538, 21906313, 25229268, 22633082, 26955776, 21126355"                6
     7 Gene_G "20706538, 21906313, 25229268, 22633082, 19228761, 26955776, 21126355"      7
     8 Gene_H "20706538, 21906313, 25229268, 22633082, 26955776, 21126355"                6
     9 Gene_I ""                                                                          0
    10 Gene_J ""                                                                          0
    # ... with 13 more rows
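The difference between summarise and mutate mentioned above can be seen on a small made-up long table (the data below are hypothetical). This is only a sketch of the general behaviour: mutate keeps every input row, while summarise collapses each group to a single row.

```r
library(dplyr)

# Hypothetical long-format data: Gene_A has two hits, Gene_B has one
long <- data.frame(
  Genes = c("Gene_A", "Gene_A", "Gene_B"),
  literature_id = c("111", "222", "222"),
  stringsAsFactors = FALSE
)

# mutate(): the count is repeated on every row of the group
with_mutate <- long %>%
  group_by(Genes) %>%
  mutate(Counts = n()) %>%
  ungroup()

# summarise(): exactly one row per gene
with_summarise <- long %>%
  group_by(Genes) %>%
  summarise(Output = paste(literature_id, collapse = ", "),
            Counts = n())

nrow(with_mutate)     # 3 rows, one per input row
nrow(with_summarise)  # 2 rows, one per gene
```

With mutate, an extra select() plus distinct() step is needed to get down to one row per gene; summarise produces that shape directly.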
    

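    【Comments】:

    • Your solution helped me better understand the advantage of using summarise instead of mutate in this case, thanks @starja! +1
    • @starja, this helped with my large-scale data. I found that there are hundreds of duplicated genes carrying the same information. How can I extract only the unique genes in the column instead of the duplicates? Thank you, Tufik
    • Try summary_data %>% distinct(Genes, .keep_all = TRUE)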
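As a small illustration of the distinct() suggestion above, here is a sketch on made-up data containing a duplicated gene (the values are hypothetical). With `.keep_all = TRUE`, distinct() keeps the first row it encounters for each gene along with all other columns.

```r
library(dplyr)

# Hypothetical summary with Gene_A listed twice
dup <- data.frame(
  Genes = c("Gene_A", "Gene_A", "Gene_B"),
  Counts = c(5L, 5L, 7L),
  stringsAsFactors = FALSE
)

# Keep one row per gene, retaining the remaining columns
unique_genes <- dup %>% distinct(Genes, .keep_all = TRUE)

unique_genes
```

If the duplicated rows could differ in their other columns, it may be worth checking which row comes first, since that is the one that survives.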
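    【Solution 2】:

    Here is a possible solution using the packages tidyr and dplyr.

    Basically, we first make sure your data are tidy, i.e. in a form that is easier to work with, using the pivot_longer function; then we apply fairly standard dplyr statements to build the desired output. If you are not familiar with them, I suggest running the pipeline one step at a time to see what each step does.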

    library(tidyr)
    library(dplyr)
    
    Input_data %>% 
      pivot_longer(-Genes, names_to = "num", values_to = "value") %>%
      group_by(Genes) %>% 
      mutate(
        Output = paste(num[value == 1], collapse = ", "),
        Counts = sum(value == 1)
        ) %>% 
      select(-c(num, value)) %>% 
      distinct() %>% 
      right_join(Genes, by = "Genes")
    

    Output

    # A tibble: 23 x 3
    # Groups:   Genes [23]
    #    Genes  Output                                                                 Counts
    #    <chr>  <chr>                                                                  <int>
    #  1 Gene_A "21906313, 25229268, 19228761, 26955776, 21126355"                         5
    #  2 Gene_B "20706538, 21906313, 25229268, 22633082, 19228761, 26955776, 21126355"     7
    #  3 Gene_C "20706538, 21906313, 25229268, 22633082, 19228761, 26955776, 21126355"     7
    #  4 Gene_D "20706538, 21906313, 22633082, 19228761, 26955776, 21126355"               6
    #  5 Gene_E ""                                                                         0
    #  6 Gene_F "20706538, 21906313, 25229268, 22633082, 26955776, 21126355"               6
    #  7 Gene_G "20706538, 21906313, 25229268, 22633082, 19228761, 26955776, 21126355"     7
    #  8 Gene_H "20706538, 21906313, 25229268, 22633082, 26955776, 21126355"               6
    #  9 Gene_I ""                                                                         0
    # 10 Gene_J ""                                                                         0
    # ... with 13 more rows
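One thing the truncated printout does not show: genes that exist only in Genes (Gene_N to Gene_W) have no match on the left side of the right_join, so their Output and Counts come back as NA rather than "" and 0. A minimal sketch of one way to fill them, assuming tidyr's replace_na and using made-up data:

```r
library(dplyr)
library(tidyr)

# Hypothetical per-gene summary and a longer list of genes of interest
per_gene <- data.frame(
  Genes = "Gene_A",
  Output = "111, 222",
  Counts = 2L,
  stringsAsFactors = FALSE
)
genes_of_interest <- data.frame(
  Genes = c("Gene_A", "Gene_N"),
  stringsAsFactors = FALSE
)

joined <- per_gene %>%
  right_join(genes_of_interest, by = "Genes") %>%
  # unmatched genes get NA from the join; fill them explicitly
  replace_na(list(Output = "", Counts = 0L))

joined
```

The same replace_na() call could be appended to the end of the pipeline above if empty strings and zeros are preferred over NA.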
    

    【Comments】:

      【Solution 3】:

      Using data.table

      library(data.table)
      setDT(Genes)
      setDT(Input_data)
      
      Output_data <- 
        Input_data[, melt(.SD, id.vars = "Genes", variable.name = "id")
                   ][value == 1, .(Output = toString(id), Counts = .N), by = Genes
                     ][Genes, on = "Genes"
                       ][is.na(Counts), c("Output", "Counts") := .("", 0L)]
      
      #      Genes                                                               Output Counts
      #  1: Gene_A                     21906313, 25229268, 19228761, 26955776, 21126355      5
      #  2: Gene_B 20706538, 21906313, 25229268, 22633082, 19228761, 26955776, 21126355      7
      #  3: Gene_C 20706538, 21906313, 25229268, 22633082, 19228761, 26955776, 21126355      7
      #  4: Gene_D           20706538, 21906313, 22633082, 19228761, 26955776, 21126355      6
      #  5: Gene_E                                                                           0
      #  6: Gene_F           20706538, 21906313, 25229268, 22633082, 26955776, 21126355      6
      #  7: Gene_G 20706538, 21906313, 25229268, 22633082, 19228761, 26955776, 21126355      7
      #  8: Gene_H           20706538, 21906313, 25229268, 22633082, 26955776, 21126355      6
      #  9: Gene_I                                                                           0
      # 10: Gene_J                                                                           0
      # 11: Gene_K                                                                           0
      # 12: Gene_L                                                                           0
      # 13: Gene_M                                                   21906313, 21126355      2
      # 14: Gene_N                                                                           0
      # 15: Gene_O                                                                           0
      # 16: Gene_P                                                                           0
      # 17: Gene_Q                                                                           0
      # 18: Gene_R                                                                           0
      # 19: Gene_S                                                                           0
      # 20: Gene_T                                                                           0
      # 21: Gene_U                                                                           0
      # 22: Gene_V                                                                           0
      # 23: Gene_W                                                                           0
      #      Genes                                                               Output Counts
      

      【Comments】:
