How can I search and condense repetitive rows in a data frame in R? Answers

【Question title】: How can I search and condense repetitive rows in a data frame in R?
【Posted】: 2020-05-09 13:12:20
【Question】:

I'm working with RNA-seq data in R, which I'm new to. I'm using a reference data frame from BioMart, and it is laid out very awkwardly once GO terms are included (shown below).

head(goZref)
      Gene.stable.ID Transcript.stable.ID  Protein.stable.ID
1 ENSDARG00000063344   ENSDART00000131829 ENSDARP00000123357
2 ENSDARG00000063344   ENSDART00000131829 ENSDARP00000123357
3 ENSDARG00000063344   ENSDART00000144883 ENSDARP00000114467
4 ENSDARG00000063344   ENSDART00000144883 ENSDARP00000114467
5 ENSDARG00000097685   ENSDART00000156963 ENSDARP00000128236
6 ENSDARG00000097685   ENSDART00000156963 ENSDARP00000128236
                                                            Gene.description         Gene.name WikiGene.name
1 family with sequence similarity 162 member A [Source:NCBI gene;Acc:336363]           fam162a       fam162a
2 family with sequence similarity 162 member A [Source:NCBI gene;Acc:336363]           fam162a       fam162a
3 family with sequence similarity 162 member A [Source:NCBI gene;Acc:336363]           fam162a       fam162a
4 family with sequence similarity 162 member A [Source:NCBI gene;Acc:336363]           fam162a       fam162a
5                      si:ch211-235i11.3 [Source:ZFIN;Acc:ZDB-GENE-131125-9] si:ch211-235i11.3  LOC101885363
6                      si:ch211-235i11.3 [Source:ZFIN;Acc:ZDB-GENE-131125-9] si:ch211-235i11.3  LOC101885363
                                                       GO.term.name
1                                                          membrane
2                                    integral component of membrane
3                                                          membrane
4                                    integral component of membrane
5                                              nucleic acid binding
6 RNA polymerase II regulatory region sequence-specific DNA binding

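I want to annotate a data frame of genes of interest (the gene names are in a character vector called genes here), but given all the repetition and duplicated rows in the reference, I'm struggling to automate it. I've tried match, but it only finds the first instance, so I miss whatever is on the other rows. For example, I want to search for "fam162a" and get something like "membrane, integral component of membrane", and then automate that for a list of 100 gene names. subset is useful in that it gives me the multiple rows sharing a gene name identifier, and I tried passing that to ddply, but I don't really know what I'm doing and got stuck here: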

test<- ddply(.data = goZref, .variables = genes, for (x in genes) {
+ paste(unique(subset(goZref, WikiGene.name==x, select= Go.term.name)), sep = ",")})
Error in parse(text = x) : <text>:1:12: unexpected symbol
1: si:dkey-224k5.13
               ^
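
A note on the match behaviour described above, as a minimal sketch rather than a fix for the ddply call: match() returns only the position of the first hit, whereas a logical comparison keeps every matching row. The data frame ref below is a made-up stand-in for two columns of goZref. Note also that paste() needs collapse, not sep, to join the elements of a single vector into one string.

```r
# Toy stand-in for the WikiGene.name and GO.term.name columns of goZref
# (an assumption for illustration; the real reference has more columns).
ref <- data.frame(
  WikiGene.name = c("fam162a", "fam162a", "LOC101885363", "LOC101885363"),
  GO.term.name  = c("membrane",
                    "integral component of membrane",
                    "nucleic acid binding",
                    "RNA polymerase II regulatory region sequence-specific DNA binding"),
  stringsAsFactors = FALSE
)

# match() gives the index of the FIRST matching row only
first_hit <- match("fam162a", ref$WikiGene.name)
ref$GO.term.name[first_hit]

# which() on a logical comparison gives ALL matching rows
all_hits <- which(ref$WikiGene.name == "fam162a")

# collapse = "," joins the unique terms into one comma-separated string
go_terms <- paste(unique(ref$GO.term.name[all_hits]), collapse = ",")
go_terms
```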

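Edit: The output I want is something like a matrix of the 100 gene names I supply plus the corresponding information from all relevant rows of the GO term column (GO.term.name above). For example, if fam162a and LOC101885363 were genes in the list, the output would be: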

1 fam162a       membrane,integral component of membrane
2 LOC101885363  nucleic acid binding,RNA polymerase II regulatory region...

Any help is appreciated!

【Question comments】:

Please show your expected output.

Tags: r

【Solution 1】:

Here is a dplyr solution:

Data:

df <- data.frame(Gene.name = c("fam162a", "fam162a", "fam162a", "fam162a", "LOC101885363", "LOC101885363"),
                 Gene.info = c("membrane","integral component of membrane", "membrane", "integral component of membrane",
                               "nucleic acid binding", "RNA polymerase II regulatory region sequence-specific DNA binding"),
                 stringsAsFactors = FALSE)

Solution:

library(dplyr)

# For each gene, paste its unique annotations into one comma-separated string
df %>%
  group_by(Gene.name) %>%
  mutate(Gene.info.complete = paste0(unique(Gene.info), collapse = ","))

# A tibble: 6 x 3
# Groups:   Gene.name [2]
  Gene.name    Gene.info                                       Gene.info.complete                                             
  <chr>        <chr>                                           <chr>                                                          
1 fam162a      membrane                                        membrane,integral component of membrane                        
2 fam162a      integral component of membrane                  membrane,integral component of membrane                        
3 fam162a      membrane                                        membrane,integral component of membrane                        
4 fam162a      integral component of membrane                  membrane,integral component of membrane                        
5 LOC101885363 nucleic acid binding                            nucleic acid binding,RNA polymerase II regulatory region seque~
6 LOC101885363 RNA polymerase II regulatory region sequence-s~ nucleic acid binding,RNA polymerase II regulatory region seque~
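
The mutate() call above keeps all six rows and repeats the combined string on each. If one row per gene is wanted, as in the question's expected output, a possible variant (a sketch, not part of the original answer) swaps in summarise() and first filters to the genes of interest. The vector genes is a hypothetical stand-in for the asker's list of 100 names.

```r
library(dplyr)

df <- data.frame(
  Gene.name = c("fam162a", "fam162a", "fam162a", "fam162a",
                "LOC101885363", "LOC101885363"),
  Gene.info = c("membrane", "integral component of membrane",
                "membrane", "integral component of membrane",
                "nucleic acid binding",
                "RNA polymerase II regulatory region sequence-specific DNA binding"),
  stringsAsFactors = FALSE
)

# Hypothetical vector of gene names of interest
genes <- c("fam162a", "LOC101885363")

condensed <- df %>%
  filter(Gene.name %in% genes) %>%   # keep only the requested genes
  group_by(Gene.name) %>%
  # summarise() collapses each group to a single row
  summarise(Gene.info.complete = paste(unique(Gene.info), collapse = ","))

condensed
```

The result has one row per gene, with the unique annotations joined by commas.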

【Comments】:

【Solution 2】:

OK, I think you might be looking for something like this...

I made a minimal example dataset, test_data, containing a few genes and some annotations.

test_data=data.frame(gene_name=rep(c("gene1","gene2","gene3"),each=4),
    annotation=c("important","interesting","useful","cool","useless","unimportant","boring","dull","neutral","so-so","borderline","average"))

Suppose you have the names of your genes of interest in a vector:

gene_name_list=c("gene1","gene2")

We can use it to get all the annotations for each gene, separated by commas:

gene_annotations = sapply(gene_name_list,function(gene_name) paste( unique( test_data[test_data[,"gene_name"]==gene_name,"annotation"]), collapse="," ) )
gene_annotations
#                               gene1                               gene2 
# "important,interesting,useful,cool"   "useless,unimportant,boring,dull"
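
The sapply() call returns a named character vector. To get the two-column layout shown in the question's expected output, one option (a sketch building on the example above, not part of the original answer) is to move the names into their own column. The name annotation_table is made up for illustration.

```r
test_data <- data.frame(
  gene_name  = rep(c("gene1", "gene2", "gene3"), each = 4),
  annotation = c("important", "interesting", "useful", "cool",
                 "useless", "unimportant", "boring", "dull",
                 "neutral", "so-so", "borderline", "average"),
  stringsAsFactors = FALSE
)
gene_name_list <- c("gene1", "gene2")

# Named character vector: one comma-separated string per requested gene
gene_annotations <- sapply(gene_name_list, function(gene_name) {
  paste(unique(test_data$annotation[test_data$gene_name == gene_name]),
        collapse = ",")
})

# Turn the vector's names into a proper column
annotation_table <- data.frame(
  gene_name  = names(gene_annotations),
  annotation = unname(gene_annotations),
  stringsAsFactors = FALSE
)
annotation_table
```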

【Comments】: