【Question title】: How can I search and condense repetitive rows in a data frame in R?
【Posted】: 2020-05-09 13:12:20
【Question description】:

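I am working with RNA-seq data in R and am fairly new to it. I am using a data frame of reference data from BioMart, which is arranged very awkwardly once GO terms are included (shown below).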

head(goZref)
      Gene.stable.ID Transcript.stable.ID  Protein.stable.ID
1 ENSDARG00000063344   ENSDART00000131829 ENSDARP00000123357
2 ENSDARG00000063344   ENSDART00000131829 ENSDARP00000123357
3 ENSDARG00000063344   ENSDART00000144883 ENSDARP00000114467
4 ENSDARG00000063344   ENSDART00000144883 ENSDARP00000114467
5 ENSDARG00000097685   ENSDART00000156963 ENSDARP00000128236
6 ENSDARG00000097685   ENSDART00000156963 ENSDARP00000128236
                                                            Gene.description         Gene.name WikiGene.name
1 family with sequence similarity 162 member A [Source:NCBI gene;Acc:336363]           fam162a       fam162a
2 family with sequence similarity 162 member A [Source:NCBI gene;Acc:336363]           fam162a       fam162a
3 family with sequence similarity 162 member A [Source:NCBI gene;Acc:336363]           fam162a       fam162a
4 family with sequence similarity 162 member A [Source:NCBI gene;Acc:336363]           fam162a       fam162a
5                      si:ch211-235i11.3 [Source:ZFIN;Acc:ZDB-GENE-131125-9] si:ch211-235i11.3  LOC101885363
6                      si:ch211-235i11.3 [Source:ZFIN;Acc:ZDB-GENE-131125-9] si:ch211-235i11.3  LOC101885363
                                                       GO.term.name
1                                                          membrane
2                                    integral component of membrane
3                                                          membrane
4                                    integral component of membrane
5                                              nucleic acid binding
6 RNA polymerase II regulatory region sequence-specific DNA binding

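I want to annotate a data frame of genes of interest (the gene names are in a character vector called genes here), but given all the repetition and duplicated rows in the reference, I am struggling to automate this. I tried match, but it only finds the first instance, so I miss whatever is on the other rows. For example, I would like to search for "fam162a" and get something like "membrane, integral component of membrane", and then automate that for a list of 100 gene names. subset is useful because it gives me the multiple rows that share a gene name identifier, and I tried passing it to ddply, but I do not really know what I am doing and got stuck here: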

test<- ddply(.data = goZref, .variables = genes, for (x in genes) {
+ paste(unique(subset(goZref, WikiGene.name==x, select= Go.term.name)), sep = ",")})
Error in parse(text = x) : <text>:1:12: unexpected symbol
1: si:dkey-224k5.13
               ^

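Edit: The output I want is something like a matrix of the 100 gene names I feed in, each with the corresponding information from all relevant rows of the GO term column (GO.term.name above). For example, if fam162a and LOC101885363 were the genes in the list, the output would be: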

1 fam162a       membrane,integral component of membrane
2 LOC101885363  nucleic acid binding,RNA polymerase II regulatory region... 

Any help is appreciated!

【Comments】:

  • Please show your expected output

Tags: r


【Solution 1】:

Here is a dplyr solution:

Data:

df <- data.frame(Gene.name = c("fam162a", "fam162a", "fam162a", "fam162a", "LOC101885363", "LOC101885363"),
                 Gene.info = c("membrane","integral component of membrane", "membrane", "integral component of membrane",
                               "nucleic acid binding", "RNA polymerase II regulatory region sequence-specific DNA binding"),
                 stringsAsFactors = FALSE)

Solution:

library(dplyr)

df %>% 
  group_by(Gene.name) %>% 
  mutate(Gene.info.complete = paste0(unique(Gene.info), collapse = ","))

# A tibble: 6 x 3
# Groups:   Gene.name [2]
  Gene.name    Gene.info                                       Gene.info.complete                                             
  <chr>        <chr>                                           <chr>                                                          
1 fam162a      membrane                                        membrane,integral component of membrane                        
2 fam162a      integral component of membrane                  membrane,integral component of membrane                        
3 fam162a      membrane                                        membrane,integral component of membrane                        
4 fam162a      integral component of membrane                  membrane,integral component of membrane                        
5 LOC101885363 nucleic acid binding                            nucleic acid binding,RNA polymerase II regulatory region seque~
6 LOC101885363 RNA polymerase II regulatory region sequence-s~ nucleic acid binding,RNA polymerase II regulatory region seque~
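
The mutate() call above keeps all six rows. If one row per gene is wanted, as in the desired output shown in the question, a variant with summarise() may be closer. This is only a sketch: it reuses the df from this answer, and genes_of_interest is a hypothetical name standing in for the asker's vector of gene names.

```r
library(dplyr)

# Same example data as in the answer above
df <- data.frame(
  Gene.name = c("fam162a", "fam162a", "fam162a", "fam162a",
                "LOC101885363", "LOC101885363"),
  Gene.info = c("membrane", "integral component of membrane",
                "membrane", "integral component of membrane",
                "nucleic acid binding",
                "RNA polymerase II regulatory region sequence-specific DNA binding"),
  stringsAsFactors = FALSE
)

# Hypothetical vector of gene names to annotate (an assumption, not from the answer)
genes_of_interest <- c("fam162a", "LOC101885363")

condensed <- df %>%
  filter(Gene.name %in% genes_of_interest) %>%  # keep only the requested genes
  group_by(Gene.name) %>%
  # collapse the unique terms of each gene into one comma-separated string
  summarise(Gene.info.complete = paste(unique(Gene.info), collapse = ","))

condensed
```

Unlike match, %in% keeps every row whose gene name is in the vector, so no duplicated gene is missed before the terms are collapsed.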

【Comments】:

    【Solution 2】:

    OK, I think you may be looking for something like this...

    I made a minimal example dataset: test_data contains some genes and some annotations.

    test_data=data.frame(gene_name=rep(c("gene1","gene2","gene3"),each=4),
        annotation=c("important","interesting","useful","cool","useless","unimportant","boring","dull","neutral","so-so","borderline","average"))
    

    Suppose you have the list of gene names you are interested in stored in a vector:

    gene_name_list=c("gene1","gene2")
    

    We can use it to get all annotations for each gene, separated by commas:

    gene_annotations = sapply(gene_name_list,function(gene_name) paste( unique( test_data[test_data[,"gene_name"]==gene_name,"annotation"]), collapse="," ) )
    gene_annotations
    #                               gene1                               gene2 
    # "important,interesting,useful,cool"   "useless,unimportant,boring,dull" 
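
The sapply() call returns a named character vector. If a two-column table is preferred, as in the desired output in the question, base R's aggregate() is one possible alternative. This is a sketch under the same example data; subset_data and gene_table are names introduced here for illustration.

```r
# Same example data as in the answer above
test_data <- data.frame(
  gene_name = rep(c("gene1", "gene2", "gene3"), each = 4),
  annotation = c("important", "interesting", "useful", "cool",
                 "useless", "unimportant", "boring", "dull",
                 "neutral", "so-so", "borderline", "average"),
  stringsAsFactors = FALSE
)
gene_name_list <- c("gene1", "gene2")

# Keep only rows for the genes of interest (%in% matches every row, not just the first)
subset_data <- test_data[test_data$gene_name %in% gene_name_list, ]

# One row per gene, with its unique annotations joined by commas
gene_table <- aggregate(annotation ~ gene_name, data = subset_data,
                        FUN = function(x) paste(unique(x), collapse = ","))
gene_table
```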
    

    【Comments】:
