Get Gene IDs for a list of Gene Names in R

【Question title】: Get Gene IDs for a list of Gene Names in R
【Posted】: 2018-06-25 13:41:23
【Question description】:

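I have a huge list of gene names, and I would like to map the corresponding gene ID to each name. I tried the R library org.Hs.eg.db, but it returns more IDs than names, which makes it hard to line the results up with the input, especially when the list is long.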

Example of the input file (7 gene names):

RPS6KB2
PSME4
PDE4DIP
APMAP
TNRC18
PPP1R26
NAA20

The ideal output is (7 IDs):

Current output (8 IDs!):

6199
23198
9659
57136
27320 *undesired output ID*
84629
9858
51126

Any suggestions on how to solve this? Or is there another simple tool that can do the job (mapping gene IDs)?

This is the code I am using:

library("org.Hs.eg.db") #load the library

input <- read.csv("myfile.csv",TRUE,",") #read input file

GeneCol = as.character(input$Gene.name) #access the column that has gene names in my file

output = unlist(mget(x = GeneCol, envir = org.Hs.egALIAS2EG, ifnotfound=NA)) #get IDs

write.csv(output, file = "GeneIDs.csv") #write the list of IDs to a CSV file

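【Question discussion】:

What is your current code?
Where are these IDs supposed to come from? Do you have some kind of lookup table?
Added the code to the question.

Tags: r bioinformatics genetics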

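【Solution 1】:

Use mapIds() on your org.Hs.eg.db package. The reason you see 8 IDs is that the mapping from gene names to Entrez IDs is not always 1:1: your code looks the names up in the alias map (org.Hs.egALIAS2EG), where a single alias can point to more than one gene. You need to decide on a strategy for handling such multiple mappings. Also, please ask questions about Bioconductor packages on the Bioconductor support site, https://support.bioconductor.org.

Here is a complete example (note that it does not need your file 'myfile.csv' to run, so it is easy to reproduce):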

library(org.Hs.eg.db)
symbol <- c(
    "RPS6KB2", "PSME4", "PDE4DIP", "APMAP", "TNRC18",
    "PPP1R26", "NAA20"
)
mapIds(org.Hs.eg.db, symbol, "ENTREZID", "SYMBOL")

The output is

> mapIds(org.Hs.eg.db, symbol, "ENTREZID", "SYMBOL")
'select()' returned 1:1 mapping between keys and columns
RPS6KB2   PSME4 PDE4DIP   APMAP  TNRC18 PPP1R26   NAA20 
 "6199" "23198"  "9659" "57136" "84629"  "9858" "51126"
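
If the names in your file are aliases rather than official symbols, one possible way to make the multiple-mapping strategy explicit is the multiVals argument of mapIds(). The following is only a sketch under the assumption that the "ALIAS" key type is what you want to look up; the variable names all_ids, ambiguous and one_id are made up for illustration, and which names turn out to be ambiguous depends on the version of org.Hs.eg.db you have installed.

```r
library(org.Hs.eg.db)  # also attaches AnnotationDbi, which provides mapIds()

symbol <- c(
    "RPS6KB2", "PSME4", "PDE4DIP", "APMAP", "TNRC18",
    "PPP1R26", "NAA20"
)

# Keep every Entrez ID per name as a list, so ambiguous names stay visible.
all_ids <- mapIds(org.Hs.eg.db, keys = symbol, column = "ENTREZID",
                  keytype = "ALIAS", multiVals = "list")

# Names that map to more than one Entrez ID (may be empty).
ambiguous <- names(all_ids)[lengths(all_ids) > 1]
print(ambiguous)

# One value per input name: ambiguous names become NA instead of
# silently adding extra IDs, so the result has the same length as the input.
one_id <- mapIds(org.Hs.eg.db, keys = symbol, column = "ENTREZID",
                 keytype = "ALIAS", multiVals = "asNA")
print(one_id)
```

With multiVals = "first" (the default) you would instead get the first match for each name, which also keeps the output aligned with the input but hides the ambiguity.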

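Because the result is a named vector with exactly one entry per input name, it can be written out next to the names it belongs to. A minimal sketch in base R, reusing the values from the output above; the column names Gene.name and EntrezID and the temporary output path are just assumptions for illustration:

```r
# Named vector copied from the mapIds() output shown above.
ids <- c(RPS6KB2 = "6199", PSME4 = "23198", PDE4DIP = "9659",
         APMAP = "57136", TNRC18 = "84629", PPP1R26 = "9858",
         NAA20 = "51126")

# One row per gene name, with the name and its ID side by side.
result <- data.frame(
    Gene.name = names(ids),
    EntrezID = unname(ids),
    stringsAsFactors = FALSE
)

# Write to a temporary location; replace with your own path as needed.
out_file <- file.path(tempdir(), "GeneIDs.csv")
write.csv(result, file = out_file, row.names = FALSE)
```

Keeping the names in their own column avoids having to match IDs back to names by position afterwards.
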
【Discussion】: