【Question title】: Merge multiple files based on a column and print the nth column in R
【Posted】: 2018-08-28 20:24:59
【Question description】:

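I have 3 files. For each row of the first file, I need to match its value against the first column of file 2, take the corresponding alias from file2, match that alias against file3 (either the description or the Aliases column), and then print the OMIM ID.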

File1:

**Symbol**
MCL1
ABCB1
BAX
IKZF1
WWOX
BCL2L1
BCL2L11
CCND1
TNFSF10

File2:

**Symbol2   Aliases**
MCL1    MCL1, BCL2 family apoptosis regulator
ABCB1   ATP binding cassette subfamily B member 1
WWOX    WW domain containing oxidoreductase
BCL2L1  RB transcriptional corepressor 1
BOK     peroxisome proliferator activated receptor gamma
RHOA    ras homolog family member A
ABCC1   C-X-C motif chemokine ligand 12
PARP1   poly(ADP-ribose) polymerase 1
BAK1    BRCA1, DNA repair associated

File3:

**description   OMIM    Aliases**
MCL1, BCL2 family apoptosis regulator   159552  G protein subunit alpha 12
ATP binding cassette subfamily B member 1   171050  matrix metallopeptidase 9
BCL2 associated X, apoptosis regulator  600040  cadherin 1
IKAROS family zinc finger 1 603023  Janus kinase 2
WW domain containing oxidoreductase 605131  ataxin 3
BCL2 like 1 600039  RB transcriptional corepressor 1
BCL2 like 11    603827  transferrin receptor
cyclin D1   168461  C-C motif chemokine ligand 2
TNF superfamily member 10   603598  prostaglandin-endoperoxide synthase 2

Expected result:
**Symbol    Symbol1 description/Aliases OMIM**
MCL1    MCL1    MCL1, BCL2 family apoptosis regulator   159552
ABCB1   ABCB1   ATP binding cassette subfamily B member 1   171050
BAX         
IKZF1           
WWOX    WWOX    WW domain containing oxidoreductase 605131
BCL2L1  BCL2L1  RB transcriptional corepressor 1    600039
BCL2L11         
CCND1           
TNFSF10         

I tried merge and inner_join, but did not get the expected result. Any help?

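【Comments】:

  • Could you share your code so the problem can be identified?
  • I used file1_2=merge(x = file1, y = file2, by = c("Symbol","Symbol2"), all=TRUE), followed by output = merge(x=file1_2, y=file3). But I want to match on description/Aliases and append the corresponding OMIM column.
  • See my answer below. It looks like your merge call has an error.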
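A minimal sketch of why the call quoted in the comment above is likely to fail, assuming the files were read into data frames with the column names shown in the question. As far as the merge documentation goes, `by` expects column names that exist in both inputs, and file1 has no column named "Symbol2". The two data frames below are tiny hypothetical stand-ins typed inline, not the real files.

```r
# Hypothetical stand-ins for file1 and file2 (a subset of the question's rows)
file1 <- data.frame(Symbol = c("MCL1", "BAX"), stringsAsFactors = FALSE)
file2 <- data.frame(
  Symbol2 = "MCL1",
  Aliases = "MCL1, BCL2 family apoptosis regulator",
  stringsAsFactors = FALSE
)

# by = c("Symbol", "Symbol2") asks for BOTH names in BOTH data frames;
# capture the failure instead of stopping the script
attempt <- tryCatch(
  merge(x = file1, y = file2, by = c("Symbol", "Symbol2"), all = TRUE),
  error = function(e) e
)
print(inherits(attempt, "error"))

# Naming the key separately for each side avoids the problem
fixed <- merge(file1, file2, by.x = "Symbol", by.y = "Symbol2", all.x = TRUE)
print(fixed)
```

`by.x` and `by.y` are the usual way to join on keys whose names differ between the two inputs. Renaming one column so both sides share a name is an equally valid alternative.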

Tags: r join merge


【Solution 1】:

Another possibility is to rename the relevant columns you want to merge on, and then use purrr::reduce with dplyr::left_join (or, in base R, Reduce with merge).

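library(dplyr)  # attaches %>%, left_join() and select()

# Give the key columns matching names so the joins can find them
names(df2) <- c("Symbol", "Description/Aliases")
names(df3) <- c("Description/Aliases", "OMIM", "Aliases")

# Left-join df1 -> df2 -> df3, then drop df3's unused Aliases column
purrr::reduce(list(df1, df2, df3), dplyr::left_join) %>% dplyr::select(-Aliases)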
#   Symbol                       Description/Aliases   OMIM
#1    MCL1     MCL1, BCL2 family apoptosis regulator 159552
#2   ABCB1 ATP binding cassette subfamily B member 1 171050
#3     BAX                                      <NA>     NA
#4   IKZF1                                      <NA>     NA
#5    WWOX       WW domain containing oxidoreductase 605131
#6  BCL2L1          RB transcriptional corepressor 1     NA
#7 BCL2L11                                      <NA>     NA
#8   CCND1                                      <NA>     NA
#9 TNFSF10                                      <NA>     NA

Or in base R:

Reduce(function(x, y) merge(x, y, all.x = TRUE), list(df1, df2, df3))

Sample data

df1 <- read.table(text =
    "Symbol
MCL1
ABCB1
BAX
IKZF1
WWOX
BCL2L1
BCL2L11
CCND1
TNFSF10", header = T)

df2 <- read.table(text =
    "Symbol2   Aliases
MCL1    'MCL1, BCL2 family apoptosis regulator'
ABCB1   'ATP binding cassette subfamily B member 1'
WWOX    'WW domain containing oxidoreductase'
BCL2L1  'RB transcriptional corepressor 1'
BOK 'peroxisome proliferator activated receptor gamma'
RHOA    'ras homolog family member A'
ABCC1   'C-X-C motif chemokine ligand 12'
PARP1   'poly(ADP-ribose) polymerase 1'
BAK1    'BRCA1, DNA repair associated'", header = T)

df3 <- read.table(text =
    "description   OMIM    Aliases
'MCL1, BCL2 family apoptosis regulator'   159552  'G protein subunit alpha 12'
'ATP binding cassette subfamily B member 1'   171050  'matrix metallopeptidase 9'
'BCL2 associated X, apoptosis regulator'  600040  'cadherin 1'
'IKAROS family zinc finger 1' 603023  'Janus kinase 2'
'WW domain containing oxidoreductase' 605131  'ataxin 3'
'BCL2 like 1' 600039  'RB transcriptional corepressor 1'
'BCL2 like 11'    603827  'transferrin receptor'
'cyclin D1'   168461  'C-C motif chemokine ligand 2'
'TNF superfamily member 10'   603598  'prostaglandin-endoperoxide synthase 2'", header = T)
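
The join above leaves BCL2L1 without an OMIM value, while the expected result in the question lists 600039 for it. That value is only reachable through the Aliases column of file3, not through description. Below is a minimal base R sketch of one way to look up the OMIM ID through either column, assuming each text appears at most once per column. The data frames are a small subset of the question's data typed inline, and the helper names (`idx_file2`, `idx_desc`, `idx_alias`) are made up for illustration.

```r
# Small inline subset of the question's three files
file1 <- data.frame(Symbol = c("MCL1", "BAX", "BCL2L1"), stringsAsFactors = FALSE)
file2 <- data.frame(
  Symbol2 = c("MCL1", "BCL2L1"),
  Aliases = c("MCL1, BCL2 family apoptosis regulator",
              "RB transcriptional corepressor 1"),
  stringsAsFactors = FALSE
)
file3 <- data.frame(
  description = c("MCL1, BCL2 family apoptosis regulator", "BCL2 like 1"),
  OMIM = c(159552L, 600039L),
  Aliases = c("G protein subunit alpha 12", "RB transcriptional corepressor 1"),
  stringsAsFactors = FALSE
)

# Step 1: for each symbol in file1, find its row in file2 (NA if absent)
idx_file2 <- match(file1$Symbol, file2$Symbol2)
result <- data.frame(
  Symbol = file1$Symbol,
  Symbol1 = file2$Symbol2[idx_file2],
  stringsAsFactors = FALSE
)
result[["description/Aliases"]] <- file2$Aliases[idx_file2]

# Step 2: look the alias text up in file3, trying description first
# and falling back to the Aliases column
idx_desc <- match(result[["description/Aliases"]], file3$description)
idx_alias <- match(result[["description/Aliases"]], file3$Aliases)
idx <- ifelse(is.na(idx_desc), idx_alias, idx_desc)
result$OMIM <- file3$OMIM[idx]

print(result)
```

Using `match()` keeps the rows in file1's order and never duplicates rows, at the cost of silently taking only the first hit if a text occurs more than once in file3.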

【Comments】:

    【Solution 2】:

    There is an error in your merge statement. The syntax is merge(x, y, by.x, by.y, all). So your code would look like this:

    df1 <- merge(file_1, file_2, by.x = "Symbol", by.y = "Symbol2", all.x = TRUE)
    df2 <- merge(df1, file_3, by.x = "Aliases", by.y = "description", all.x = TRUE)
    

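    【Comments】:

    • Thanks. How do I keep the description/Aliases from file2 (as shown in the expected result)?
    • I think df2 should contain all columns from all 3 files. Some names may have been changed, though (e.g. Aliases.x).
    • I used your code, and these are the colnames of my output: Aliases Symbol OMIM Aliases.y. What am I doing wrong here?
    • That is the "Aliases" column in the resulting dataset. "Aliases.y" is the Aliases column from file_3.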
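
A minimal sketch of one way to address the column-name confusion in the comments above: rename file_3's own Aliases column before merging, so that the "Aliases" column in the result is unambiguously the one from file_2. The data frames are a small inline subset of the question's data, and the name `Aliases_file3` is made up for illustration.

```r
# Small inline subset of the question's three files
file_1 <- data.frame(Symbol = c("MCL1", "BAX", "WWOX"), stringsAsFactors = FALSE)
file_2 <- data.frame(
  Symbol2 = c("MCL1", "WWOX"),
  Aliases = c("MCL1, BCL2 family apoptosis regulator",
              "WW domain containing oxidoreductase"),
  stringsAsFactors = FALSE
)
file_3 <- data.frame(
  description = c("MCL1, BCL2 family apoptosis regulator",
                  "WW domain containing oxidoreductase"),
  OMIM = c(159552L, 605131L),
  Aliases = c("G protein subunit alpha 12", "ataxin 3"),
  stringsAsFactors = FALSE
)

# Rename file_3's alias column up front so merge() has no name clash to resolve
names(file_3)[names(file_3) == "Aliases"] <- "Aliases_file3"

df1 <- merge(file_1, file_2, by.x = "Symbol", by.y = "Symbol2", all.x = TRUE)
df2 <- merge(df1, file_3, by.x = "Aliases", by.y = "description", all.x = TRUE)

# merge() sorts rows by the key, so restore file_1's order and keep
# only the columns from the expected result
out <- df2[match(file_1$Symbol, df2$Symbol), c("Symbol", "Aliases", "OMIM")]
names(out)[names(out) == "Aliases"] <- "description/Aliases"
rownames(out) <- NULL
print(out)
```

The key column of the second merge keeps the name given in `by.x`, so "Aliases" in `df2` holds the text that came from file_2. Only the non-key column from file_3 needs a distinct name.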