【发布时间】:2019-03-01 19:13:24
【问题描述】:
我有一个 data.frame,其中包含一堆具有唯一 ID 的行,后跟一个氨基酸序列。我想知道是否有办法将行拆分为单独的唯一 data.frame。
这是一个例子
bigdf
>ENSCAFP00000018847.4
FGHFGHFGHFGHFHFGHFGHFGHFGHFHFGHFGHFHFGHFGHFHFHFHFGTNSEGAFQTA
DLLETSVPSHMPLETQTLSPQTFDWTLILANSNSEAETRDTKTTFPAMEGRAFTKMTPSK
>ENSCAFP00000018847.3
VCXVNSFRERYTRIOUHFSDAADSSAASAAAAGPVVTANHVEEPAMTPGVRTNSEGAFQTA
DLLETSVPSHMPLETQTLSPQTFDWTLILANSNSEAETRDTKTTFPAMEGRAFTKMTPSK
>ENSCAFP00000018847.2
ASDASDADASDASDASDASDASSADASASRPGPVVTANHVEEPAMTPGVRTNSEGAFQTA
DLLETSVPSHMPLETQTLSPQTFDWTLILANSNSEAETRDTKTTFPAMEGRAFTKMTPSK
>ENSCAFP00000018847.1
QWEQWEQWEQWEWQREWRQWEQWRQRQQRERPGPVVTANHVEEPAMTPGVRTNSEGAFQTA
DLLETSVPSHMPLETQTLSPQTFDWTLILANSNSEAETRDTKTTFPAMEGRAFTKMTPSK
如果我可以将新 data.frames 的名称作为它们的 ID,那就太好了,希望结果看起来像这样
ENSCAFP00000018847.4
>ENSCAFP00000018847.4
FGHFGHFGHFGHFHFGHFGHFGHFGHFHFGHFGHFHFGHFGHFHFHFHFGTNSEGAFQTA
DLLETSVPSHMPLETQTLSPQTFDWTLILANSNSEAETRDTKTTFPAMEGRAFTKMTPSK
ENSCAFP00000018847.3
>ENSCAFP00000018847.3
VCXVNSFRERYTRIOUHFSDAADSSAASAAAAGPVVTANHVEEPAMTPGVRTNSEGAFQTA
DLLETSVPSHMPLETQTLSPQTFDWTLILANSNSEAETRDTKTTFPAMEGRAFTKMTPSK
ENSCAFP00000018847.2
>ENSCAFP00000018847.2
ASDASDADASDASDASDASDASSADASASRPGPVVTANHVEEPAMTPGVRTNSEGAFQTA
DLLETSVPSHMPLETQTLSPQTFDWTLILANSNSEAETRDTKTTFPAMEGRAFTKMTPSK
ENSCAFP00000018847.1
>ENSCAFP00000018847.1 QWEQWEQWEQWEWQREWRQWEQWRQRQQRERPGPVVTANHVEEPAMTPGVRTNSEGAFQTADLLETSVPSHMPLETQTLSPQTFDWTLILANSNSEAETRDTKTTFPAMEGRAFTKMTPSK
我知道这应该是一件奇怪的事情,但需要对数千个不同的氨基酸序列执行此操作,所以如果我能找到一种方法将它们全部拆分到 R 中会很酷
dput(df[1:3, c(1)])
c("> ENSCAFP00000018847.4 MFFINIISLIIPILLAVAFLTLVERKVLGYMQLRKGPNIVGPYGLLQPIADAVKLFTKEPLRPLTSSMSMFILAPILALSLALTMWIPLPMPYPLINMNLGVLFMLAMSSLAVYSILWSGWASNSKYALIGALRAVAQTISYEVTLAIILLSVLLMNGSFTLSTLIITQEHMWLIFPAWPLAMMWFISTLAETNRAPFDLTEGESELVSGFNVEYAAGPFALFFLAEYANIIMMNILTTILFFGAFHNPFMPELYSINFTMKTLLLTICFLWIRASYPRFRYDQLMHLLWKNFLPLTLALCMWHVALPIITASIPPQT",
"> ENSCAFP00000018847.3 MKPPILIIIMATIMTGTMIVMLSSHWLLIWIGFEMNMLAIIPILMKKYNPRAMEASTKYFLTQATASMLLMMGVTINLLYSGQWVISKISNPIASIMMTTALTMKLGLSPFHFWVPEVTQGITLMSGMILLTWQKIAPMSILYQISPSINTNLLMLMALTSVLVGGWGGLNQTQLRKIMAYSSIAHMGWMAAIITYNPTMMVLNLTLYILMTLSTFMLFMLNSSTTTLSLSHMWNKFPLITSMILILMLSLGGLPPLSGFIPKWMIIQELTKNNMIIIPTLMAITALLNLYFYLRLTYSTALTMFPSTNNMKMKWQFEYTKKATLLPPLIITSTMLLPLTPMLSVLD",
"> ENSCAFP00000018847.2 MFINRWLFSTNHKDIGTLYLLFGAWAGMVGTALSLLIRAELGQPGTLLGDDQIYNVIVTAHAFVMIFFMVMPIMIGGFGNWLVPLMIGAPDMAFPRMNNMSFWLLPPSFLLLLASSMVEAGAGTGWTVYPPLAGNLAHAGASVDLTIFSLHLAGVSSILGAINFITTIINMKPPAMSQYQTPLFVWSVLITAVLLLLSLPVLAAGITMLLTDRNLNTTFFDPAGGGDPILYQHLFWFFGHPEVYILILPGFGMISHIVTYYSGKKEPFGYMGMVWAMMSIGFLGFIVWAHHMFTVGMDVDTRAYFTSATMIIAIPTGVKVFSWLATLHGGNIKWSPAMLWALGFIFLFTVGGLTGIVLANSSLDIVLHDTYYVVAHFHYVLSMGAVFAIMGGFAHWFPLFSGYTLNDTWAKIHFTIMFVGVNMTFFPQHFLGLSGMPRRYSDYPDAYTTWNTVSSMGSFISLTAVMLMIFMIWEAFASKREVAMVELTTTNIEWLHGCPPPYHTFEEPTYVIQK"
)
【问题讨论】:
-
bigdf只有一栏吗? ID后面的两行是连续的字符串还是两个字符串? -
没错,它只有一列。它是一个连续的字符串
-
您是否反对将它们全部保存在一个具有 ID 列和序列列的数据框中?它比加载不同的数据帧要简洁得多。
-
您可以与
dput共享几个示例行吗? @Lyngbakr 建议的更好方法是, -
您的
dput只分享了一栏。是否还有其他列需要我们担心,或者您只想将这一列向量转换为多个数据框?
标签: r