【发布时间】:2020-11-19 11:21:43
【问题描述】:
我有两个大的 fasta 文件 - 它们的结构不同(如下所示),但读取的标题(以 @ 开头)在两个文件中是相同的:
文件1
>MN00153:75:000H37WNG:1:12106:12990:1333
AAAACCCC
>MN00153:75:000H37WNG:1:12106:21652:2374
AAAAGGGG
>MN00153:75:000H37WNG:1:12106:21652:2366
AGGGGGTT
文件2
>MN00153:75:000H37WNG:1:12106:12990:1333
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCAGATCTCGCCC
>MN00153:75:000H37WNG:1:12106:21652:2374
AGATCTCGTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
>MN00153:75:000H37WNG:1:12106:21652:2366
TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
我使用脚本从file1 的headers(键)和reads(值)制作了一个字典:
from Bio import SeqIO
dict={}
with open ('index2.fasta', 'r') as file1:
for record in SeqIO.parse(file1, 'fasta'):
dict[str(record.id)] = str(record.seq)
我所做的是遍历 file2 中的读取,如果 'AGATCTCG' 字符串在读取中,我将这些读取的标题保存在列表中。
现在我有一个问题是我想根据dictionary 和list 创建一个file2 的子文件。如果我的列表中的项目作为我的字典中的键存在并且值是 'AAAACCCC' 那么输出应该是 MN00153:75:000H37WNG:1:12106:12990:1333 但我得到 MN00153:75:000H37WNG:1:12106:12990:1333 和 MN00153:75:000H37WNG:1:12106:21652:2374
ATTACTCG_ids=[]
with open ('Read1.fasta', 'r') as file2:
for record in SeqIO.parse(file2, 'fasta'):
if 'AGATCTCG' in record.seq:
ATTACTCG_ids.append(record.id)
for i in ATTACTCG_ids:
if dict.get(i) == 'AAAACCCC':
final = record.format('fasta')
print(final)
有人可以帮我解决这个问题吗?
【问题讨论】:
-
您的标题不以
>开头,所以这不是 FASTA。@就像 FASTQ 标头(但没有质量分数等)。您似乎有一个损坏的文件格式,如果 Biopython 或任何其他非自定义模块正确解析它,我会感到惊讶 -
我刚刚修好了,谢谢你的提醒
标签: python bioinformatics intersection biopython seq