IOError while retrieving sequences from a FASTA file using Biopython

【Question title】: IOError while retrieving sequences from fasta file using biopython
【Posted】: 2015-12-01 07:44:25
【Question】:

I have a FASTA file containing papillomavirus sequences (whole genomes, partial CDS, ...), and I am using Biopython to retrieve the whole genomes (around 7 kb) from this file. Here is my code:

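from Bio import SeqIO

# Index the file so records are parsed on demand instead of all at once.
rec_dict = SeqIO.index("hpv_id_name_all.fasta", "fasta")

c = 0  # number of records visited so far
for k in rec_dict.keys():
    c = c + 1
    if len(rec_dict[k].seq) > 7000:
        handle = open(rec_dict[k].description + "_" + str(len(rec_dict[k].seq)) + ".fasta", "w")
        handle.write(">" + rec_dict[k].description + "\n" + str(rec_dict[k].seq) + "\n")
        handle.close()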

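I am using the dictionary-like index to avoid loading everything into memory. The variable "c" is there to show how many iterations ran before this error appeared: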

Traceback (most recent call last):
  File "<stdin>", line 4, in <module>
IOError: [Errno 2] No such file or directory: 'EU410347.1|Human papillomavirus FA75/KI88-03_7401.fasta'

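When I print the value of "c" I get 9013, whereas the file contains 10447 sequences, which means the for loop did not go through all of them (the counter is incremented before the "if" condition, so it counts every iteration, not only those that match the condition). I do not understand the input/output error: it should create the file 'EU410347.1|Human papillomavirus FA75/KI88-03_7401.fasta' rather than check whether it exists, shouldn't it?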

【Comments】:

This is a duplicate of a question on another site: biostars.org/p/167918
I am asking on both sites because I am in a hurry and need a quick answer.

Tags: biopython fasta ioerror

【Solution 1】:

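The file you are trying to create, 'EU410347.1|Human papillomavirus FA75/KI88-03_7401.fasta', contains a slash ('/'), which is treated as a path separator. Python therefore reads the name as the directory 'EU410347.1|Human papillomavirus FA75' followed by the file name 'KI88-03_7401.fasta'. Opening a file in "w" mode creates the file itself but not any missing parent directory, so Python complains that the directory does not exist.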
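The behaviour can be reproduced without Biopython. The following is a minimal sketch, assuming a POSIX-style file system and Python 3 (where IOError is an alias of OSError); the helper try_create and the slash-free name 'EU410347.1_7401.fasta' are made up for illustration and are not part of the original code.

```python
import errno
import os
import tempfile


def try_create(path):
    """Try to create a file for writing; return the errno, or None on success."""
    try:
        with open(path, "w") as handle:
            handle.write("test\n")
    except OSError as err:  # IOError is an alias of OSError in Python 3
        return err.errno
    return None


with tempfile.TemporaryDirectory() as workdir:
    # No slash in the name: the file is created directly inside workdir.
    plain = os.path.join(workdir, "EU410347.1_7401.fasta")
    plain_result = try_create(plain)

    # A slash in the name: open() expects a directory ending in "FA75",
    # which does not exist, so the call fails with ENOENT (errno 2).
    slashed = os.path.join(
        workdir, "EU410347.1|Human papillomavirus FA75/KI88-03_7401.fasta"
    )
    slashed_result = try_create(slashed)

print(plain_result, slashed_result == errno.ENOENT)
```

The first call succeeds, while the second fails with the same error number as in the traceback, even though the intent in both cases was only to create a new file.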

You probably want to replace the slash with something else, for example:

handle=open(rec_dict[k].description.replace('/', '_')+"_"+str(len(rec_dict[k].seq))+".fasta","w")
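Other descriptions in the file might contain further characters that are awkward in file names. As a slightly broader variant of the replacement above, here is a sketch of a helper that builds the output name; the function safe_filename and the particular set of replaced characters are assumptions made for illustration, not something the answer prescribes.

```python
import re


def safe_filename(description, seq_length, extension=".fasta"):
    """Build an output file name from a record description (hypothetical helper)."""
    # Replace path separators and other characters that are problematic in
    # file names on common systems with an underscore.
    cleaned = re.sub(r'[\\/:*?"<>|]', "_", description)
    return "{}_{}{}".format(cleaned, seq_length, extension)


name = safe_filename("EU410347.1|Human papillomavirus FA75/KI88-03", 7401)
print(name)  # EU410347.1_Human papillomavirus FA75_KI88-03_7401.fasta
```

In the loop from the question, this would be called as safe_filename(rec_dict[k].description, len(rec_dict[k].seq)) and the result passed to open(). Note that two different descriptions could map to the same cleaned name, in which case the later file would overwrite the earlier one.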
