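【Question title】: How to filter out sequences based on given data using Python?
【Posted】: 2015-09-26 00:47:19
【Question description】:

I want to filter out the sequences I don't want, based on a given file, A.fasta. The original file contains all the sequences. A FASTA file is a file in which each record starts with a sequence ID line, followed by the nucleotides, written as A, T, C, and G. Can anyone help me?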

A.fasta

>chr12:15747942-15747949
TGACATCA
>chr2:130918058-130918065
TGACCTCA

Original.fasta

>chr3:99679938-99679945
TGACGTAA
>chr9:135822160-135822167
TGACCTCA
>chr12:15747942-15747949
TGACATCA
>chr2:130918058-130918065
TGACCTCA
>chr2:38430457-38430464
TGACCTCA
>chr1:112381724-112381731
TGACATCA

Expected output, C.fasta

>chr3:99679938-99679945
TGACGTAA
>chr9:135822160-135822167
TGACCTCA
>chr2:38430457-38430464
TGACCTCA
>chr1:112381724-112381731
TGACATCA

Code

import sys
import warnings
from Bio import SeqIO
from Bio import BiopythonDeprecationWarning
warnings.simplefilter('ignore',BiopythonDeprecationWarning)

fasta_file = sys.argv[1]  # Input fasta file
remove_file = sys.argv[2] # Input wanted file, one gene name per line
result_file = sys.argv[3] # Output fasta file

remove = set()
with open(remove_file) as f:
    for line in f:
        line = line.strip()
        if line != "":
            remove.add(line)

fasta_sequences = SeqIO.parse(open(fasta_file),'fasta')

with open(result_file, "w") as f:
    for seq in fasta_sequences:
        nuc = seq.seq.tostring()
        if nuc not in remove and len(nuc) > 0:
            SeqIO.write([seq], f, "fasta")

The code above also filters out duplicate sequences, but I want to keep a duplicate sequence if it is supposed to appear in the output.
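A small illustration of why that happens, using the records from the question. This is only a sketch: the variable names `records`, `kept_by_sequence`, and `kept_by_id` are made up for the example, and the set below is what the loop over A.fasta would collect, since it stores every non-empty line (headers and nucleotides alike).

```python
# Lines of A.fasta, as the question's loop would collect them into `remove`.
remove = {
    ">chr12:15747942-15747949", "TGACATCA",
    ">chr2:130918058-130918065", "TGACCTCA",
}

# (ID, nucleotides) pairs from Original.fasta.
records = [
    ("chr3:99679938-99679945", "TGACGTAA"),
    ("chr9:135822160-135822167", "TGACCTCA"),
    ("chr12:15747942-15747949", "TGACATCA"),
    ("chr2:130918058-130918065", "TGACCTCA"),
    ("chr2:38430457-38430464", "TGACCTCA"),
    ("chr1:112381724-112381731", "TGACATCA"),
]

# Matching on nucleotides drops every record that shares a sequence
# with a record in A.fasta, not just the listed records.
kept_by_sequence = [rid for rid, nuc in records if nuc not in remove]

# Matching on the header line drops only the records listed in A.fasta.
kept_by_id = [rid for rid, nuc in records if ">" + rid not in remove]

print(kept_by_sequence)
print(kept_by_id)
```

Comparing on the nucleotides leaves only the chr3 record, while comparing on the ID leaves the four records shown in the expected output.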

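【Comments】:

  • Don't filter out the DeprecationWarning! It is there for a reason: it tells you that the tostring() method is obsolete and will be removed in a future version of BioPython. Instead, use the more modern way to get the string representation of a Seq object: write nuc = str(seq.seq) rather than nuc = seq.seq.tostring().

Tags: python filter filtering bioinformatics fasta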


【Solution 1】:

Take a look at BioPython. Here is a solution that uses it:

from Bio import SeqIO

input_file = 'a.fasta'          # records to exclude
merge_file = 'original.fasta'   # all records
output_file = 'results.fasta'   # filtered output

# Collect the IDs of the records that should be dropped.
exclude = set()
for fasta in SeqIO.parse(input_file, 'fasta'):
    exclude.add(fasta.id)

# Write every record whose ID is not in the exclusion set.
with open(output_file, 'w') as output_handle:
    for fasta in SeqIO.parse(merge_file, 'fasta'):
        if fasta.id not in exclude:
            SeqIO.write([fasta], output_handle, 'fasta')
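If BioPython is not installed, the same ID-based filtering can be sketched with the standard library alone. This is a minimal sketch rather than a replacement for SeqIO: the helpers `record_id`, `read_fasta`, and `filter_by_id` are hypothetical names invented for this example, the ID is assumed to be the first whitespace-delimited token of the header, and each sequence is written on a single line without wrapping.

```python
import io


def record_id(header):
    """Return the first whitespace-delimited token of a header (assumed ID)."""
    parts = header.split()
    return parts[0] if parts else ""


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA file-like object."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_by_id(exclude_handle, source_handle, output_handle):
    """Copy records from source to output unless their ID is excluded."""
    exclude = {record_id(h) for h, _ in read_fasta(exclude_handle)}
    for header, sequence in read_fasta(source_handle):
        if record_id(header) not in exclude:
            output_handle.write(">{}\n{}\n".format(header, sequence))


# In-memory stand-ins for A.fasta and Original.fasta from the question.
a_fasta = io.StringIO(
    ">chr12:15747942-15747949\nTGACATCA\n"
    ">chr2:130918058-130918065\nTGACCTCA\n"
)
original_fasta = io.StringIO(
    ">chr3:99679938-99679945\nTGACGTAA\n"
    ">chr9:135822160-135822167\nTGACCTCA\n"
    ">chr12:15747942-15747949\nTGACATCA\n"
    ">chr2:130918058-130918065\nTGACCTCA\n"
    ">chr2:38430457-38430464\nTGACCTCA\n"
    ">chr1:112381724-112381731\nTGACATCA\n"
)
output = io.StringIO()
filter_by_id(a_fasta, original_fasta, output)
result = output.getvalue()
print(result)
```

With real files, the three `io.StringIO` objects would be replaced by handles from `open(...)`. Because records are compared by ID, records that share a nucleotide sequence with an excluded record are kept.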

【Comments】:

  • Thanks :) much appreciated