【Posted】: 2021-02-22 20:40:25
【Question】:
Hello everyone, I have a file such as:
ORFs.fa
>scaffold_11404_1 [179 - 301]
MLLLKKAQCLTREE
>scaffold_11404_38 [5350 - 3194] (REVERSE SENSE)
MADQKNLQMSRDLALCARHGIPSLFAFLGDIVSTGISQYAISKLMVANLDLSNVDTKLNA
WQTEGGKYYAAEALIRKLDAIDRQMTEPARIACKYGLLVDLRHTLDFATDNMVANARAEV
MLDMRSYHPSNAMLQNNLTRIMVLVKNTPPQSVVSGKQAMRYIPGWQEDLECPMQKYVFF
>scaffold_11404_45 [2557 - 2450] (REVERSE SENSE)
MCKQGICRHTRHLSHIMFKLWNNFKYQNIKETRISD
>scaffold_11404_46 [2311 - 2436]
MIFIELKYSSSLKNYNSSKFNIKNLTKLKHQFYLFFYTFFNT
I need to convert it into a data frame with 5 columns, such as:
ORF_df
Segments start2 end2 sens sequence
scaffold_11404_1 179 301 normal MLLLKKAQCLTREE
scaffold_11404_38 5350 3194 reverse MADQKNLQMSRDLALCARHGIPSLFAFLGDIVSTGISQYAISKLMVANLDLSNVDTKLNA
WQTEGGKYYAAEALIRKLDAIDRQMTEPARIACKYGLLVDLRHTLDFATDNMVANARAEV
MLDMRSYHPSNAMLQNNLTRIMVLVKNTPPQSVVSGKQAMRYIPGWQEDLECPMQKYVFF
scaffold_11404_45 2557 2450 reverse MCKQGICRHTRHLSHIMFKLWNNFKYQNIKETRISD
scaffold_11404_46 2311 2436 normal MIFIELKYSSSLKNYNSSKFNIKNLTKLKHQFYLFFYTFFNT
Does anyone have an idea?
So far I have tried this code. It works, but it is very slow...
import re

import pandas as pd
from Bio import SeqIO

ORF_df = pd.DataFrame(columns=("Segments", "start2", "end2", "sens", "sequence"))
with open("ORFs.fa") as fasta_file:  # Will close handle cleanly
    for seq_record in SeqIO.parse(fasta_file, "fasta"):  # (generator)
        full_name = seq_record.description
        # Text after the last "(" is "REVERSE SENSE)" for reverse-strand ORFs
        sens = re.sub(r".*\(", "", full_name)
        if sens == "REVERSE SENSE)":
            sens = "reverse"
        else:
            sens = "normal"
        # Keep only the "start - end" text between the square brackets
        start_end = re.sub(r".*\[", "", full_name)
        start_end = re.sub(r"\].*", "", start_end)
        start_end = start_end.split("-")
        start = start_end[0].strip()
        end = start_end[1].strip()
        sequence = seq_record.seq
        Segments = seq_record.id
        # DataFrame.append copies the whole frame on each call (removed in pandas 2.0)
        ORF_df = ORF_df.append({"Segments": re.sub("_[^_]*$", "", Segments), "sequence": str(sequence), "start2": start, "end2": end, "sens": sens}, ignore_index=True)
print(ORF_df)
【Discussion】:
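One likely cause of the slowness is that `DataFrame.append` builds a new frame on every iteration, so the total cost grows roughly quadratically with the number of records. Below is a minimal sketch of the usual workaround: collect plain dicts in a list and build the frame once at the end. It assumes every header looks like the sample (`>id [start - end]`, optionally followed by `(REVERSE SENSE)`). The function name `parse_orfs` and the regular expression are illustrative and not from the question, and the sketch uses only the standard library rather than `SeqIO`. It keeps the full record ID, as in the desired table, whereas the code in the question strips the trailing `_N` suffix.

```python
import re

# Header layout assumed from the sample file:
# ">id [start - end]" plus an optional "(REVERSE SENSE)" marker.
HEADER_RE = re.compile(r"^>(\S+)\s+\[(\d+)\s*-\s*(\d+)\]\s*(\(REVERSE SENSE\))?")


def parse_orfs(lines):
    """Return one dict per FASTA record, joining wrapped sequence lines."""
    rows = []
    current = None  # record currently being filled
    chunks = []     # sequence lines belonging to the current record

    def flush():
        # Finish the current record (if any) and store it.
        if current is not None:
            current["sequence"] = "".join(chunks)
            rows.append(current)

    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            flush()
            match = HEADER_RE.match(line)
            if match is None:
                raise ValueError(f"unexpected header: {line!r}")
            segment, start, end, reverse = match.groups()
            current = {
                "Segments": segment,
                "start2": int(start),
                "end2": int(end),
                "sens": "reverse" if reverse else "normal",
            }
            chunks = []
        elif current is not None:
            chunks.append(line)
    flush()
    return rows


# Sample records copied from the question.
sample = """\
>scaffold_11404_1 [179 - 301]
MLLLKKAQCLTREE
>scaffold_11404_38 [5350 - 3194] (REVERSE SENSE)
MADQKNLQMSRDLALCARHGIPSLFAFLGDIVSTGISQYAISKLMVANLDLSNVDTKLNA
WQTEGGKYYAAEALIRKLDAIDRQMTEPARIACKYGLLVDLRHTLDFATDNMVANARAEV
MLDMRSYHPSNAMLQNNLTRIMVLVKNTPPQSVVSGKQAMRYIPGWQEDLECPMQKYVFF
>scaffold_11404_45 [2557 - 2450] (REVERSE SENSE)
MCKQGICRHTRHLSHIMFKLWNNFKYQNIKETRISD
>scaffold_11404_46 [2311 - 2436]
MIFIELKYSSSLKNYNSSKFNIKNLTKLKHQFYLFFYTFFNT
"""

rows = parse_orfs(sample.splitlines())
for row in rows:
    print(row["Segments"], row["start2"], row["end2"], row["sens"], len(row["sequence"]))
```

To read the real file, `parse_orfs` can be given the open file handle directly, since it only iterates over lines. With pandas available, `pd.DataFrame(rows)` then builds `ORF_df` in a single call, with the columns in the order the dicts were filled.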