【问题标题】:why my sed script to split FASTA file is slow?为什么我分割 FASTA 文件的 sed 脚本很慢?
【发布时间】:2016-08-26 14:41:57
【问题描述】:

我有一个 600 Mb 的 FASTA 文件,其中包含来自 12 个物种的许多比对块,我想将它们拆分成更小的 FASTA 文件,每个文件包含一个块及其对应的比对

我有一个如下所示的 sed 脚本:

#!/bin/bash
echo
for i in {0..Nblocks}; do
  sed -n "/block_index=$i|/,/^$/p" genome12species.fasta > bloque$i.fasta
done

这适用于小规模,但对于 600Mb 的大文件,它需要很长时间,大约 2 天。我认为这与我正在运行的计算机无关。

有谁知道如何加快速度?

输入的 Fasta 文件如下所示:

dm3.chr3R(-):17092630-17092781|sequence_index=0|block_index=4|species=dm3|dm3_4_0 GGCGGAGATCAAGAATCGCGTCGGGCCGCCGTCCAGCGCCACTGACAACGCTAGCAAAGTGAAAATCGATCAAGGACGTCCAGTGGAAAACACCAAATCCGGTTGCTGCTGAATAA-CTCTGATTGTGAATCattattttattatacaatta droGri2.scaffold_15074(-):2610183-2610334|sequence_index=0|block_index=4|species=droGri2|droGri2_4_0 GGCGGAGATCAAGAATCGTGTTGGGCCGCCGTCGAGCGCCACCGATAACGCTAGCAAAGTGAAAATCGATCAGGGACGCCCAGTGGAAAACAATAGATCTGGTTGCTGCTAAATAA-CTCTGATTGTGAATCATTATTTTATTATACAATTa droMoj3.scaffold_6540(+):33866311-33866462|sequence_index=0|block_index=4|species=droMoj3|droMoj3_4_0 TGCCGAGATTAAGAATCGTGTCGGTCCGCCGTCCAGCGCAACCGACAATGCAAGCAAAGTGAAAATCGATCAGGGACGTCCAGTGGAGAACACCAGATCTGGTTGCTGCTGAATAA-CTCTGATTGTGAATCATTATTTTATTatacaatta droVir3.scaffold_12822(+):1248119-1248270|sequence_index=0|block_index=4|species=droVir3|droVir3_4_0 GGCCGAGATTAAGAATCGCGTCGGGCCGCCGTCCAGCGCCACCGATAATGCTAGCAAAGTGAAAATCGATCAGGGTCGTCCAGTGGAGAACACCAAATCTGGTTGCTGCTGAATAA-CTCTGATTGTGAATCattattttattatacaatta droWil1.scaffold_181130(-):16071336-16071488|sequence_index=0|block_index=4|species=droWil1|droWil1_4_0 GGCCGAGATTAAGAATCGTGTTGGGCCGCCGTCCAGCGCCACTGATAATGCTAGCAAAGTGAAAATCGATCAAGGACGTCCAGTGGAAAATACCAAATCCGGTTGCTGCTGAATAAACTCTGATTGTGAATCATTATTTTATTATACAATTA droPer1.super_19(-):1310088-1310239|sequence_index=0|block_index=4|species=droPer1|droPer1_4_0 GGCTGAGATCAAGAATCGCGTCGGACCGCCGTCCAGCGCCACCGACAACGCTAGCAAAGTGAAAATCGATCAAGGACGTCCAGTGGAAAAACCCAATTCTGGTTGCTGCTGAATAA-CTCTGATTGTGAATCattattttattatacaatta dp4.chr2(-):5593491-5593642|sequence_index=0|block_index=4|species=dp4|dp4_4_0 GGCTGAGATCAAGAATCGCGTCGGACCGCCGTCCAGCGCCACCGACAACGCTAGCAAAGTGAAAATCGATCAAGGACGTCCAGTGGAAAAGCCCAATTCTGGTTGCTGCTGAATAA-CTCTGATTGTGAATCattattttattatacaatta droAna3.scaffold_13340(-):3754154-3754305|sequence_index=0|block_index=4|species=droAna3|droAna3_4_0 GGCCGAGATCAAGAATCGCGTCGGGCCACCGTCCAGCGCCACCGACAACGCTAGCAAAGTGAAAATCGATCAAGGACGTCCAGTGGAAAACACCAGATCCGGTTGCTGCTGAATAA-CTCTGATTGTGAATCattattttattataaaatta droEre2.scaffold_4770(+):4567591-4567742|sequence_index=0|block_index=4|species=droEre2|droEre2_4_0 GGCCGAGATCAAGAATCGCGTCGGGCCGCCGTCCAGCGCCACCGACAACGCTAGCAAAGTGAAAATCGATCAAGGACGTCCAGTGGAAAACACCAAATCCGGTTGCTGCTGAATAA-CTCTGATTGTGAATCattattttattatacaatta droYak2.chr3R(-):5883047-5883198|sequence_index=0|block_index=4|species=droYak2|droYak2_4_0 GGCCGAGATCAAGAATCGCGTCGGGCCGCCATCCAGCGCCACCGACAACGCTAGCAAAGTGAAAATCGATCAAGGACGTCCAGTGGAAAACACCAAATCCGGTTGCTGCTGAATAA-CTCTGATTGTGAATCattattattattattatacaatta droSec1.super_38(+):36432-36583|sequence_index=0|block_index=4|species=droSec1|droSec1_4_0 GGCGGAGATCAAGAATCGCGTCGGTCCGCCGTCCAGCGCCACTGACAACGCTAGCAAAGTGAAAATCGATCAAGGACGTCCAGTGGAAAACACCAAATCCGGTTGCTGCTGAATAA-CTCTGATTGTGAATCattattttattatacaatta droSim1.chr3R(+):4366350-4366501|sequence_index=0|block_index=4|species=droSim1|droSim1_4_0 GGCGGAGATCAAGAATCGCGTCGGGCCGCCGTCCAGCGCCACTGACAACGCTAGCAAAGTGAAAATCGATCAAGGACGTCCAGTGGAAAACACCAAATCCGGTTGCTGCTGAATAA-CTCTGATTGTGAATCattattttattatacaatta

dm3.chr3R(-):17092781-17092867|sequence_index=0|block_index=5|species=dm3|dm3_5_0 GAGTACGCCGCCCAGTTAGGCATTCCATTCCTTGAAACTTCGGCCAAGAGCGCCACCAACGTTGAGCAGGCCTTCATGACGATGGC droSim1.chr3R(+):4366264-4366350|sequence_index=0|block_index=5|species=droSim1|droSim1_5_0 GAGTACGCCGCCCAGTTAGGCATTCCATTCCTTGAAACTTCGGCCAAGAGCGCCACCAACGTTGAGCAGGCCTTTATGACGATGGC droSec1.super_38(+):36346-36432|sequence_index=0|block_index=5|species=droSec1|droSec1_5_0 GAGTACGCCGCCCAGTTAGGCATTCCATTCCTTGAAACTTCGGCCAAGAGCGCCACCAACGTTGAGCAGGCCTTCATGACGATGGC droYak2.chr3R(-):5883198-5883284|sequence_index=0|block_index=5|species=droYak2|droYak2_5_0 GAGTACGCCGCCCAGTTAGGCATTCCATTCCTTGAAACATCGGCCAAGAGCGCCACCAACGTGGAGCAGGCCTTCATGACGATGGC droEre2.scaffold_4770(+):4567505-4567591|sequence_index=0|block_index=5|species=droEre2|droEre2_5_0 GAGTACGCCGCCCAGTTAGGCATTCCATTCCTTGAAACTTCGGCCAAGAGCGCCACCAACGTGGAGCAGGCCTTCATGACGATGGC droAna3.scaffold_13340(+):20375068-20375148|sequence_index=0|block_index=5|species=droAna3|droAna3_5_0 ------GCCGAAAACTTCGACATGCCCTTCTTCGAGGTCTCTTGCAAGTCAAACATCAATATTGAAGATGCGTTTCTTTCCCTGGC dp4.chr2(-):5593642-5593728|sequence_index=0|block_index=5|species=dp4|dp4_5_0 GAGTATGCAGCTCAGTTAGGCATTCCATTTCTTGAAACTTCGGCCAAGAGCGCCACGAACGTGGAGCAGGCCTTCATGACGATGGC droPer1.super_19(-):1310239-1310325|sequence_index=0|block_index=5|species=droPer1|droPer1_5_0 GAGTATGCAGCTCAGTTAGGCATTCCATTTCTTGAAACTTCGGCCAAGAGCGCCACGAACGTGGAGCAGGCCTTCATGACGATGGC droWil1.scaffold_181130(-):16071488-16071574|sequence_index=0|block_index=5|species=droWil1|droWil1_5_0 GAATATGCGGCTCAGTTAGGCATTCCATTCCTTGAAACTTCGGCAAAGAGTGCCACCAATGTGGAGCAGGCCTTTATGACGATGGC droVir3.scaffold_12822(+):1248033-1248119|sequence_index=0|block_index=5|species=droVir3|droVir3_5_0 GAGTACGCACATCAGTTAGGCATTCCATTCCTTGAAACTTCGGCCAAGAGCGCCACCAACGTGGAGCAGGCATTTATGACGATGGC droMoj3.scaffold_6540(+):33866225-33866311|sequence_index=0|block_index=5|species=droMoj3|droMoj3_5_0 GAGTATGCACATCAGTTAGGCATTCCATTCCTTGAAACTTCGGCCAAGAGCGCCACCAATGTAGAGCAGGCATTCATGACGATGGC droGri2.scaffold_15074(-):2610334-2610420|sequence_index=0|block_index=5|species=droGri2|droGri2_5_0 GAGTACGCAAATCAGTTAGGCATTCCATTCCTTGAAACTTCGGCGAAGAGTGCCACCAATGTGGAACAGGCATTCATGACGATGGC

【问题讨论】:

  • 肯定 awk 将能够为您解决此问题,只需通过文件 1 次,并且只需几分钟。如果您与 sed “结婚”,那么更改为 .../^$/{p,q}' 可能有助于加快速度,但仍远不及 awk。添加一小组假/样本数据来说明您的文件格式(最多 5-10 行,最多 30-60 个字符宽),人们会对此进行深入研究。 (我要出去了 ;-/ ) 。祝你好运。
  • {0..Nblocks} 真的有效吗? (我认为不会。)
  • @shellter。感谢您的洞察力。我没有嫁给 sed,我只是一个信息学的菜鸟,sed 是我唯一的想法。直到今天我才知道 awk 存在。
  • @BenjaminW。它没有,我必须手动输入块数。

标签: sed fasta sequence-alignment


【解决方案1】:

这里有一个 awk oneliner 可以帮助您入门 - 它使用与 sed 相同的正则表达式范围 - 匹配的 block_index 以 m[1] 为单位 - 600MB 只需几分钟

awk 'match($0, /block_index=([0-9]+)\|/, m),/^$/ {print >"bloque"m[1]".fasta"}'

【讨论】:

  • 您介意发布此解决方案的时间吗?谢谢
  • 当然!我没有精确计时,但它在大约 20 分钟内运行了一个 650Mb 和 215000 个块的 12 个物种比对的文件。
  • 这很合理 - 我是否正确理解 215,000 个块意味着创建了 215,000 个文件?
猜你喜欢
  • 1970-01-01
  • 2013-06-13
  • 1970-01-01
  • 1970-01-01
  • 2012-04-07
  • 1970-01-01
  • 2014-04-19
  • 1970-01-01
  • 1970-01-01
相关资源
最近更新 更多