Concatenating sliced strings based on slice indices in a csv file: answers

【Question title】: Concatenating sliced strings based on slice indices in csv file
【Posted】: 2014-02-16 15:46:04
【Question】:

Alright, my challenge seems easy enough, but I have run out of options, so any help would be appreciated.

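I have a number of DNA sequences in fasta format that need to be sliced at specific positions, with the resulting parts then concatenated. So if my sequence file looks like this: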

~$ cat seq_file
>Sequence1
This is now a sequence that must require a bit of slicing and concatenation to be useful
>Sequence2
I have many more uncleaned strings like this in the form of sequences

I would like the output to look like this:

>Sequence1
This is useful
>Sequence2
I have cleaned sequences

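The slicing is determined by slice indices stored in a separate csv file. Each row holds the sequence ID followed by alternating start and end positions, and rows with fewer slices are padded with empty cells. In this case the slice positions are organized like this: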

~$ cat test.csv
Sequence1,0,9,66,74,,
Sequence2,0,5,15,22,48,57

My code:

from Bio import SeqIO
import csv

seq_dict = {}
for seq_record in SeqIO.parse('seq_file', 'fasta'):
    descr = seq_record.description
    seq_dict[descr] = seq_record.seq

with open('test.csv', 'rb') as file:
    reader = csv.reader(file)
    for row in reader:
        seq_id = row[0] 
        for n in range(1,7): 
            if n % 2 != 0:
                start = row[n] # start positions sit in the odd-numbered columns (1, 3, 5)
            else:
                end = row[n] 

                for key, value in sorted(seq_dict.iteritems()):
                    #print key, value
                    if key == seq_id: # cross check matching sequence identities
                        try:
                            slice_seq = value[int(start):int(end)]
                            print key
                            print slice_seq
                        except ValueError:
                            print 'Ignore empty slice indices.. '

This now prints:

Sequence1
Thisisnow
Sequence1
useful
Ignore empty slice indices.. 
Sequence2
Ihave
Sequence2
cleaned
Sequence2
sequences

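So far so good, and this is what I expect. But how do I bring the sliced parts of each sequence together, by joining, concatenating, or whatever operation is possible in Python, to get the output I want? Thanks.

【Comments】:

Tags: python csv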

【Solution 1】:

Something like this:

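import csv
from string import whitespace
# Python 2 code: assumes seq_file lists records in the same order as test.csv,
# with each sequence on a single line.
with open('seq_file') as f1, open('test.csv') as f2:
    for row in csv.reader(f2):
        # drop empty cells, convert the rest to int
        it = iter(map(int, filter(None, row[1:])))
        # pair consecutive indices as (start, end) slices
        slices = [slice(x, next(it)) for x in it]
        seq = next(f1)                               # header line, e.g. '>Sequence1'
        line = next(f1).translate(None, whitespace)  # sequence with whitespace removed
        print seq,
        print ' '.join(line[s] for s in slices)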

Output:

>Sequence1
Thisisnow useful
>Sequence2
Ihave cleaned sequences
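
For readers on Python 3, the same idea can be sketched as follows. This is an illustrative port, not the answerer's code: the function name `concat_slices` and the `sep` parameter are made up for the example, the sample data is copied from the question into in-memory files, and it keeps the same assumptions as the answer above (records appear in the same order in both files, and each sequence occupies a single line).

```python
import csv
import io


def concat_slices(fasta_lines, csv_lines, sep=" "):
    """Yield (header, joined_slices) per record; both inputs are iterables of lines."""
    fasta = iter(fasta_lines)
    for row in csv.reader(csv_lines):
        # Drop the ID column and any empty cells, then convert to int.
        it = iter([int(cell) for cell in row[1:] if cell])
        # zip(it, it) pulls two items per step from the same iterator,
        # pairing the indices as (start, end), (start, end), ...
        slices = [slice(start, end) for start, end in zip(it, it)]
        header = next(fasta).strip()
        # The indices refer to the sequence with all whitespace removed.
        line = "".join(next(fasta).split())
        yield header, sep.join(line[s] for s in slices)


# Sample data copied from the question.
seq_file = io.StringIO(
    ">Sequence1\n"
    "This is now a sequence that must require a bit of slicing "
    "and concatenation to be useful\n"
    ">Sequence2\n"
    "I have many more uncleaned strings like this in the form of sequences\n"
)
test_csv = io.StringIO("Sequence1,0,9,66,74,,\nSequence2,0,5,15,22,48,57\n")

results = list(concat_slices(seq_file, test_csv))
for header, joined in results:
    print(header)
    print(joined)
```

Passing `sep=""` would concatenate the parts with no separator, which is probably what real DNA sequences need.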

【Discussion】:

Awesome. This place is great.

【Solution 2】:

You can get there with a few modifications to your own code:

with open('test.csv', 'rb') as file:
    reader = csv.reader(file)
    for row in reader:
        seq_id = row[0]
        seqs = []
        for n in range(1,7):
            if n % 2 != 0:
                start = row[n] # start positions sit in the odd-numbered columns (1, 3, 5)
            else:
                end = row[n]

                for key, value in sorted(seq_dict.iteritems()):
                    #print key, value
                    if key == seq_id: # cross check matching sequence identities
                        try:
                            seqs.append(value[int(start):int(end)])
                        except ValueError:
                            print 'Ignore empty slice indices.. '
        print ' '.join(str(x) for x in seqs)
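
A possible Python 3 variant of this approach is sketched below. It is an assumption-laden illustration rather than the answerer's code: plain strings with whitespace removed stand in for the `Seq` objects that `SeqIO.parse` would produce, the csv is read from an in-memory string, and the inner loop over every dictionary key is replaced by a direct lookup. The names `raw` and `joined` are invented for the example.

```python
import csv
import io

# Stand-in for the seq_dict built with SeqIO.parse in the question;
# plain strings (whitespace removed) are used instead of Seq objects.
raw = {
    "Sequence1": "This is now a sequence that must require a bit of slicing "
                 "and concatenation to be useful",
    "Sequence2": "I have many more uncleaned strings like this in the form of sequences",
}
seq_dict = {name: "".join(text.split()) for name, text in raw.items()}

test_csv = io.StringIO("Sequence1,0,9,66,74,,\nSequence2,0,5,15,22,48,57\n")

joined = {}
for row in csv.reader(test_csv):
    seq_id, cells = row[0], row[1:]
    seq = seq_dict.get(seq_id)  # direct lookup instead of scanning every key
    if seq is None:
        continue  # ID in the csv with no matching sequence
    parts = []
    # cells[0::2] are the start positions, cells[1::2] the end positions.
    for start, end in zip(cells[0::2], cells[1::2]):
        if start and end:  # skip the empty padding cells
            parts.append(str(seq[int(start):int(end)]))
    joined[seq_id] = " ".join(parts)

for seq_id, text in joined.items():
    print(seq_id)
    print(text)
```

Checking for empty cells up front avoids relying on `ValueError` from `int('')` to skip the padding.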

【Discussion】:

I used str(x) to turn the Seq objects into strings, because join cannot work with them directly.
Thank you very much. Really appreciate it!!
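
The point about `str(x)` comes down to `str.join` accepting only `str` items. The sketch below shows this with a hypothetical `FakeSeq` class standing in for a non-string sequence object; it does not use Biopython, and how an actual `Seq` object behaves with `join` may depend on the Biopython version.

```python
class FakeSeq:
    """Hypothetical stand-in for a sequence object that is not a str."""

    def __init__(self, data):
        self.data = data

    def __str__(self):
        return self.data


parts = [FakeSeq("Ihave"), FakeSeq("cleaned"), FakeSeq("sequences")]

try:
    " ".join(parts)
    join_failed = False
except TypeError:
    # str.join accepts only str items, so non-str objects raise TypeError.
    join_failed = True

# Converting each item with str() first makes the join work.
joined = " ".join(str(p) for p in parts)
print(join_failed, joined)
```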