【Question title】: Concatenating sliced strings based on slice indices in csv file
【Posted】: 2014-02-16 15:46:04
【Question】:

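My problem seems easy, but I have run out of ideas, so any help would be appreciated.

I have a number of DNA sequences in fasta format that need to be sliced at specific positions, with the resulting pieces then concatenated. So if my sequence file looks like this: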

~$ cat seq_file
>Sequence1
This is now a sequence that must require a bit of slicing and concatenation to be useful
>Sequence2
I have many more uncleaned strings like this in the form of sequences

I would like the output to look like this:

>Sequence1
This is useful
>Sequence2
I have cleaned sequences

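The slices are defined by slice indices stored in a separate csv file. Each row holds a sequence ID followed by start,end pairs, and unused fields are left empty. In this case the slice positions are organized like this: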

~$ cat test.csv
Sequence1,0,9,66,74,,
Sequence2,0,5,15,22,48,57
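
The row layout above can be turned into (start, end) pairs before any slicing happens. The following is a minimal Python 3 sketch, not the asker's code; the helper name `parse_slice_row` is hypothetical, and it assumes the indices always come in start,end order with empty fields used only as padding.

```python
import csv
import io

def parse_slice_row(row):
    """Return (sequence id, list of (start, end) pairs) for one CSV row."""
    seq_id = row[0]
    # Drop the empty padding fields, then convert the rest to integers.
    positions = [int(field) for field in row[1:] if field.strip()]
    if len(positions) % 2:
        raise ValueError("odd number of slice indices for %s" % seq_id)
    # Pair consecutive values: first with second, third with fourth, ...
    pairs = list(zip(positions[0::2], positions[1::2]))
    return seq_id, pairs

# Same content as test.csv above, held in memory for the example.
sample = "Sequence1,0,9,66,74,,\nSequence2,0,5,15,22,48,57\n"
parsed = [parse_slice_row(row) for row in csv.reader(io.StringIO(sample))]
for seq_id, pairs in parsed:
    print(seq_id, pairs)
```

Raising on an odd count is a design choice for the sketch; silently dropping a dangling start index would hide a malformed row.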

My code:

from Bio import SeqIO
import csv

seq_dict = {}
for seq_record in SeqIO.parse('seq_file', 'fasta'):
    descr = seq_record.description
    seq_dict[descr] = seq_record.seq

with open('test.csv', 'rb') as file:
    reader = csv.reader(file)
    for row in reader:
        seq_id = row[0] 
        for n in range(1,7): 
            if n % 2 != 0:
                start = row[n] # start positions occupy the odd-numbered columns
            else:
                end = row[n] 

                for key, value in sorted(seq_dict.iteritems()):
                    #print key, value
                    if key == seq_id: # cross check matching sequence identities
                        try:
                            slice_seq = value[int(start):int(end)]
                            print key
                            print slice_seq
                        except ValueError:
                            print 'Ignore empty slice indices.. '

This currently prints:

Sequence1
Thisisnow
Sequence1
useful
Ignore empty slice indices.. 
Sequence2
Ihave
Sequence2
cleaned
Sequence2
sequences

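So far so good, this is what I expect. But how can I combine the sliced parts, by concatenating or joining them or by any other operation available in Python, to get the output I want? Thanks.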

【Question comments】:

    Tags: python csv


    【Solution 1】:

    Something like this:

    import csv
    from string import whitespace
    with open('seq_file') as f1, open('test.csv')  as f2:
        for row in csv.reader(f2):
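            # drop the empty fields, convert the rest to int
            it = iter(map(int, filter(None, row[1:])))
            # each x is a start; next(it) consumes the matching end
            slices = [slice(x, next(it)) for x in it]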
            seq = next(f1)
            line = next(f1).translate(None, whitespace)
            print seq,
            print ' '.join(line[s] for s in slices)
    

    Output:

    >Sequence1
    Thisisnow useful
    >Sequence2
    Ihave cleaned sequences
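
Solution 1 as written relies on Python 2 behaviour (the `print` statement and `str.translate(None, ...)`). A rough Python 3 equivalent might look like the sketch below. It keeps the same assumptions as the answer: each sequence sits on a single line after its header, and the csv rows appear in the same order as the records. The function name `slice_and_join` and the in-memory sample data are illustrative only.

```python
import csv
import io

FASTA_TEXT = """\
>Sequence1
This is now a sequence that must require a bit of slicing and concatenation to be useful
>Sequence2
I have many more uncleaned strings like this in the form of sequences
"""

CSV_TEXT = """\
Sequence1,0,9,66,74,,
Sequence2,0,5,15,22,48,57
"""

def slice_and_join(fasta_lines, csv_lines, sep=" "):
    """Return [(header, joined slices), ...], pairing records by file order."""
    fasta_iter = iter(fasta_lines)
    results = []
    for row in csv.reader(csv_lines):
        # Skip the empty padding fields and convert the rest to integers.
        numbers = iter([int(field) for field in row[1:] if field])
        # zip() over the same iterator twice yields (start, end) pairs.
        slices = [slice(start, end) for start, end in zip(numbers, numbers)]
        header = next(fasta_iter).strip()
        # str.split() with no arguments drops all whitespace in the line.
        sequence = "".join(next(fasta_iter).split())
        results.append((header, sep.join(sequence[s] for s in slices)))
    return results

results = slice_and_join(io.StringIO(FASTA_TEXT), io.StringIO(CSV_TEXT))
for header, joined in results:
    print(header)
    print(joined)
```

With real files, the two `io.StringIO` objects would be replaced by open file handles; passing `sep=""` gives one unbroken string instead of space-separated pieces.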
    

    【Comments】:

    • Awesome. This place is awesome.
    【Solution 2】:

    You can get there with a few modifications:

    with open('test.csv', 'rb') as file:
        reader = csv.reader(file)
        for row in reader:
            seq_id = row[0]
            seqs = []
            for n in range(1,7):
                if n % 2 != 0:
                    start = row[n] # start positions occupy the odd-numbered columns
                else:
                    end = row[n]
    
                    for key, value in sorted(seq_dict.iteritems()):
                        #print key, value
                        if key == seq_id: # cross check matching sequence identities
                            try:
                                seqs.append(value[int(start):int(end)])
                            except ValueError:
                                print 'Ignore empty slice indices.. '
            print ' '.join(str(x) for x in seqs)
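
Because the dictionary is keyed by sequence ID, the inner loop over every entry can be replaced by a direct lookup. The sketch below illustrates that idea in Python 3 using only the standard library. `read_fasta` is a hypothetical stand-in for the `SeqIO.parse` loop in the question, and it strips whitespace from the sequence so the slices match the outputs shown above.

```python
import csv
import io

def read_fasta(lines):
    """Hypothetical minimal FASTA reader: maps header text to its sequence."""
    parts = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:]
            parts[name] = []
        elif name is not None:
            # Remove internal whitespace; sequences may span several lines.
            parts[name].append("".join(line.split()))
    return {key: "".join(chunks) for key, chunks in parts.items()}

def joined_slices(records, csv_lines, sep=" "):
    """Map each sequence id in the CSV to its joined slices."""
    out = {}
    for row in csv.reader(csv_lines):
        if not row:
            continue
        sequence = records.get(row[0])  # direct lookup, no scan over the dict
        if sequence is None:
            continue  # no record with this id
        numbers = [int(field) for field in row[1:] if field]
        pieces = [sequence[a:b] for a, b in zip(numbers[0::2], numbers[1::2])]
        out[row[0]] = sep.join(pieces)
    return out

fasta = io.StringIO(
    ">Sequence1\n"
    "This is now a sequence that must require a bit of slicing and concatenation to be useful\n"
    ">Sequence2\n"
    "I have many more uncleaned strings like this in the form of sequences\n"
)
# Rows deliberately in a different order than the FASTA records.
table = io.StringIO("Sequence2,0,5,15,22,48,57\nSequence1,0,9,66,74,,\n")

joined = joined_slices(read_fasta(fasta), table)
for seq_id in sorted(joined):
    print(">" + seq_id)
    print(joined[seq_id])
```

Unlike the order-based approach in Solution 1, this variant does not require the csv rows and the sequence records to appear in the same order.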
    

    【Comments】:

    • I used str(x) to turn the Seq objects into strings, because join cannot work with them directly.
    • Thank you very much. Much appreciated!!