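【Posted】: 2018-04-01 10:51:51
【Question】:
I'm trying to write a script that, given two strings, performs two tasks:
1. Find the longest run of characters, starting at pos[0], that is identical in both strings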
Seq1 = 'ATCCTTAGC'
Seq2 = 'ATCCAGCAATTC'
        ^^^^ Match from pos[0] to pos[3]
Pos: 0:3
Length: 4
Seq: ATCC
2. Find the longest contiguous run of characters that occurs in both strings
Seq1 = 'TAGCTCCTTAGC'   # Contains 'TCCTTA'
Seq2 = 'GCAGCCATCCTTA'  # Contains 'TCCTTA'
        ^ No match at pos[0]
Pos1: 4:9
Pos2: 7:12
Length: 6
Seq: TCCTTA
To solve problem 1, I have the following:
#!/usr/bin/python
upstream_seq = 'ATACATTGGCCTTGGCTTAGACTTAGATCTAGACCTGAAAATAACCTGCCGAAAAGACCCGCCCGACTGTTAATACTTTACGCGAGGCTCACCTTTTTGTTGTGCTCCC'
downstream_seq = 'ATACACGAAAAGCGTTCTTTTTTTGCCACTTTTTTTTTATGTTTCAAAACGGAAAATGTCGCCGTCGTCGGGAGAGTGCCTCCTCTTAGTTTATCAAATAAAGCTTTCG'
# Format inside the print() call so the script runs on Python 2 and 3.
print("Upstream: %s\nDownstream: %s\n" % (upstream_seq, downstream_seq))

# Walk both sequences in parallel and collect the shared prefix.
# zip() stops at the shorter sequence, so no IndexError can occur.
seq = ""
for up_base, down_base in zip(upstream_seq, downstream_seq):
    if up_base != down_base:
        break  # the first mismatch ends the prefix
    seq += up_base

longest_hom = len(seq)
if longest_hom:
    # The end index is inclusive, matching the 0:3 convention used above.
    print("Pos: 0:%s\nLength: %s\nSeq: %s\n" % (longest_hom - 1, longest_hom, seq))
else:
    print("No match at pos[0]")
Upstream: ATACATTGGCCTTGGCTTAGACTTAGATCTAGACCTGAAAATAACCTGCCGAAAAGACCCGCCCGACTGTTAATACTTTACGCGAGGCTCACCTTTTTGTTGTGCTCCC
Downstream: ATACACGAAAAGCGTTCTTTTTTTGCCACTTTTTTTTTATGTTTCAAAACGGAAAATGTCGCCGTCGTCGGGAGAGTGCCTCCTCTTAGTTTATCAAATAAAGCTTTCG
Pos: 0:4
Length: 5
Seq: ATACA
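The shared prefix can also be obtained more compactly. The following is a minimal sketch, not a definitive version: it relies on `os.path.commonprefix` from the standard library, which compares its arguments character by character and therefore works on plain strings as well as paths. The function name `shared_prefix` and the layout of the returned tuple are illustrative choices, not part of the script above.

```python
import os.path


def shared_prefix(seq1, seq2):
    """Return (end_index, length, prefix) for the run shared from pos[0].

    end_index is inclusive (as in the 0:3 example) and is None when the
    first characters already differ.
    """
    # commonprefix compares character by character, not path components.
    prefix = os.path.commonprefix([seq1, seq2])
    length = len(prefix)
    end_index = length - 1 if length else None
    return end_index, length, prefix


end, length, seq = shared_prefix('ATCCTTAGC', 'ATCCAGCAATTC')
print("Pos: 0:%s\nLength: %s\nSeq: %s" % (end, length, seq))
# Prints Pos: 0:3, Length: 4, Seq: ATCC (one field per line)
```

Returning `None` for the end index when nothing matches keeps "no match at pos[0]" distinguishable from a one-character match.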
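I'm stuck on problem 2. So far I have considered aligning the two sequences with BioPython's pairwise2. However, in this case I only want perfect matches (no gaps, no extensions), and I only want to see the single longest sequence, rather than the consensus-style alignment I seem to be getting: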
from Bio import pairwise2 as pw2
global_align = pw2.align.globalms(upstream_seq, downstream_seq, 3, -1, -.5, -.5)
print(global_align[0])
('ATACATT-G----GCC-TTGGCTTA-----G--ACTTAGATCTAG-----ACCTGAA----AATAACCTGCCGAAAA-GACC-CGCCCGACTGTTAATACTT-TACGCG-AG-GCT-CAC-C-T-TT--TTGT-TG----T---GCTCC--C-', 'ATACA--CGAAAAG-CGTT--CTT-TTTTTGCCACTT---T-T--TTTTTA--TG--TTTCAA-AA-C-G--GAAAATG---TCG--C--C-G----T-C--GT-CG-GGAGAG-TGC-CTCCTCTTAGTT-TAT-CAAATAAAGCT--TTCG', 151.0, 0, 153)
Question: How can I find the longest contiguous run of characters that is present in both strings?
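What is being asked for here is usually called the longest common substring problem. As a rough sketch of one standard way to solve it (dynamic programming over both strings, with no alignment library involved), something like the following should work. The function name `longest_common_substring` and its return layout are assumptions made for illustration.

```python
def longest_common_substring(s1, s2):
    """Return (start1, start2, length) of the longest run shared by s1 and s2.

    On ties, the first run found while scanning s1 left to right wins.
    length is 0 when the strings share no character.
    """
    best_len = best_end1 = best_end2 = 0
    # prev[j] is the length of the common run that ends at the previous
    # character of s1 and at s2[j - 1].
    prev = [0] * (len(s2) + 1)
    for i in range(1, len(s1) + 1):
        curr = [0] * (len(s2) + 1)
        for j in range(1, len(s2) + 1):
            if s1[i - 1] == s2[j - 1]:
                # Extend the run that ended one character earlier in both.
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end1, best_end2 = curr[j], i, j
        prev = curr
    return best_end1 - best_len, best_end2 - best_len, best_len


seq1 = 'TAGCTCCTTAGC'
seq2 = 'GCAGCCATCCTTA'
start1, start2, length = longest_common_substring(seq1, seq2)
if length:
    # End positions are printed inclusively, as in the examples above.
    print("Pos1: %d:%d\nPos2: %d:%d\nLength: %d\nSeq: %s" % (
        start1, start1 + length - 1,
        start2, start2 + length - 1,
        length, seq1[start1:start1 + length]))
```

The running time is proportional to len(s1) * len(s2), which is negligible for sequences of roughly 100 bases; only two rows of the table are kept in memory at any time.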
【Comments】:
-
What is your question?
-
Does anything here help? stackoverflow.com/questions/18715688/…
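In case it is useful for the exact-match requirement: the standard library's `difflib.SequenceMatcher` has a `find_longest_match` method that reports the longest gap-free block shared by two sequences. The sketch below is one possible way to use it, not a confirmed answer to the question; the wrapper name `longest_exact_match` is hypothetical.

```python
from difflib import SequenceMatcher


def longest_exact_match(s1, s2):
    """Return (start1, start2, length) of the longest gap-free shared block."""
    # autojunk=False switches off the "popular item" heuristic, which can
    # otherwise ignore frequent characters in sequences at least 200 items
    # long (likely with a four-letter DNA alphabet).
    matcher = SequenceMatcher(None, s1, s2, autojunk=False)
    match = matcher.find_longest_match(0, len(s1), 0, len(s2))
    return match.a, match.b, match.size


seq1 = 'TAGCTCCTTAGC'
seq2 = 'GCAGCCATCCTTA'
start1, start2, length = longest_exact_match(seq1, seq2)
print(start1, start2, length, seq1[start1:start1 + length])
```

The explicit start and end arguments to `find_longest_match` are passed because they only became optional in Python 3.9.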
Tags: python bioinformatics