【Question title】: Reading a FASTA file from a URL
【Posted】: 2015-04-17 04:24:12
【Question】:

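I am using Python 3.4.
I wrote some code to read a FASTA file from a website, but it did not work: http://www.uniprot.org/uniprot/B5ZC00.fasta
(I could download it as a text file and read that, but I plan to read several FASTA files from the given site.)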

(1) First attempt

# read FASTA file

def read_fasta(filename_as_string):
    """
    open text file with FASTA format
    read it and convert it into string list
    convert the list to dictionary
    >>> read_fasta('sample.txt')
        {'Rosalind_0000':'GTAT....ATGA', ... }
    """
    f = open(filename_as_string,'r')
    content = [line.strip() for line in f]
    f.close()

    new_content = []
    for line in content:
        if '>Rosalind' in line:
            new_content.append(line.strip('>'))
            new_content.append('')
        else:
            new_content[-1] += line

    dict = {}
    for i in range(len(new_content)-1):
        if i % 2 == 0:
            dict[new_content[i]] = new_content[i+1]

    return dict

This code can read any FASTA file on my desktop computer, but it cannot read the same FASTA file from the internet site.

>>> from urllib.request import urlopen
>>> html = urlopen("http://www.uniprot.org/uniprot/B5ZC00.fasta")
>>> print (read_fasta(html))
TypeError: invalid file: <http.client.HTTPResponse object at 0x02A62EF0>

(2) Second attempt

>>> from urllib.request import urlopen
>>> html = urlopen("http://www.uniprot.org/uniprot/B5ZC00.fasta")
>>> lines = [x.strip() for x in html.readlines()]
>>> print (lines)
[b'>sp|B5ZC00|SYG_UREU1 Glycine--tRNA ligase OS=Ureaplasma urealyticum serovar 10 (strain ATCC 33699 / Western) GN=glyQS PE=3 SV=1', b'MKNKFKTQEELVNHLKTVGFVFANSEIYNGLANAWDYGPLGVLLKNNLKNLWWKEFVTKQ', b'KDVVGLDSAIILNPLVWKASGHLDNFSDPLIDCKNCKARYRADKLIESFDENIHIAENSS', b'NEEFAKVLNDYEISCPTCKQFNWTEIRHFNLMFKTYQGVIEDAKNVVYLRPETAQGIFVN', b'FKNVQRSMRLHLPFGIAQIGKSFRNEITPGNFIFRTREFEQMEIEFFLKEESAYDIFDKY', b'LNQIENWLVSACGLSLNNLRKHEHPKEELSHYSKKTIDFEYNFLHGFSELYGIAYRTNYD', b'LSVHMNLSKKDLTYFDEQTKEKYVPHVIEPSVGVERLLYAILTEATFIEKLENDDERILM', b'DLKYDLAPYKIAVMPLVNKLKDKAEEIYGKILDLNISATFDNSGSIGKRYRRQDAIGTIY', b'CLTIDFDSLDDQQDPSFTIRERNSMAQKRIKLSELPLYLNQKAHEDFQRQCQK']

I thought I could modify my code to read the online FASTA file as a list of strings, but I soon realized this was not that easy.

>>> print (type(lines[0]))
<class 'bytes'>

I could not get rid of the stray "b" character at the start of every element in the list.

>>> print (lines[0])
b'>sp|B5ZC00|SYG_UREU1 Glycine--tRNA ligase ...
>>> print (lines[0][1:])
b'sp|B5ZC00|SYG_UREU1 Glycine--tRNA ligase ...

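(3) Questions

How can I remove the stray "b" character?
Is there a better way to read a FASTA file from a given URL?

With some help, I think I can fix and improve my code. Thank you.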

【Question comments】:

    Tags: url fasta


    【Answer 1】:

    I'm late, but I'll answer in case it is useful.

    In Python 2:

    import urllib2
    
    url = "http://www.uniprot.org/uniprot/B5ZC00.fasta"
    response = urllib2.urlopen(url)
    fasta = response.read()
    
    print fasta
    

    In Python 3:

    from urllib.request import urlopen
    
    url = "http://www.uniprot.org/uniprot/B5ZC00.fasta"
    response = urlopen(url)
    fasta = response.read().decode("utf-8", "ignore")
    
    print(fasta)
    

    You get:

    >sp|B5ZC00|SYG_UREU1 Glycine--tRNA ligase OS=Ureaplasma urealyticum serovar 10 (strain ATCC 33699 / Western) GN=glyQS PE=3 SV=1
    MKNKFKTQEELVNHLKTVGFVFANSEIYNGLANAWDYGPLGVLLKNNLKNLWWKEFVTKQ
    KDVVGLDSAIILNPLVWKASGHLDNFSDPLIDCKNCKARYRADKLIESFDENIHIAENSS
    NEEFAKVLNDYEISCPTCKQFNWTEIRHFNLMFKTYQGVIEDAKNVVYLRPETAQGIFVN
    FKNVQRSMRLHLPFGIAQIGKSFRNEITPGNFIFRTREFEQMEIEFFLKEESAYDIFDKY
    LNQIENWLVSACGLSLNNLRKHEHPKEELSHYSKKTIDFEYNFLHGFSELYGIAYRTNYD
    LSVHMNLSKKDLTYFDEQTKEKYVPHVIEPSVGVERLLYAILTEATFIEKLENDDERILM
    DLKYDLAPYKIAVMPLVNKLKDKAEEIYGKILDLNISATFDNSGSIGKRYRRQDAIGTIY
    CLTIDFDSLDDQQDPSFTIRERNSMAQKRIKLSELPLYLNQKAHEDFQRQCQK
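On the "b" in the question: it is not a character in the data. It is how Python 3 displays a `bytes` object, so slicing with `[1:]` removes the `>` instead, and calling `.decode()` is what gives a `str`. The following is only a minimal sketch, assuming the response is UTF-8 text. `parse_fasta` is a hypothetical helper (not part of the answer above) that mirrors the asker's `read_fasta` but takes already-decoded text and accepts any header starting with `>`; the sample record is made up and shortened.

```python
def parse_fasta(text):
    """Map each FASTA header (without the leading '>') to its joined sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]       # drop only the leading '>'
            records[header] = ""
        elif header is not None:
            records[header] += line  # append sequence lines to the current record
    return records


# Bytes in the form urlopen(...).read() returns them (hypothetical sample data).
raw = b">sp|TEST1|EXAMPLE hypothetical record\nMKNKF\nKTQEE\n"
text = raw.decode("utf-8")  # bytes -> str, so no b'...' prefix when printed
records = parse_fasta(text)
print(records)
```

With a real download, `text` would be the `fasta` variable from the Python 3 snippet above.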

    Bonus

    It is better to use Biopython (example for Python 2):

    from Bio import SeqIO
    import urllib2
    
    url = "http://www.uniprot.org/uniprot/B5ZC00.fasta"
    response = urllib2.urlopen(url)
    fasta_iterator = SeqIO.parse(response, "fasta")
    
    for seq in fasta_iterator:
      print seq.format("fasta")
    

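    【Comments】:

      【Answer 2】:

      If you are only interested in the primary amino acid sequence (and want to ignore the header), try the following (Python 2):

      import sys
      import urllib  # Python 2 only; urllib.urlopen does not exist in Python 3

      link = str(sys.argv[1]) # FASTA file URL provided as command line argument
      FASTA = urllib.urlopen(link).readlines()[1:] # as list without header (">...")
      FASTA = "".join(FASTA).replace("\n","") # as a string free of new line markers
      print FASTA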
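Since the question is about Python 3.4, here is a rough Python 3 counterpart of the same idea, working on text that has already been downloaded and decoded. `sequence_only` is a hypothetical name and the sample record is made up. Note that for a file with several records this sketch would concatenate all of their sequences.

```python
def sequence_only(fasta_text):
    """Return the residues of a FASTA string as one line, without header lines."""
    lines = fasta_text.splitlines()
    # Header lines start with '>'; everything else is sequence.
    return "".join(line.strip() for line in lines if not line.startswith(">"))


sample = ">sp|TEST1|EXAMPLE hypothetical record\nMKNKF\nKTQEE\n"
print(sequence_only(sample))
```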
      

      【Comments】:

        【Answer 3】:

        A bit late to the party, but Jose's Biopython answer no longer works as written in Python 3. Here is an alternative:

        from Bio import SeqIO
        import requests
        from io import StringIO
        
        link = "http://www.uniprot.org/uniprot/B5ZC00.fasta"
        data = requests.get(link).text
        
        fasta_iterator = SeqIO.parse(StringIO(data), "fasta")
        
        # Pretty print the fasta info
        for seq in fasta_iterator:
          print(seq.format("fasta"))
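The question mentions reading several FASTA files from the same site. One possible way to do that with only the standard library is sketched below. The URL pattern is copied from the question and may redirect or change over time; `fetch_fasta`, `fetch_many`, and the `opener` parameter are hypothetical names introduced here so the download step can be swapped out, for example when testing offline.

```python
from urllib.request import urlopen

# URL pattern taken from the question; treat it as an assumption about the site.
URL_TEMPLATE = "http://www.uniprot.org/uniprot/{}.fasta"


def fetch_fasta(accession, opener=urlopen):
    """Download one FASTA record and return it as str (assumes UTF-8 text)."""
    with opener(URL_TEMPLATE.format(accession)) as response:
        return response.read().decode("utf-8")


def fetch_many(accessions, opener=urlopen):
    """Return a dict mapping each accession to its FASTA text."""
    return {acc: fetch_fasta(acc, opener) for acc in accessions}
```

A call such as `fetch_many(["B5ZC00"])` would need network access; each returned string could then be passed to `SeqIO.parse(StringIO(text), "fasta")` as in the answer above.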
        

        【Comments】:
