【Question title】: How do you retrieve text from successive lines and make tab-delimited columns in Python?
【Posted】: 2018-10-26 06:34:22
【Question description】:

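I am completely new to Python and would like to see how this is done in it.

I have the data below, in a file called data.txt, and I want to retrieve four columns from it: first the degradome category, then the p-value, then the text before Query: and the text after it. The result should look like this: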

Degardome Category: 4    Degradome p-value: 0.00120246641531374  3' AUUAAUAACCGGCCUGUUUGC 5'   Seq_1950_218
Degardome Category: 4    Degradome p-value: 0.00360306320817827  3' ACUUUCUUUUCUUAA--UCUUUC 5'  Seq_2171_593 

data.txt:

Degardome Category: 4
Degradome p-value: 0.00120246641531374
T-Plot file: T-plots-IGR/Seq_5744_249_Supercontig_2.10_1257006_264_TPlot.pdf

Position    Reads   Category
264 1   4   <<<<<<<<<<
914 1   4
987 4   0
---------------------------------------------------
---------------------------------------------------

5' UUGGAGGUGGCUGGACGGAUG 3' Transcript: Supercontig_2.10_1395094:908-928 Slice Site:919
          ||||o||||oo|o|
3' AUUAAUAACCGGCCUGUUUGC 5' Query: Seq_1950_218
HV2.fasta_dd.txt
Degardome Category: 4
Degradome p-value: 0.00360306320817827
T-Plot file: T-plots-IGR/Seq_1950_218_Supercontig_2.10_1395094_919_TPlot.pdf

Position    Reads   Category
919 1   4   <<<<<<<<<<
---------------------------------------------------
---------------------------------------------------

5' AGAAGGGGAAGAGUGGAGGAGAG 3' Transcript: Supercontig_2.10_1543625:626-648 Slice Site:637
    |||o|oo||||o|   o||o||
3' ACUUUCUUUUCUUAA--UCUUUC 5' Query: Seq_2171_593

【Question comments】:

  • What have you tried so far?

Tags: python string text


【Solution 1】:

A solution using the re module:

import re

# Patterns that identify the three kinds of lines to keep
pattern1 = re.compile(r'Degardome Category')
pattern2 = re.compile(r'Degradome p-value')
pattern3 = re.compile(r'Query')

l1 = []  # category lines
l2 = []  # p-value lines
l3 = []  # query lines with the 'Query:' label removed

with open('/home/mayankp/data.txt') as f:
    for i in f:
        if pattern1.search(i):
            a = re.sub('\n','',i)   # drop the trailing newline
            l1.append(a)
        elif pattern2.search(i):
            a = re.sub('\n','',i)
            l2.append(a)
        elif pattern3.search(i):
            a = re.sub('Query:','',i)
            b = re.sub('\n','',a)
            l3.append(b)

In [1244]: output = list(zip(l1,l2,l3))  # list() is needed on Python 3, where zip returns an iterator

In [1245]: output
Out[1245]: 
[('Degardome Category: 4',
  'Degradome p-value: 0.00120246641531374',
  "3' AUUAAUAACCGGCCUGUUUGC 5'  Seq_1950_218"),
 ('Degardome Category: 4',
  'Degradome p-value: 0.00360306320817827',
  "3' ACUUUCUUUUCUUAA--UCUUUC 5'  Seq_2171_593")]

Now you can write this output to a file.
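One way to do that write step is sketched below with the standard csv module. The function name write_tsv and the file name result.tsv are hypothetical, and the sketch assumes the query ID is always the last whitespace-separated token of the third element, so that it can be split off into the fourth column the question asks for.

```python
import csv
import io

def write_tsv(rows, handle):
    """Write each (category, p_value, match) tuple as one tab-delimited line."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    for category, p_value, match in rows:
        # match looks like "3' SEQ 5'  Seq_1950_218": the query ID is assumed
        # to be the last whitespace-separated token, the rest is the sequence.
        sequence, query_id = match.rsplit(None, 1)
        writer.writerow([category, p_value, sequence.strip(), query_id])

# Rows copied from the output shown above.
output = [
    ("Degardome Category: 4",
     "Degradome p-value: 0.00120246641531374",
     "3' AUUAAUAACCGGCCUGUUUGC 5'  Seq_1950_218"),
    ("Degardome Category: 4",
     "Degradome p-value: 0.00360306320817827",
     "3' ACUUUCUUUUCUUAA--UCUUUC 5'  Seq_2171_593"),
]

# Write to an in-memory buffer so the result can be inspected directly.
buffer = io.StringIO()
write_tsv(output, buffer)
print(buffer.getvalue(), end="")

# To write a real file instead (result.tsv is a hypothetical name):
# with open("result.tsv", "w", newline="") as out:
#     write_tsv(output, out)
```

Passing an open file handle instead of the buffer produces the same lines on disk.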

【Comments】:

    【Solution 2】:

    If you read the whole file with
     with open('file.txt', 'r') as f:
         a = f.read()
     a = a.split('\n')
    

    you get the following list:

    ['Degardome Category: 4',
     'Degradome p-value: 0.00120246641531374',
     'T-Plot file: T-plots-IGR/Seq_5744_249_Supercontig_2.10_1257006_264_TPlot.pdf',
     '',
     'Position    Reads   Category',
     '264 1   4   <<<<<<<<<<',
     '914 1   4',
     '987 4   0',
     '---------------------------------------------------',
     '---------------------------------------------------',
     '',
     "5' UUGGAGGUGGCUGGACGGAUG 3' Transcript: Supercontig_2.10_1395094:908-928 Slice Site:919",
     '          ||||o||||oo|o|',
     "3' AUUAAUAACCGGCCUGUUUGC 5' Query: Seq_1950_218",
     'HV2.fasta_dd.txt',
     'Degardome Category: 4',
     'Degradome p-value: 0.00360306320817827',
     'T-Plot file: T-plots-IGR/Seq_1950_218_Supercontig_2.10_1395094_919_TPlot.pdf',
     '',
     'Position    Reads   Category',
     '919 1   4   <<<<<<<<<<',
     '---------------------------------------------------',
     '---------------------------------------------------',
     '',
     "5' AGAAGGGGAAGAGUGGAGGAGAG 3' Transcript: Supercontig_2.10_1543625:626-648 Slice Site:637",
     '    |||o|oo||||o|   o||o||',
     "3' ACUUUCUUUUCUUAA--UCUUUC 5' Query: Seq_2171_593"]
    

    Now initialize an empty string and concatenate all the relevant parts:

    In [4]: t = ''
    In [5]: for line in a:
    ...:     if 'Degardome Category:' in line:
    ...:         t += line + ' '
    ...:     if 'Degradome p-value:' in line:
    ...:         t += line + ' '
    ...:     if 'Query' in line:
    ...:         t += line.replace('Query:', '') + '\n'
    

    Finally, split the string on newlines:

    In [6]: out = [i for i in t.split('\n') if i]
    
    In [7]: out
    Out[7]:
    ["Degardome Category: 4 Degradome p-value: 0.00120246641531374 3' AUUAAUAACCGGCCUGUUUGC 5'  Seq_1950_218",
     "Degardome Category: 4 Degradome p-value: 0.00360306320817827 3' ACUUUCUUUUCUUAA--UCUUUC 5'  Seq_2171_593"]
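The strings above are separated by spaces rather than tabs, and both approaches assume every record contains all three kinds of line. A minimal sketch of a record-based alternative follows. It collects fields until it reaches a Query line, then emits four tab-joined columns. The names parse_records, FIELD_PATTERNS and QUERY_PATTERN are hypothetical, and the sketch assumes each record ends with its Query line, as in the sample data.

```python
import re

# Lines whose whole "label: value" text is kept as one column.
FIELD_PATTERNS = {
    "category": re.compile(r"^(Degardome Category: \S+)"),
    "p_value": re.compile(r"^(Degradome p-value: \S+)"),
}
# Captures the text before "Query:" and the ID after it.
QUERY_PATTERN = re.compile(r"^(3' \S+ 5')\s+Query:\s*(\S+)")

def parse_records(lines):
    """Yield one (category, p_value, sequence, query_id) tuple per record."""
    current = {}
    for line in lines:
        line = line.strip()
        for name, pattern in FIELD_PATTERNS.items():
            found = pattern.match(line)
            if found:
                current[name] = found.group(1)
        found = QUERY_PATTERN.match(line)
        if found:
            # Assumption: the Query line closes a record. Fields that were
            # never seen for this record become empty strings.
            yield (current.get("category", ""), current.get("p_value", ""),
                   found.group(1), found.group(2))
            current = {}

# A shortened copy of the sample data from the question.
sample = """Degardome Category: 4
Degradome p-value: 0.00120246641531374
5' UUGGAGGUGGCUGGACGGAUG 3' Transcript: Supercontig_2.10_1395094:908-928 Slice Site:919
3' AUUAAUAACCGGCCUGUUUGC 5' Query: Seq_1950_218
HV2.fasta_dd.txt
Degardome Category: 4
Degradome p-value: 0.00360306320817827
3' ACUUUCUUUUCUUAA--UCUUUC 5' Query: Seq_2171_593
"""

rows = list(parse_records(sample.splitlines()))
for row in rows:
    print("\t".join(row))
```

Because fields are grouped per record, a record with a missing p-value line leaves that column empty instead of shifting later values into the wrong row.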
    

    【Comments】:
