【Question title】:Perl - Output From Regular Expression Match Acts Very Strange, Indeed
【Posted】:2014-04-16 14:53:59
【Question description】:

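I am using Perl and a regular expression to parse entries from a (poorly) formatted input text file. My code stores the contents of the input file in $genes, and I have defined a regular expression with capture groups to store the interesting bits in three variables: $number, $name, and $sequence (see the Script.pl snippet below).

This all works perfectly until I try to print the value of $sequence. I am trying to put quotes around the value, and my output looks like this: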

Number: '132'
Name: 'rps12 AmtrCp046'
'equence: 'ATGAATCTCAATGACCAAGAATTGGCAATTGACACTGAAAGGAACTATAGAATACCTGGAATCTCACAAA

Number: '134'
Name: 'psbA AmtrCp001'
'equence: 'ATGATCCCTACCTTATTGACCGCAACTTCTGTATTTATTATCGCCTTCATTGCGGCTCCTCCAGTAGATA

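Notice that the S in Sequence is missing and has been replaced by a single quote, and that the sequence itself is not wrapped in quotes as I expected. I don't understand why the print statement for $sequence behaves so strangely. I suspect something is wrong with my regular expression, but I have no idea what it could be. Any help would be greatly appreciated!

Script.pl snippet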

while ($genes =~ />([0-9]+)\s+([A-Za-z]+)\|([A-Za-z]+)\|([A-Za-z0-9]*\s+[A-Za-z0-9]+)\s+([ACGT]+\s)/g) {
   # Get the value of the first capture group in the matched string (the first bit of stuff in parenthesis)
   # ([0-9]+)
   $number = $1;

   # Get the value of the fourth capture group
   # ([A-Za-z0-9]*\s+[A-Za-z0-9]+)
   $name = $4;

   # Get the value of the fifth capture group
   # ([ACGT]+\s)
   $sequence = $5;

   print "Number: \'" . $number . "\'\n";
   print "Name: \'" . $name . "\'\n";
   print "sequence: \'" . $sequence . "\'\n";
   print "\n";
}

Input file snippet

>132 gnl|Ambtr|rps12 AmtrCp046
ATGAATCTCAATGACCAAGAATTGGCAATTGACACTGAAAGGAACTATAGAATAACCTGGAATCTCACAAA
AATCTGAATTTTTAGAAATTGTTCATTCAATTAATTTCAAATAACATATTCGTGGAATACGATTCACTTT
CAAGATGCCTTGATGGTGAAATGGTAGACACGCGAGACTCAAAATCTCGTGCTAAAGAGCGTGGAGGTTC
GAGTCCTCTTCAAGGCATTGAGAATGCTCATTGAATGAGCAATTCAATAACAGAAACAGATCTCGGATCT
AATCGATATTGGCAAGTTTCATACGAAGTATTCCGGCGATCCCCACGATCCGAGTCCGAGCTGTTGTTTG
ATTTAGTTATTCAGTTAACCA

>134          gnl|Ambtr|psbA AmtrCp001
ATGATCCCTACCTTATTGACCGCAACTTCTGTATTTATTATCGCCTTCATTGCGGCTCCTCCAGTAGATA
TTGATGGGATCCGTGAACCTGTTTCTGGTTCTCTACTTTATGGAAACAATATTCTTTCTGGTGCCATTAT
TCCAACCTCTGCAGCTATAGGTTTGCATTTTTACCCAATATGGGAAGCGGCATCCGTTGATGAATGGTTA
TACAATGGTGGTCCTTATGAGTTAATTGTCCTACACTTCTTACTTAGTGTAGCTTGTTACATGGGTCGTG
AGTGGGAACTTAGTTTCCGTCTGGGTATGCGCCCTTGGATTGCTGTTGCATATTCAGCTCCTGTTGCAGC
TGCTACTGCTGTTTTCTTGATCTACCCTATTGGTCAAGGAAGTTTCTCAGATGGTATGCCTCTAGGAATA
TCTGGTATTTTCAACTTGATGATTGTATTCCAGGCGGAGCACAACATCCTTATGCACCCATTTCACATGT
TAGGCGTAGCTGGTGTATTCGGCGGCTCCCTATTCAGTGCTATGCATGGTTCCTTGGTAACCTCTAGTTT
GATCAGGGAAACCACTGAAAATGAGTCTGCTAATGCAGGTTACAGATTCGGTCAAGAGGAAGAAACCTAT
AATATCGTAGCTGCTCATGGTTATTTTGGTCGATTGATCTTCCAATATGCTAGTTTCAACAATTCTCGTT
CCTTACATTTCTTCCTAGCTGCTTGGCCCGTAGTAGGTATTTGGTTCACTGCTTTGGGTATTAGCACTAT
GGCTTTCAACCTAAATGGTTTCAATTTCAACCAATCCGTAGTTGACAGTCAAGGTCGTGTCATCAACACT
TGGGCTGATATAATCAACCGTGCTAACCTTGGTATGGAAGTTATGCATGAACGTAATGCTCACAATTTCC
CTCTAGACTTAGCTGCTGTTGAAGCTCCATCTACAAATGGATAA

【Question discussion】:

    Tags: regex perl


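    【Solution 1】:

    The input file appears to use CR+LF line endings. The trailing \s inside the fifth capture group matches the CR, so the CR is stored in $sequence. When the value is printed, the CR moves the cursor back to the start of the line, and the closing quote is then printed there, overwriting the "S" in "Sequence".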
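One way to check this diagnosis is to print the captured value with its control characters escaped. The following is a minimal sketch, assuming a made-up one-record sample with CR+LF endings; the helper `show_invisible` and the sample string are hypothetical and not part of the original script.

```perl
use strict;
use warnings;

# Hypothetical helper: replace CR and LF with visible escape sequences.
sub show_invisible {
    my ($text) = @_;
    $text =~ s/\r/\\r/g;
    $text =~ s/\n/\\n/g;
    return $text;
}

# Assumed sample: one record with CR+LF line endings.
my $genes = ">132 gnl|Ambtr|rps12 AmtrCp046\r\nATGAATCTCA\r\nAATCTGAATT\r\n";

# Same pattern as in the question, with \s inside the fifth capture group.
if ($genes =~ />([0-9]+)\s+([A-Za-z]+)\|([A-Za-z]+)\|([A-Za-z0-9]*\s+[A-Za-z0-9]+)\s+([ACGT]+\s)/) {
    my $sequence = $5;
    # A trailing \r shows up literally here instead of moving the cursor.
    print "Sequence: '", show_invisible($sequence), "'\n";
}
```

If the escaped output ends in `\r`, the capture group has picked up the carriage return.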

    Solution: don't capture the trailing whitespace in the variable.

    $genes =~ />([0-9]+)\s+([A-Za-z]+)\|([A-Za-z]+)\|([A-Za-z0-9]*\s+[A-Za-z0-9]+)\s+([ACGT]+)\s/g
    #                                                                                        ^^^  
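The corrected pattern can be tried in a complete loop. This is a sketch under stated assumptions: the sub `parse_records` and the two-record sample with CR+LF endings are invented for illustration, and only the regular expression comes from the answer above.

```perl
use strict;
use warnings;

# Hypothetical helper: collect number, name and sequence using the
# corrected pattern, where the trailing \s sits outside the fifth group.
sub parse_records {
    my ($genes) = @_;
    my @records;
    while ($genes =~ />([0-9]+)\s+([A-Za-z]+)\|([A-Za-z]+)\|([A-Za-z0-9]*\s+[A-Za-z0-9]+)\s+([ACGT]+)\s/g) {
        push @records, { number => $1, name => $4, sequence => $5 };
    }
    return @records;
}

# Assumed sample: two short records with CR+LF line endings.
my $genes = ">132 gnl|Ambtr|rps12 AmtrCp046\r\nATGAATCTCA\r\n\r\n"
          . ">134 gnl|Ambtr|psbA AmtrCp001\r\nATGATCCCTA\r\n";

for my $rec (parse_records($genes)) {
    print "Number: '$rec->{number}'\n";
    print "Name: '$rec->{name}'\n";
    print "Sequence: '$rec->{sequence}'\n\n";
}
```

As in the question's script, `[ACGT]+` stops at the first line break, so only the first line of each sequence is captured; the change only keeps the carriage return out of `$sequence`.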
    

    【Discussion】:

    • Ah! Thank you! I figured it was something simple/silly that I was overlooking.
    【Solution 2】:
      while ($genes =~ m/^.*?([0-9]+).*\|([\w ]+)(.+)$/simg) {
    
       # Get the value of the first capture group
       $number = $1;
    
       # Get the value of the second capture group
       $name = $2;
    
       # Get the value of the third capture group
       # (.+)
       $sequence = $3;
    
       print "Number: \'" . $number . "\'\n";
       print "Name: \'" . $name . "\'\n";
       print "sequence: \'" . $sequence . "\'\n";
       print "\n";
    }
    

    Explanation:

    Options: dot matches newline; case insensitive; ^ and $ match at line breaks
    
    Assert position at the beginning of a line (at beginning of the string or after a line break character) «^»
    Match any single character «.*?»
       Between zero and unlimited times, as few times as possible, expanding as needed (lazy) «*?»
    Match the regular expression below and capture its match into backreference number 1 «([0-9]+)»
       Match a single character in the range between “0” and “9” «[0-9]+»
          Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
    Match any single character «.*»
       Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
    Match the character “|” literally «\|»
    Match the regular expression below and capture its match into backreference number 2 «([\w ]+)»
       Match a single character present in the list below «[\w ]+»
          Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
          A word character (letters, digits, and underscores) «\w»
          The character “ ” « »
    Match the regular expression below and capture its match into backreference number 3 «(.+)»
       Match any single character «.+»
          Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
    Assert position at the end of a line (at the end of the string or before a line break character) «$»
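The effect of the /m and /s modifiers described above can be checked on a small string. This is an illustrative sketch only; the sample text and variable names are hypothetical and not taken from the input file.

```perl
use strict;
use warnings;

# Hypothetical two-line sample, not taken from the input file.
my $text = "first line\nsecond line\n";

# /m: ^ matches at the start of every line; /g in list context
# returns the captures from all matches.
my @line_starts = $text =~ /^(\w+)/mg;

# Without /s, "." stops at a newline.
my ($without_s) = $text =~ /(first.*)/;

# With /s, "." also matches newlines, so a greedy .* runs to the
# end of the string.
my ($with_s) = $text =~ /(first.*)/s;

print "line starts: @line_starts\n";
print "without /s: ", length($without_s), " characters\n";
print "with /s:    ", length($with_s), " characters\n";
```

Because /s lets the greedy `.*` and `(.+)` cross line breaks, a single match may span several records, so it is worth checking the pattern against the real file before relying on it.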
    

    【Discussion】:
