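【Question title】:Searching and marking paired patterns on a line
【Posted】:2012-03-29 00:18:31
【Question】:

I need to search for and mark patterns that are split somewhere on a line. Here is a short sample list of patterns, which are kept in a separate file, e.g.: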

CAT,TREE
LION,FOREST
OWL,WATERFALL

A match occurs when the item in column 2 appears after the item in column 1 on the same line. For example:

THEREISACATINTHETREE. (matches)

If the item in column 2 appears first on the line, there is no match, e.g.:

THETREEHASACAT. (does not match)

Also, if the items in column 1 and column 2 touch, there is no match, e.g.:

THECATTREEHASMANYBIRDS. (does not match)
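
The three rules above (second item after the first, not before it, and not touching it) can be written as a single comparison of positions. The following is only a sketch in Python; the helper names `find_all` and `pair_matches` are made up for illustration and are not part of the question.

```python
def find_all(text, word):
    """Return the start index of every occurrence of word (overlaps included)."""
    starts, pos = [], text.find(word)
    while pos != -1:
        starts.append(pos)
        pos = text.find(word, pos + 1)
    return starts


def pair_matches(line, first, second):
    """True if some `second` starts strictly after the end of some `first`."""
    first_ends = [s + len(first) for s in find_all(line, first)]
    second_starts = find_all(line, second)
    # e < s rules out both "second comes first" and "the two items touch".
    return any(e < s for e in first_ends for s in second_starts)


print(pair_matches("THEREISACATINTHETREE.", "CAT", "TREE"))    # True
print(pair_matches("THETREEHASACAT.", "CAT", "TREE"))          # False
print(pair_matches("THECATTREEHASMANYBIRDS.", "CAT", "TREE"))  # False
```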

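Once a match is found, I need to mark it with \start{n} (placed after the column 1 item) and \end{n} (placed before the column 2 item), where n is a simple counter that increases each time a match is found. For example: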

THEREISACAT\start{1}INTHE\end{1}TREE.

Here is a more complex example:

THECATANDLIONLEFTTHEFORESTANDMETANDOWLINATREENEARTHEWATERFALL.

This becomes:

THECAT\start{1}ANDLION\start{2}LEFTTHE\end{2}FORESTANDMETANDOWL\start{3}INA\end{1}TREENEARTHE\end{3}WATERFALL.

Sometimes there are multiple matches at the same place:

 THECATDOESNOTLIKETALLTREES,BUTINSTEADLIKESSHORTTREES.

This becomes:

 THECAT\start{1}\start{2}DOESNOTLIKETALL\end{1}TREES,BUTINSTEADLIKESSHORT\end{2}TREES.
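  • There are no spaces in the file.
  • Many non-Latin characters appear in the file.
  • Pattern matches only need to be found on the same line (e.g. "CAT" on line 1 never matches "TREE" found on line 2, because they are on different lines).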
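
Since the pairs live in a separate comma-separated file and the text contains non-Latin characters, a first step is to read the pairs as Unicode text. A minimal sketch in Python follows; the function name `load_patterns` and the file name `patterns.txt` are assumptions for illustration, and an in-memory `io.StringIO` stands in for the real file so the example runs on its own.

```python
import io


def load_patterns(handle):
    """Parse 'FIRST,SECOND' lines into a list of (first, second) tuples."""
    pairs = []
    for raw in handle:
        raw = raw.strip()
        if not raw:
            continue  # skip blank lines
        first, second = raw.split(",", 1)
        pairs.append((first, second))
    return pairs


# Stand-in for: open("patterns.txt", encoding="utf-8")  (hypothetical file name)
sample = io.StringIO("CAT,TREE\nLION,FOREST\nOWL,WATERFALL\n")
patterns = load_patterns(sample)
print(patterns)
```

Each line of the text file can then be processed independently, which satisfies the "same line only" constraint without any extra work.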

How can I find these matches and mark them this way?

【Discussion】:

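  • bash would be a poor choice for this task; it can be done, but the complexity would be high. Perl is well suited to the job, since it was created for tasks like this.
  • The requirements are not clearly specified. What happens with CAT...TREE...CAT...TREE? Does the first CAT match both TREE-s? Or does the second occurrence of CAT intervene? Can two CAT-s share the same terminating TREE? Should the result be CAT\start{1}\start{2}...\end{1}TREE...CAT\start{3}...\end{2}\end{3}TREE?
  • Full, automatic UTF-8 handling is really easy in Perl, and Perl lives and breathes regular expressions. I'll give it a try, though I don't know the answers to the questions @Kaz raised. There are also questions about how to handle graphemes with combining characters, since you may run into some odd cases, and I assume you don't want to match partial graphemes.
  • If the requirements were more specific, this would be a very neat (and challenging!) golf problem.
  • I had fun looking for a solution, but I'm left with one question: what do you need this for? :)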
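
To illustrate the concern about partial graphemes raised in the comments: a plain substring search can match a base letter and leave its combining accent behind. The sketch below, in Python, uses `unicodedata.combining` as a rough check; it is only an approximation of full grapheme-cluster segmentation, and the helper name `splits_grapheme` is invented for this example.

```python
import unicodedata


def splits_grapheme(line, end):
    """True if the character at index `end` is a combining mark, i.e. a
    match ending at `end` would separate a base letter from its accent."""
    return end < len(line) and unicodedata.combining(line[end]) != 0


line = "CAFE\u0301INTHETREE"  # "E" followed by U+0301 COMBINING ACUTE ACCENT
end = line.find("CAFE") + len("CAFE")
print(splits_grapheme(line, end))       # True: "CAFE" matched only part of the accented E

# After NFC normalization the accented letter is one code point,
# so the unaccented needle no longer matches at all.
composed = unicodedata.normalize("NFC", line)
print(composed.find("CAFE"))            # -1
```

Normalizing both the patterns and the text to the same form before searching avoids the most common of these surprises.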

Tags: ruby perl bash python-2.7


【Solution 1】:

Here is a Perl approach:

#!/usr/bin/perl
use strict;
use warnings;
use 5.010;

# couples of patterns to search for
my @patterns = (
    ['CAT', 'TREE'],
    ['LION', 'FOREST'],
    ['OWL', 'WATERFALL'],
);

# loop over all sentences
while (my $line = <DATA>) {
    chomp $line;    #remove linefeed
    my $count = 1;  #counter of start/end
    foreach my $pats (@patterns) {
        #$p1=first pattern, $p2=second
        my ($p1, $p2) = @$pats;

        #split on patterns (quoted literally), keep them, remove empty strings
        my @s = grep { length } split /(\Q$p1\E|\Q$p2\E)/, $line;

        #$start=position where to put the \start
        #$end=position where to put the \end
        my ($start, $end) = (undef, undef);

        #loop on all elements given by split
        for my $i (0 .. $#s) {
            # current element
            my $cur = $s[$i];

            #if = first pattern, keep its position in the array
            if ($cur eq $p1) {
                $start = $i;
            }

            #if = second pattern, keep its position in the array
            if ($cur eq $p2) {
                $end = $i;
            }

            #if both are defined and second pattern after first pattern
            # insert \start and \end
            if (defined($start) && defined($end) && $end > $start + 1) {
                $s[$start] .= "\\start{$count}";
                $s[$end] = "\\end{$count}" . $s[$end];
                undef $end;
                $count++;
            }
        }
        # recompose the line
        $line = join '', @s;
    }
    say $line;
}

__DATA__
THETREEHASACAT. (does not match)
THECATTREEHASMANYBIRDS. (does not match)
THEREISACATINTHETREE.
THECATANDLIONLEFTTHEFORESTANDMETANDOWLINATREENEARTHEWATERFALL.
THECATDOESNOTLIKETALLTREES,BUTINSTEADLIKESSHORTTREES.
CAT...TREE...CAT...TREE

Output:

THETREEHASACAT. (does not match)
THECATTREEHASMANYBIRDS. (does not match)
THEREISACAT\start{1}INTHE\end{1}TREE.
THECAT\start{1}ANDLION\start{2}LEFTTHE\end{2}FORESTANDMETANDOWL\start{3}INA\end{1}TREENEARTHE\end{3}WATERFALL.
THECAT\start{1}\start{2}DOESNOTLIKETALL\end{1}TREES,BUTINSTEADLIKESSHORT\end{2}TREES.
CAT\start{1}...\end{1}TREE...CAT\start{2}...\end{2}TREE

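【Discussion】:

  • Nice. Definitely cleaner than my version.
  • But I'm afraid this algorithm does not cover CAT...TREE...CAT...TREE correctly.
  • Also (although the description doesn't mention it), if pattern elements overlap, the pattern applied later will not match.
  • @AleksanderPohl: I'm not sure what the output for CAT...TREE...CAT...TREE has to be. I edited my answer, and the result differs from yours. Who is right? I don't know.
  • Check the second (and third) comment on the question; I think the right answer is there.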
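
The disagreement in these comments comes down to two pairing policies. The sketch below, in Python, contrasts them on the disputed input; the function names are invented here. `all_pairs` links every start word to every later end word, while `latest_first_pairs` links each end word only to the most recent start word before it, which is how the Perl loop above appears to behave.

```python
def occurrences(text, word):
    """Start index of every occurrence of word in text."""
    out, pos = [], text.find(word)
    while pos != -1:
        out.append(pos)
        pos = text.find(word, pos + 1)
    return out


def all_pairs(line, first, second):
    """Every (end of first, start of second) with a non-empty gap between them."""
    ends = [p + len(first) for p in occurrences(line, first)]
    starts = occurrences(line, second)
    return [(e, s) for e in ends for s in starts if e < s]


def latest_first_pairs(line, first, second):
    """Each `second` pairs only with the most recent `first` seen before it."""
    ends = [p + len(first) for p in occurrences(line, first)]
    pairs = []
    for s in occurrences(line, second):
        before = [e for e in ends if e <= s]
        if before and before[-1] < s:  # skip when the two words touch
            pairs.append((before[-1], s))
    return pairs


line = "CAT...TREE...CAT...TREE"
print(all_pairs(line, "CAT", "TREE"))           # [(3, 6), (3, 19), (16, 19)]
print(latest_first_pairs(line, "CAT", "TREE"))  # [(3, 6), (16, 19)]
```

Three pairs versus two is exactly the difference between the two outputs being discussed; which one is correct depends on what the asker intended.
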
【Solution 2】:

Take a look at this (Ruby):

#!/usr/bin/env ruby
patterns = [
  ['CAT', 'TREE'],
  ['LION', 'FOREST'],
  ['OWL', 'WATERFALL']
]

lines = [
  'THEREISACATINTHETREE.',
  'THETREEHASACAT.',
  'THECATTREEHASMANYBIRDS.',
  'THECATANDLIONLEFTTHEFORESTANDMETANDOWLINTREENEARTHEWATERFALL.',
  'THECATDOESNOTLIKETALLTREES,BUTINSTEADLIKESSHORTTREES.',
  'CAT...TREE...CAT...TREE'
]

lines.each do |line|
  puts line
  matches = Hash.new{|h,e| h[e] = [] }
  match_indices = []
  patterns.each do |first,second|
    offset = 0
    while new_offset = line.index(first,offset) do
      # map second element of the pattern to minimal position it might be matched
      matches[second] << new_offset + first.size + 1
      offset = new_offset + 1
    end
  end
  global_counter = 1
  matches.each do |second,offsets|
    offsets.each do |offset|
      second_offset = offset
      while new_offset = line.index(second,second_offset) do
        # register the end index of the first pattern and 
        # the start index of the second pattern with the global match count
        match_indices << [offset-1,new_offset,global_counter]
        second_offset = new_offset + 1
        global_counter += 1
      end
    end
  end
  indices = Hash.new{|h,e| h[e] = ""}
  match_indices.each do |first,second,global_counter|
    # build the insertion string for the string positions the 
    # start and end tags should be placed in
    indices[first] << "\\start{#{global_counter}}"
    indices[second] << "\\end{#{global_counter}}"
  end
  inserted_length = 0
  indices.sort_by{|k,v| k}.each do |position,insert|
    # insert the tags at their positions
    line.insert(position + inserted_length,insert)
    inserted_length += insert.size
  end
  puts line
end

Result

THEREISACATINTHETREE.
THEREISACAT\start{1}INTHE\end{1}TREE.
THETREEHASACAT.
THETREEHASACAT.
THECATTREEHASMANYBIRDS.
THECATTREEHASMANYBIRDS.
THECATANDLIONLEFTTHEFORESTANDMETANDOWLINTREENEARTHEWATERFALL.
THECAT\start{1}ANDLION\start{2}LEFTTHE\end{2}FORESTANDMETANDOWL\start{3}IN\end{1}TREENEARTHE\end{3}WATERFALL.
THECATDOESNOTLIKETALLTREES,BUTINSTEADLIKESSHORTTREES.
THECAT\start{1}\start{2}DOESNOTLIKETALL\end{1}TREES,BUTINSTEADLIKESSHORT\end{2}TREES.
CAT...TREE...CAT...TREE
CAT\start{1}\start{2}...\end{1}TREE...CAT\start{3}...\end{2}\end{3}TREE

Edit

I added some comments and clarified some variable names.

【Discussion】:

    【Solution 3】:

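    First, you have to find all occurrences of the start and end strings from the patterns. Then you need to work out which ones fit together (they do not fit if the end string comes before the start string, or sits at the same position and therefore touches it). Then you can generate your tags and insert them into the output string. Note that you need to add the number of characters already inserted to your positions, because the length of the string changes as tags are inserted. You also have to sort the tags by position before inserting them; otherwise working out how far each position must be shifted becomes very complicated. Here is a short example in Ruby: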

    patterns = [['CAT','TREE'], ['LION','FOREST'], ['OWL','WATERFALL']]
    strings = ['THEREISACATINTHETREE.', 'THETREEHASACAT.', 'THECATTREEHASMANYBIRDS.', 'THECATANDLIONLEFTTHEFORESTANDMETANDOWLINTREENEARTHEWATERFALL.', 'THECATDOESNOTLIKETALLTREES,BUTINSTEADLIKESSHORTTREES.', 'ACATONATREEANDANOTHERCATONANOTHERTREE.', 'ACATONATREEBUTNOCATTREE.']
    
    strings.each do |string|
      matches = {}; tags = []
      counter = shift = 0
      output = string.dup
    
      patterns.each do |sstr,estr|                # loop through all patterns
        posa = []; posb = [];                     #
        string.scan(sstr){posa << $~.end(0)}      # remember found positions and
        string.scan(estr){posb << $~.begin(0)}    # find all valid combinations (next line)
        matches[[sstr,estr]] = posa.product(posb).reject{|s,e|s>=e}
      end
    
      matches.each do |pat,pos|                   # loop through all matches
        pos.each do |s,e|                         # 
          tags << [s,"\\start{#{counter += 1}}"]  # generate and remember \start{}
          tags << [e,"\\end{#{counter}}"]         # and \end{} tags
        end
      end
    
      tags.sort.each do |pos,tag|                 # sort and loop through tags
        output.insert(pos+shift,tag)              # insert tag and increment
        shift += tag.chars.count                  # shift by num. of inserted chars
      end
    
      puts string, output                         # print result
    end
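
    For readers more comfortable with Python, the same steps (collect positions, keep valid combinations, sort the tags, insert with a running shift) can be sketched as follows. This is a loose port rather than the answer's own code: the function name `mark` is invented, `re.escape` is added so the words are treated literally, and ties at the same position are broken by the numeric counter instead of by string comparison.

```python
import re


def mark(line, patterns):
    """Tag every valid (first, second) combination found on one line."""
    tags, counter = [], 0
    for first, second in patterns:
        ends = [m.end() for m in re.finditer(re.escape(first), line)]
        starts = [m.start() for m in re.finditer(re.escape(second), line)]
        for e in ends:
            for s in starts:
                if e < s:  # second word after the first, and not touching
                    counter += 1
                    tags.append((e, counter, "\\start{%d}" % counter))
                    tags.append((s, counter, "\\end{%d}" % counter))
    out, shift = line, 0
    for pos, _, tag in sorted(tags):  # left to right, so shift only grows
        out = out[:pos + shift] + tag + out[pos + shift:]
        shift += len(tag)
    return out


patterns = [("CAT", "TREE"), ("LION", "FOREST"), ("OWL", "WATERFALL")]
print(mark("THEREISACATINTHETREE.", patterns))
print(mark("CATFOOTREEFOOCATFOOTREE.", patterns))
```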
    

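    It isn't pretty, but it meets all of your requirements. I think the next example is more readable and reusable; it is implemented as a Ruby class with corresponding unit tests to make sure it works correctly: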

    class PatternMarker
      require 'English'   # provides $LAST_MATCH_INFO; the library name is capitalized
    
      attr_reader :input, :output, :matches
    
      def initialize patterns
        @patterns = patterns
        raise ArgumentError, 'no patterns given' unless @patterns.any?
        @patterns.each do |p|
          raise ArgumentError, 'every pattern must have exactly two strings' unless p.count == 2
        end
      end
    
      def parse input
        @input = input.dup
        match_patterns
        generate_output
        self
      end
    
      def match?
        @matches.any?
      end
    
    private
    
      def match_patterns
        @matches = {}
        @patterns.each do |start_str,end_str|
          pos = { :start => [], :end => [] }
          @input.scan(start_str){ pos[:start] << $LAST_MATCH_INFO.end(0)   }
          @input.scan(end_str  ){ pos[:end]   << $LAST_MATCH_INFO.begin(0) }
          @matches[[start_str,end_str]] = pos[:start].product(pos[:end])
          @matches[[start_str,end_str]].reject!{ |s,e| e <= s }
          @matches.reject!{ |p,pos| pos.none? }
        end
      end
    
      def generate_output
        tags = []
        counter = shift = 0
        @output = @input.dup
    
        @matches.each do |pattern,positions|
          positions.each do |s,e|
            counter += 1
            tags << [s, "\\start{#{counter}}"]
            tags << [e, "\\end{#{counter}}"  ]
          end
        end
    
        tags.sort!.each do |position,tag|
          @output.insert(position+shift,tag)
          shift += tag.chars.count
        end
      end
    end
    

    In action:

    patterns = [
      ['CAT' , 'TREE'     ],
      ['LION', 'FOREST'   ],
      ['OWL' , 'WATERFALL']
    ]
    
    strings = [
      'THEREISACATINTHETREE.',
      'THETREEHASACAT.',
      'THECATTREEHASMANYBIRDS.',
      'THECATANDLIONLEFTTHEFORESTANDMETANDOWLINTREENEARTHEWATERFALL.',
      'THECATDOESNOTLIKETALLTREES,BUTINSTEADLIKESSHORTTREES.',
      'ACATONATREEANDANOTHERCATONANOTHERTREE.',
      'ACATONATREEBUTNOCATTREE.'
    ]
    
    marker = PatternMarker.new(patterns)
    
    strings.each do |string|
      marker.parse(string)
    
      puts "input: #{marker.input}"
    
      if marker.match?
        puts "output: #{marker.output}"
      else
        puts "(does not match)"
      end
      puts
    end
    

    Output:

    input: THEREISACATINTHETREE.
    output: THEREISACAT\start{1}INTHE\end{1}TREE.
    
    input: THETREEHASACAT.
    (does not match)
    
    input: THECATTREEHASMANYBIRDS.
    (does not match)
    
    input: THECATANDLIONLEFTTHEFORESTANDMETANDOWLINTREENEARTHEWATERFALL.
    output: THECAT\start{1}ANDLION\start{2}LEFTTHE\end{2}FORESTANDMETANDOWL\start{3}IN\end{1}TREENEARTHE\end{3}WATERFALL.
    
    input: THECATDOESNOTLIKETALLTREES,BUTINSTEADLIKESSHORTTREES.
    output: THECAT\start{1}\start{2}DOESNOTLIKETALL\end{1}TREES,BUTINSTEADLIKESSHORT\end{2}TREES.
    
    input: ACATONATREEANDANOTHERCATONANOTHERTREE.
    output: ACAT\start{1}\start{2}ONA\end{1}TREEANDANOTHERCAT\start{3}ONANOTHER\end{2}\end{3}TREE.
    
    input: ACATONATREEBUTNOCATTREE.
    output: ACAT\start{1}\start{2}ONA\end{1}TREEBUTNOCAT\end{2}TREE.
    

    Tests:

    require 'test/unit'
    
    class TestPatternMarker < Test::Unit::TestCase
      def setup
        @patterns = [
          ['CAT' , 'TREE'     ],
          ['LION', 'FOREST'   ],
          ['OWL' , 'WATERFALL']
        ]
    
        @marker = PatternMarker.new(@patterns)
      end
    
      def test_should_parse_simple
        @marker.parse 'THEREISACATINTHETREE.'
        assert @marker.match?
        assert_equal 'THEREISACAT\start{1}INTHE\end{1}TREE.', @marker.output
      end
    
      def test_should_parse_reverse
        @marker.parse 'THETREEHASACAT.'
        assert !@marker.match?
        assert_equal @marker.input, @marker.output
      end
    
      def test_should_parse_touching
        @marker.parse 'THECATTREEHASMANYBIRDS.'
        assert !@marker.match?
        assert_equal @marker.input, @marker.output
      end
    
      def test_should_parse_multiple_patterns
        @marker.parse 'THECATANDLIONLEFTTHEFORESTANDMETANDOWLINATREENEARTHEWATERFALL.'
        assert @marker.match?
        assert_equal 'THECAT\start{1}ANDLION\start{2}LEFTTHE\end{2}FORESTANDMETANDOWL\start{3}INA\end{1}TREENEARTHE\end{3}WATERFALL.', @marker.output
      end
    
      def test_should_mark_multiple_matches_at_same_place
        @marker.parse 'THECATDOESNOTLIKETALLTREES,BUTINSTEADLIKESSHORTTREES.'
        assert @marker.match?
        assert_equal 'THECAT\start{1}\start{2}DOESNOTLIKETALL\end{1}TREES,BUTINSTEADLIKESSHORT\end{2}TREES.', @marker.output
      end
    
      def test_should_mark_all_possible_matches
        @marker.parse 'CATFOOTREEFOOCATFOOTREE.'
        assert @marker.match?
        assert_equal 'CAT\start{1}\start{2}FOO\end{1}TREEFOOCAT\start{3}FOO\end{2}\end{3}TREE.', @marker.output
      end
    
      def test_should_accept_input
        @marker.parse 'CATINTREE'
        assert @marker.match?
        assert_equal 'CATINTREE', @marker.input
        @marker.parse 'FOOBAR'
        assert !@marker.match?
        assert_equal 'FOOBAR', @marker.input
      end
    
      def test_should_only_accept_valid_patterns
        assert_raise ArgumentError do PatternMarker.new([])                                end
        assert_raise ArgumentError do PatternMarker.new(['FOO','BAR'])                     end
        assert_raise ArgumentError do PatternMarker.new(['FOO','BAR'],['FOO','BAR','BAZ']) end
        assert_raise ArgumentError do PatternMarker.new(['FOO','BAR'],['BAZ'])             end
        assert_nothing_raised      do PatternMarker.new([['FOO','BAR']])                   end
      end
    end
    

    Test output:

    Loaded suite pattern
    Started
    ........
    Finished in 0.003910 seconds.
    
    8 tests, 21 assertions, 0 failures, 0 errors, 0 skips
    
    Test run options: --seed 31173
    

    Edit: added tests and simplified parts of the code

    【Discussion】:

      【Solution 4】:

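      This is a partial answer. It meets all of your requirements except the last one, for which there is no single simple solution. I'll leave that one for you to figure out :-)

      I chose a rule-based approach rather than regular expressions. On similar projects in the past I have found that simple rule-based parsers are more maintainable, more portable, and often faster than regular expressions. I'm not using any truly Ruby-specific features here, so it should be easy to port to Python or Perl. It could even be ported to C without much effort.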

      patterns = [
        ['CAT', 'TREE'],
        ['LION', 'FOREST'],
        ['OWL', 'WATERFALL']
      ]
      
      lines = [
        'THEREISACATINTHETREE.',
        'THETREEHASACAT.',
        'THECATTREEHASMANYBIRDS.',
        'THECATANDLIONLEFTTHEFORESTANDMETANDOWLINTREENEARTHEWATERFALL.',
        'THECATDOESNOTLIKETALLTREES,BUTINSTEADLIKESSHORTTREES.'
      ]
      
      newlines = []
      
      START_TAG_LENGTH = 9
      END_TAG_LENGTH = 7
      
      lines.each do |line|
      
        newline = line.dup
        before = {}
        n = 1
      
        patterns.each do |pair|
      
          a = 0
      
          matches = [[], []]
          len = pair[0].length
      
          pair.each do |pattern|
            b = 0
            while (c = line.index(pattern, b))
              matches[a] << c
              b = c + 1
            end
            break if b == 0 && a > 0
            a += 1
          end
      
          matches[0].each_with_index do |d, f|
            bd = 0; be = 0
            e = matches[1][f]
            next if (d > e) || (d + len == e)
            d = d + len
            before.each { |g, h| bd += h if g <= d }
            newline.insert(d + bd, "\\start{#{n}}")
            before[d] ||= 0
            before[d] += START_TAG_LENGTH
            before.each { |g, h| be += h if g <= e }
            newline.insert(e + be, "\\end{#{n}}")
            before[e] ||= 0
            before[e] += END_TAG_LENGTH
          end
      
          n += 1
      
        end
      
        newlines << newline
      
      end
      
      puts newlines
      

      Output:

      THEREISACAT\start{1}INTHE\end{1}TREE.
      THETREEHASACAT.
      THECATTREEHASMANYBIRDS.
      THECAT\start{1}ANDLION\start{2}LEFTTHE\end{2}FORESTANDMETANDOWL\start{3}IN\end{1}TREENEARTHE\end{3}WATERFALL.
      THECAT\start{1}DOESNOTLIKETALL\end{1}TREES,BUTINSTEADLIKESSHORTTREES.
      

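      Note that the last one fails. Still, this should give you a good start. If you need help figuring out what some of the code does, don't hesitate to ask.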
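
      The last line fails because the loop pairs the f-th start word only with the f-th end word, so the second TREE is never considered. Separately, the `before` bookkeeping can be avoided altogether: if tags are inserted from right to left, earlier positions stay valid. The sketch below shows that idea in Python; `insert_tags` is a hypothetical helper, and the positions in the example were counted by hand for this one line.

```python
def insert_tags(line, tags):
    """Insert (position, text) tags that are listed in left-to-right order.

    Working from the last tag back to the first means every earlier
    position is still correct after each insertion, so no running
    offset has to be tracked.
    """
    for pos, text in reversed(tags):
        line = line[:pos] + text + line[pos:]
    return line


line = "THECATDOESNOTLIKETALLTREES,BUTINSTEADLIKESSHORTTREES."
tags = [
    (6, "\\start{1}"),   # just after CAT
    (6, "\\start{2}"),
    (21, "\\end{1}"),    # just before the first TREE
    (47, "\\end{2}"),    # just before the second TREE
]
print(insert_tags(line, tags))
```

      Tags that share a position keep their listed order, because the later one is inserted first and the earlier one then lands in front of it.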

      By the way, just curious: what are you using this for?

      【Discussion】:

        【Solution 5】:

        Here is my solution in Python, which is unfortunately not very popular.

        patterns = [u'CAT,TREE', u'LION,FOREST', u'OWL,WATERFALL']
        
        strings = [u'THEREISACATINTHETREE.',
                   u'THETREEHASACAT.',
                   u'THECATTREEHASMANYBIRDS.',
                   u'THECATANDLIONLEFTTHEFORESTANDMETANDOWLINTREENEARTHEWATERFALL.',
                   u'THECATDOESNOTLIKETALLTREES,BUTINSTEADLIKESSHORTTREES.',
                   u'ACATONATREEANDANOTHERCATONANOTHERTREE.',
                   u'ACATONATREEBUTNOCATTREE.' ]
        
        def findMatch(needles, haystack, label):
            needles = needles.split(',')
            matches = haystack.split(needles[0])
        
            if len(matches) > 1:
                submatches = matches[1].split(needles[1])
        
                if len(submatches) > 1:
                    return u''.join([matches[0], needles[0], u'\\start{'+label+'}', submatches[0], u'\\end{'+label+'}', needles[1], submatches[1]])
        
            return False
        
        for s in strings:
            i = 0
            res = s
            for pat in patterns:
                i = i + 1
                temp = findMatch(pat, res, str(i))
        
                if (temp):
                    res = temp
        
            print ('searching in '+s+' yields '+res).encode('utf-8')
        

        【Discussion】:

          【Solution 6】:

          This one is entirely in bash (no external commands). It wasn't too hard! It expects the input lines on standard input.

          #!/bin/bash
          
          words=("CAT TREE" "LION FOREST" "OWL WATERFALL")
          
          function doit () {
            if [[ "$line" =~ (.*)$word1(.*)$word2(.*) ]]; then
              line="${BASH_REMATCH[1]}$alt_w1\\start{$count}${BASH_REMATCH[2]}\\end{$count}$word2${BASH_REMATCH[3]}"
              (( count += 1 ))
              doit
            elif [[ "$line" =~ $alt_w1 ]]; then
              line=${line//$alt_w1/$word1}
              [[ "$line" =~ (.*)$word2(.*) ]]
              line="${BASH_REMATCH[1]}$alt_w2${BASH_REMATCH[2]}"
              doit
            elif [[ "$line" =~ $alt_w2 ]]; then
              line=${line//$alt_w2/$word2}
            fi
          }
          
          while read line; do
            count=1
            for pair in "${words[@]}"; do
              word1=${pair% *}
              word2=${pair#* }
              alt_w1="${word1:0:1}XYZZYX${word1:1}"
              alt_w2="${word2:0:1}XYZZYX${word2:1}"
              doit
            done
            echo "$line"
          done
          

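          Assumptions:

          1. The text never contains "XYZZYX" (the string can be changed).
          2. The words never contain characters that are used in regular expressions.
            • e.g. . * [ ] ^ $ +
            • (They could be escaped.)
          3. The words are always at least two characters long.
          4. The words are never substrings of other words you are searching for.
            • e.g. cat and cattle
            • Actually, this would probably work, but the results would be confusing.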
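
          Assumption 2 can be lifted by quoting each word before it is placed into a regular expression. The sketch below shows the idea in Python with `re.escape`; it is not a change to the bash script, and the helper name `literal_pair_regex` and the sample words are made up for illustration.

```python
import re


def literal_pair_regex(first, second):
    """Compile a regex that treats both words literally and requires
    at least one character between them."""
    return re.compile(re.escape(first) + "(.+?)" + re.escape(second))


rx = literal_pair_regex("C.T", "TR$E")
print(bool(rx.search("AC.TINATR$E")))   # True: the dot and dollar are matched literally
print(bool(rx.search("ACATINATR$E")))   # False: "." no longer means "any character"
```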

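          【Discussion】:

          • Which regular-expression characters must not appear? Is it a problem if any character used in regular expressions appears in the input lines? And what does "the words are never substrings of other words" mean?
          • By 'never be substrings of other words' he means that if you want to match 'cat', you will also match 'cattle', and so on. Without word delimiters this is unavoidable.
          • Right: if you want to match both "cattle" and "cat", you get either "cat\start{2}tle\start{1}" or just "cat\start{1}tle", depending on which one you search for first.
          【Solution 7】:

          Here is my Perl approach. It's quick and dirty.

          It might have been better to parse with Marpa instead of regular expressions.

          Anyway, it gets the job done.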

          use strict;
          use Test::More;
          use Data::Dumper;
          
          # patterns to search for
          my @patterns = (
              'CAT,TREE',
              'LION,FOREST',
              'OWL,WATERFALL',
          );
          #lines
          my @lines = qw(
          THEREISACATINTHETREE.
          THETREEHASACAT.
          THECATTREEHASMANYBIRDS.
          THECATANDLIONLEFTTHEFORESTANDMETANDOWLINATREENEARTHEWATERFALL.
          THECATDOESNOTLIKETALLTREES,BUTINSTEADLIKESSHORTTREES.
          THECATDOESNOTLIKETALLTREES,BUTINSTEADLIKESSHORTTREESORBIGTREES.
          );
          
          
          my @expected_output = (
          'THEREISACAT\start{1}INTHE\end{1}TREE.',
          'Does not Match',
          'Does not Match',
          'THECAT\start{1}ANDLION\start{2}LEFTTHE\end{2}FORESTANDMETANDOWL\start{3}INA\end{1}TREENEARTHE\end{3}WATERFALL.',
          'THECAT\start{1}\start{2}DOESNOTLIKETALL\end{1}TREES,BUTINSTEADLIKESSHORT\end{2}TREES.',
          'THECAT\start{1}\start{2}\start{3}DOESNOTLIKETALL\end{1}TREES,BUTINSTEADLIKESSHORT\end{2}TREESORBIG\end{3}TREES.',
          );
          
          #is(check_line($lines[0]),$expected_output[0]);die;
          
          my $no=0;
          for(my $i=0;$i<scalar(@lines );$i++){   
              is(check_line($lines[$i]),$expected_output[$i]);
              $no++;
          }
          done_testing( $no );
          
          sub check_line{
              my $in      = shift;
              my $out = '';
              my $match = 1;
              foreach my $pattern_line (@patterns){
                  my ($first,$second) = split(/,/,$pattern_line);
                  #warn "$first,$second,$in\n";
                  if ($in !~ m#$first.+?$second#is){
                      next;
                  }
                  #matched    
          
                  while ($in =~ s#($first)(.+?)($second)#$1\\start\{$match\}$2\\end\{$match\}_SECOND_#is){
                      $match++;
                      #warn "Found match: $match\n";
                  }
                  $in =~ s#_SECOND_#$second#gis;
                  #$in =~ s#\\start\{(\d+)\}\\start\{(\d+)\}#\\start\{$2\}\\start\{$1\}#gis;
                  my ($end,$start) = $in =~ m#\\start\{(\d+)\}(?:\\start\{(\d+)\})+#gis;
          
                  my $stmp = join("",map {"\\start\{$_\}"} ($start..$end));
                  #print Dumper($in,$start,$end,$stmp);
                  $in =~ s#\\start\{($end)\}.*?\\start\{($start)\}#$stmp#is;
          
          
              }
              return 'Does not Match' if $match ==1;
              $out = $in;
              return $out;
          }
          

          【Discussion】:

          • Hi, if you downvoted my solution, would you mind leaving a comment explaining why?