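First, find every occurrence of the start and end strings from your patterns. Then work out which occurrences pair up: a pair is invalid if the end string comes before the start string, or if the two touch (the end string begins exactly where the start string ends). Then you can generate the tags and insert them into the output string. Note that you have to add the number of characters already inserted to each position, because the string grows with every tag you insert. You also have to sort the tags by position before inserting them; otherwise working out how far each position must be shifted gets very complicated. Here is a short example in Ruby: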
patterns = [['CAT','TREE'], ['LION','FOREST'], ['OWL','WATERFALL']]
strings = ['THEREISACATINTHETREE.', 'THETREEHASACAT.', 'THECATTREEHASMANYBIRDS.', 'THECATANDLIONLEFTTHEFORESTANDMETANDOWLINTREENEARTHEWATERFALL.', 'THECATDOESNOTLIKETALLTREES,BUTINSTEADLIKESSHORTTREES.', 'ACATONATREEANDANOTHERCATONANOTHERTREE.', 'ACATONATREEBUTNOCATTREE.']
strings.each do |string|
  matches = {}
  tags = []
  counter = shift = 0
  output = string.dup
  patterns.each do |sstr, estr|                  # loop through all patterns
    posa = []
    posb = []
    string.scan(sstr) { posa << $~.end(0) }      # index just past each start string
    string.scan(estr) { posb << $~.begin(0) }    # index where each end string begins
    # keep every combination in which the end string lies strictly after the start string
    matches[[sstr, estr]] = posa.product(posb).reject { |s, e| s >= e }
  end
  matches.each_value do |pos|                    # loop through all matches
    pos.each do |s, e|
      tags << [s, "\\start{#{counter += 1}}"]    # generate and remember \start{}
      tags << [e, "\\end{#{counter}}"]           # and \end{} tags
    end
  end
  tags.sort.each do |pos, tag|                   # sort and loop through tags
    output.insert(pos + shift, tag)              # insert tag at its shifted position
    shift += tag.length                          # shift by number of inserted chars
  end
  puts string, output                            # print result
end
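It isn't pretty, but it meets all of your requirements. I think the next example is more readable and reusable. It is implemented as a Ruby class with matching unit tests to make sure it works: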
require 'English' # provides $LAST_MATCH_INFO; the library name is capitalized

class PatternMarker
  attr_reader :input, :output, :matches

  def initialize(patterns)
    @patterns = patterns
    raise ArgumentError, 'no patterns given' unless @patterns.any?
    @patterns.each do |p|
      # each pattern must be a [start_string, end_string] pair
      unless p.is_a?(Array) && p.size == 2
        raise ArgumentError, 'every pattern must have exactly two strings'
      end
    end
  end

  def parse(input)
    @input = input.dup
    match_patterns
    generate_output
    self
  end

  def match?
    @matches.any?
  end

  private

  def match_patterns
    @matches = {}
    @patterns.each do |start_str, end_str|
      pos = { :start => [], :end => [] }
      @input.scan(start_str) { pos[:start] << $LAST_MATCH_INFO.end(0) }
      @input.scan(end_str)   { pos[:end]   << $LAST_MATCH_INFO.begin(0) }
      # every start/end combination, minus reversed or touching pairs
      pairs = pos[:start].product(pos[:end]).reject { |s, e| e <= s }
      @matches[[start_str, end_str]] = pairs
    end
    # drop patterns without any valid pair (once, after the loop)
    @matches.reject! { |_pattern, positions| positions.none? }
  end

  def generate_output
    tags = []
    counter = shift = 0
    @output = @input.dup
    @matches.each_value do |positions|
      positions.each do |s, e|
        counter += 1
        tags << [s, "\\start{#{counter}}"]
        tags << [e, "\\end{#{counter}}"]
      end
    end
    # insert in position order, shifting by the characters already inserted
    tags.sort!.each do |position, tag|
      @output.insert(position + shift, tag)
      shift += tag.length
    end
  end
end
In action:
patterns = [
['CAT' , 'TREE' ],
['LION', 'FOREST' ],
['OWL' , 'WATERFALL']
]
strings = [
'THEREISACATINTHETREE.',
'THETREEHASACAT.',
'THECATTREEHASMANYBIRDS.',
'THECATANDLIONLEFTTHEFORESTANDMETANDOWLINTREENEARTHEWATERFALL.',
'THECATDOESNOTLIKETALLTREES,BUTINSTEADLIKESSHORTTREES.',
'ACATONATREEANDANOTHERCATONANOTHERTREE.',
'ACATONATREEBUTNOCATTREE.'
]
marker = PatternMarker.new(patterns)
strings.each do |string|
  marker.parse(string)
  puts "input: #{marker.input}"
  if marker.match?
    puts "output: #{marker.output}"
  else
    puts "(does not match)"
  end
  puts
end
Output:
input: THEREISACATINTHETREE.
output: THEREISACAT\start{1}INTHE\end{1}TREE.
input: THETREEHASACAT.
(does not match)
input: THECATTREEHASMANYBIRDS.
(does not match)
input: THECATANDLIONLEFTTHEFORESTANDMETANDOWLINTREENEARTHEWATERFALL.
output: THECAT\start{1}ANDLION\start{2}LEFTTHE\end{2}FORESTANDMETANDOWL\start{3}IN\end{1}TREENEARTHE\end{3}WATERFALL.
input: THECATDOESNOTLIKETALLTREES,BUTINSTEADLIKESSHORTTREES.
output: THECAT\start{1}\start{2}DOESNOTLIKETALL\end{1}TREES,BUTINSTEADLIKESSHORT\end{2}TREES.
input: ACATONATREEANDANOTHERCATONANOTHERTREE.
output: ACAT\start{1}\start{2}ONA\end{1}TREEANDANOTHERCAT\start{3}ONANOTHER\end{2}\end{3}TREE.
input: ACATONATREEBUTNOCATTREE.
output: ACAT\start{1}\start{2}ONA\end{1}TREEBUTNOCAT\end{2}TREE.
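The pairing rule behind the last result can be looked at in isolation. What follows is only a minimal sketch and not part of PatternMarker: the helper name valid_pairs is made up for illustration, and it wraps the search strings in Regexp.escape (an assumption on my side) so that Regexp.last_match is sure to be set inside the scan block.

```ruby
# Hypothetical helper, not part of PatternMarker: lists the valid
# [start, end] index pairs for a single pattern.
def valid_pairs(text, start_str, end_str)
  starts = []
  ends   = []
  # Index just past each occurrence of the start string.
  text.scan(Regexp.new(Regexp.escape(start_str))) { starts << Regexp.last_match.end(0) }
  # Index at which each occurrence of the end string begins.
  text.scan(Regexp.new(Regexp.escape(end_str)))   { ends << Regexp.last_match.begin(0) }
  # Keep only pairs with at least one character between the two strings.
  starts.product(ends).select { |s, e| s < e }
end

p valid_pairs('ACATONATREEBUTNOCATTREE.', 'CAT', 'TREE')
# prints [[4, 7], [4, 19]]
```

The second CAT ends at index 19, exactly where the second TREE begins, so that touching pair is dropped and only the first CAT produces tags.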
Tests:
require 'test/unit'

class TestPatternMarker < Test::Unit::TestCase
  def setup
    @patterns = [
      ['CAT' , 'TREE'     ],
      ['LION', 'FOREST'   ],
      ['OWL' , 'WATERFALL']
    ]
    @marker = PatternMarker.new(@patterns)
  end

  def test_should_parse_simple
    @marker.parse 'THEREISACATINTHETREE.'
    assert @marker.match?
    assert_equal 'THEREISACAT\start{1}INTHE\end{1}TREE.', @marker.output
  end

  def test_should_parse_reverse
    @marker.parse 'THETREEHASACAT.'
    assert !@marker.match?
    assert_equal @marker.input, @marker.output
  end

  def test_should_parse_touching
    @marker.parse 'THECATTREEHASMANYBIRDS.'
    assert !@marker.match?
    assert_equal @marker.input, @marker.output
  end

  def test_should_parse_multiple_patterns
    @marker.parse 'THECATANDLIONLEFTTHEFORESTANDMETANDOWLINATREENEARTHEWATERFALL.'
    assert @marker.match?
    assert_equal 'THECAT\start{1}ANDLION\start{2}LEFTTHE\end{2}FORESTANDMETANDOWL\start{3}INA\end{1}TREENEARTHE\end{3}WATERFALL.', @marker.output
  end

  def test_should_mark_multiple_matches_at_same_place
    @marker.parse 'THECATDOESNOTLIKETALLTREES,BUTINSTEADLIKESSHORTTREES.'
    assert @marker.match?
    assert_equal 'THECAT\start{1}\start{2}DOESNOTLIKETALL\end{1}TREES,BUTINSTEADLIKESSHORT\end{2}TREES.', @marker.output
  end

  def test_should_mark_all_possible_matches
    @marker.parse 'CATFOOTREEFOOCATFOOTREE.'
    assert @marker.match?
    assert_equal 'CAT\start{1}\start{2}FOO\end{1}TREEFOOCAT\start{3}FOO\end{2}\end{3}TREE.', @marker.output
  end

  def test_should_accept_input
    @marker.parse 'CATINTREE'
    assert @marker.match?
    assert_equal 'CATINTREE', @marker.input
    @marker.parse 'FOOBAR'
    assert !@marker.match?
    assert_equal 'FOOBAR', @marker.input
  end

  def test_should_only_accept_valid_patterns
    assert_raise ArgumentError do PatternMarker.new([]) end
    assert_raise ArgumentError do PatternMarker.new(['FOO','BAR']) end
    # pass ONE array of patterns, so the error comes from the validation
    # and not from calling initialize with two arguments
    assert_raise ArgumentError do PatternMarker.new([['FOO','BAR'],['FOO','BAR','BAZ']]) end
    assert_raise ArgumentError do PatternMarker.new([['FOO','BAR'],['BAZ']]) end
    assert_nothing_raised do PatternMarker.new([['FOO','BAR']]) end
  end
end
Test output:
Loaded suite pattern
Started
........
Finished in 0.003910 seconds.
8 tests, 21 assertions, 0 failures, 0 errors, 0 skips
Test run options: --seed 31173
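A side note on the position bookkeeping mentioned at the top: the running shift is only needed because the tags are inserted front to back. A possible alternative, sketched here with made-up helper names (insert_with_shift and insert_from_back are not part of the code above), is to insert from the back so that the positions still to be processed never move.

```ruby
# Hypothetical helpers for illustration. Tags are [position, text] pairs
# whose positions refer to the original, unmodified string.

# Front to back: every insertion pushes the later positions to the right,
# so a running shift has to be added to each position.
def insert_with_shift(text, tags)
  output = text.dup
  shift = 0
  tags.sort.each do |pos, tag|
    output.insert(pos + shift, tag)
    shift += tag.length
  end
  output
end

# Back to front: text before the insertion point is never moved,
# so the original positions stay valid and no shift is needed.
def insert_from_back(text, tags)
  output = text.dup
  tags.sort.reverse_each { |pos, tag| output.insert(pos, tag) }
  output
end

tags = [[11, '\start{1}'], [16, '\end{1}']]
puts insert_with_shift('THEREISACATINTHETREE.', tags)
puts insert_from_back('THEREISACATINTHETREE.', tags)
# both print THEREISACAT\start{1}INTHE\end{1}TREE.
```

Both variants should also agree when several tags share a position, because a tag inserted later at the same index ends up in front of the one inserted before it.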
Edit: added tests and simplified some of the code.