我同意其他评论者的观点,即有更好的方法来开发这样的解析器。
不过,我想提出两个建议:
while(/\G((?:[^;'"\\]++|'[^']*+'|"[^"]*+"|\\.)*;)/gx){
print " command: $1\n";
# process the match . . .
}
此代码适用于您给定的示例。但是,例如bash 允许转义双引号字符串内的双引号,例如echo "\""。
如果您的 shell 也应该接受这样的代码,则必须扩展正则表达式:
while(/\G( # anchor for beginning
(?:[^;'"\\]++ # harmless chars
|'[^']*+' # or single-quoted string
|"(?: # or double-quoted string,
[^"\\]++ # containing harmless chars
|\\. # or an escaped char
)*+" # with arbitrary many repetitions
|\\. # or an escaped char
)*+ # with arbitrary many repetitions
;) # end with semi-colon
/gx){
print " command: $1\n";
# process the match . . .
}
这种纯正则表达式解决方案非常容易出错。而且,您发现必须处理的异常越多,模式就越复杂,调试代码就越困难。
一些测试:
use strict;
use warnings;
use Test::More tests => 16;
my $samples = [
{"'some ; text ;'" => []},
{'echo;' => ['echo;']},
{'echo ";ignore; inside ;" ; echo \'something;\' \; \'else\';' => [
'echo ";ignore; inside ;" ;', ' echo \'something;\' \; \'else\';']},
{'echo moep; echo moep;' => [ 'echo moep;', ' echo moep;']},
{'echo \a ; echo moep;' => [ 'echo \a ;', ' echo moep;']},
{'echo \\a ; echo moep;' => [ 'echo \\a ;', ' echo moep;']},
{'echo \\\a ; echo moep;' => [ 'echo \\\a ;', ' echo moep;']},
{'echo \; echo moep;' => [ 'echo \; echo moep;']},
{'echo \\; echo moep;' => [ 'echo \; echo moep;']}, # '\\;' eq '\;' !
{'echo \\\; echo moep;' => [ 'echo \\\;', ' echo moep;']},
{'echo ";\';\';"; echo moep;' => [ 'echo ";\';\';";', ' echo moep;']},
{'echo "\";"; echo moep;' => [ 'echo "\";";', ' echo moep;']},
{'echo ";\""; echo moep;' => [ 'echo ";\"";', ' echo moep;']},
{'echo "\";\""; echo moep;' => [ 'echo "\";\"";', ' echo moep;']},
{'echo ";\\\\"; echo moep;' => [ 'echo ";\\\\";', ' echo moep;']},
{'echo "\\\\\";\""; echo moep;' => [ 'echo "\\\\\";\"";', ' echo moep;']},
];
for my $sample(@$samples){
while(my ($line, $test) = each %$sample){
my @result = $line =~ /\G((?:[^;'"\\]++|'[^']*+'|"(?:[^"\\]++|\\.)*+"|\\.)*+;)/g;
is_deeply(\@result, $test, $line);
}
}
不过,您仍然可以轻松找到许多误报/误报样本。例如,我没有处理括号。使用recursive subpatterns 会使上述模式变得更加复杂。