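【Posted】: 2010-10-06 04:33:23
【Question】:
In my Python application, I need to write a regular expression that matches a C++ for or while loop that has been terminated with a semicolon (;). For example, it should match this: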
for (int i = 0; i < 10; i++);
...but not this:
for (int i = 0; i < 10; i++)
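This looks trivial at first glance, until you realise that the text between the opening and closing parentheses may itself contain parentheses, for example: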
for (int i = funcA(); i < funcB(); i++);
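To make the difficulty concrete, here is a minimal sketch of a naive pattern that simply forbids parentheses inside the loop header. The names `NAIVE` and `naive_match` are hypothetical and exist only for this illustration; this is not the pattern I am actually using.

```python
import re

# Hypothetical naive pattern: assumes the loop header contains no parentheses.
NAIVE = re.compile(r"^\s*(for|while)\s*\([^()]*\)\s*;\s*$")

def naive_match(line):
    """Return True if the naive pattern accepts the whole line."""
    return NAIVE.match(line) is not None

# Simple header with a trailing semicolon: accepted.
print(naive_match("for (int i = 0; i < 10; i++);"))
# No trailing semicolon: rejected, as wanted.
print(naive_match("for (int i = 0; i < 10; i++)"))
# Nested parentheses in the header: rejected, although it should be accepted.
print(naive_match("for (int i = funcA(); i < funcB(); i++);"))
```

The third call is the problem case: `[^()]*` stops at the first inner `(`, so the pattern can never reach the outer closing parenthesis.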
I'm using Python's re module. Right now my regular expression looks like this (I've left my comments in so you can understand it more easily):
# match any line that begins with a "for" or "while" statement:
^\s*(for|while)\s*
\( # match the initial opening parenthesis
# Now make a named group 'balanced' which matches a balanced substring.
(?P<balanced>
# A balanced substring is either something that is not a parenthesis:
[^()]
| # …or a parenthesised string:
\( # A parenthesised string begins with an opening parenthesis
(?P=balanced)* # …followed by a sequence of balanced substrings
\) # …and ends with a closing parenthesis
)* # Look for a sequence of balanced substrings
\) # Finally, the outer closing parenthesis.
# must end with a semi-colon to match:
\s*;\s*
This works perfectly for all of the cases above, but it breaks as soon as the third part of the for loop contains a function call, like so:
for (int i = 0; i < 10; doSomethingTo(i));
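I think it breaks because as soon as you put some text between the opening and closing parentheses, the "balanced" group matches that contained text, and so the (?P=balanced) part no longer works: it has to match the text the group captured earlier, and the text inside the parentheses is different.
In my Python code I'm using the VERBOSE and MULTILINE flags, and creating the regular expression like so: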
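A small sketch that seems to confirm this suspicion: in the re module, `(?P=name)` is a backreference, so it has to match the same text the named group captured earlier, not anything the group's pattern could match. The pattern `BACKREF` and the helper `repeats_word` are hypothetical names used only for this demonstration.

```python
import re

# (?P<word>...) captures some text; (?P=word) must then match that exact text again.
BACKREF = re.compile(r"(?P<word>[a-z]+)-(?P=word)$")

def repeats_word(text):
    """Return True if text is the same lowercase word twice, joined by a hyphen."""
    return BACKREF.match(text) is not None

# Same text on both sides: the backreference matches.
print(repeats_word("abc-abc"))
# "xyz" would satisfy [a-z]+, but it is not the captured text, so there is no match.
print(repeats_word("abc-xyz"))
```

So a backreference repeats captured text; it does not re-apply the group's sub-pattern, which is what a recursive "balanced" definition would need.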
import re

REGEX_STR = r"""# match any line that begins with a "for" or "while" statement:
^\s*(for|while)\s*
\( # match the initial opening parenthesis
# Now make a named group 'balanced' which matches
# a balanced substring.
(?P<balanced>
# A balanced substring is either something that is not a parenthesis:
[^()]
| # …or a parenthesised string:
\( # A parenthesised string begins with an opening parenthesis
(?P=balanced)* # …followed by a sequence of balanced substrings
\) # …and ends with a closing parenthesis
)* # Look for a sequence of balanced substrings
\) # Finally, the outer closing parenthesis.
# must end with a semi-colon to match:
\s*;\s*"""
REGEX_OBJ = re.compile(REGEX_STR, re.MULTILINE| re.VERBOSE)
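For anyone unfamiliar with the two flags, here is a minimal sketch of what they do, using a toy pattern rather than the full expression above. `TOY`, `SAMPLE` and `loop_keywords` are hypothetical names made up for this illustration. re.VERBOSE lets the pattern contain whitespace and `#` comments, and re.MULTILINE lets `^` match at the start of every line rather than only at the start of the string.

```python
import re

# Hypothetical toy pattern, only to show the effect of the two flags.
TOY = re.compile(
    r"""
    ^\s*          # start of a line, then optional indentation
    (for|while)   # the loop keyword
    \b            # the keyword must end here
    """,
    re.MULTILINE | re.VERBOSE,
)

SAMPLE = "int x = 0;\nfor (;;);\n  while (x < 3) x++;\n"

def loop_keywords(source):
    """Return the loop keywords that start a line in source, in order."""
    return [m.group(1) for m in TOY.finditer(source)]

print(loop_keywords(SAMPLE))
```

Without re.MULTILINE, `^` would only match at the very start of `SAMPLE`, so neither loop would be found.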
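Can anyone suggest an improvement to this regular expression? It's getting too complicated for me to get my head around.
【Discussion】:
Tags: c++ python regex parsing recursion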