How to split long regular expression rules to multiple lines in Python: answers

【Question title】: How to split long regular expression rules to multiple lines in Python
【Posted】: 2011-12-21 20:03:38
【Question】:

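Is this actually doable? I have some very long regex pattern rules that are hard to understand because they don't fit on the screen at once. Example: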

test = re.compile('(?P<full_path>.+):\d+:\s+warning:\s+Member\s+(?P<member_name>.+)\s+\((?P<member_type>%s)\) of (class|group|namespace)\s+(?P<class_name>.+)\s+is not documented' % (self.__MEMBER_TYPES), re.IGNORECASE)

Backslashes or triple quotes don't work.

Edit: I ended up using VERBOSE mode. Here's what the regex pattern looks like now:

test = re.compile('''
  (?P<full_path>                                  # Capture a group called full_path
    .+                                            #   It consists of one or more characters of any type
  )                                               # Group ends                      
  :                                               # A literal colon
  \d+                                             # One or more numbers (line number)
  :                                               # A literal colon
  \s+warning:\s+parameters\sof\smember\s+         # An almost static string
  (?P<member_name>                                # Capture a group called member_name
    [                                             #   
      ^:                                          #   Match anything but a colon (so finding a colon ends group)
    ]+                                            #   Match one or more characters
   )                                              # Group ends
   (                                              # Start an unnamed group 
     ::                                           #   Two literal colons
     (?P<function_name>                           #   Start another group called function_name
       \w+                                        #     It consists of one or more alphanumeric characters
     )                                            #   End group
   )*                                             # This group is entirely optional and does not apply to C
   \s+are\snot\s\(all\)\sdocumented''',           # And line ends with an almost static string
   re.IGNORECASE|re.VERBOSE)                      # Let's not worry about case, because it seems to differ between Doxygen versions

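【Question comments】:

re.VERBOSE example
@J.F. Sebastian: I had to give a separate +1 for re.DEBUG, this will make my life so much easier in the future!
@JFSebastian: I upvoted your answer behind the link, because in the end I did use it, even though it required more editing (I had to make sure every whitespace character was properly marked).
The literal style of comments such as ') # Group ends' is not very useful. I used it in my example only to answer the corresponding question. In real code you should assume the reader already knows what () means in a regex. The logic is the same as for code comments. Here's a better example (note: (?x) plays the role of re.VERBOSE).
By the way, @N3dst4's answer provides a better alternative to (?x) because it enables syntax highlighting. You can also use [ ] or \ to escape a space.

Tags: python regex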

【Answer 1】:

You can split your regex pattern by quoting each segment separately. No backslashes are needed.

test = re.compile(('(?P<full_path>.+):\d+:\s+warning:\s+Member'
                   '\s+(?P<member_name>.+)\s+\((?P<member_type>%s)\) '
                   'of (class|group|namespace)\s+(?P<class_name>.+)'
                   '\s+is not documented') % (self.__MEMBER_TYPES), re.IGNORECASE)

You can also use the raw string prefix 'r', but you have to put it before each segment.

See the docs.
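As a minimal sketch of the raw-string variant, the snippet below puts the r prefix on every segment and runs the pattern against a made-up warning line. MEMBER_TYPES and the sample line are assumptions standing in for self.__MEMBER_TYPES and real Doxygen output, which the question does not show.

```python
import re

# Hypothetical stand-in for self.__MEMBER_TYPES from the question.
MEMBER_TYPES = "function|variable|typedef"

# Every segment carries its own r prefix; the parser joins them into one literal.
pattern = (
    r"(?P<full_path>.+):\d+:\s+warning:\s+Member"
    r"\s+(?P<member_name>.+)\s+\((?P<member_type>%s)\) "
    r"of (class|group|namespace)\s+(?P<class_name>.+)"
    r"\s+is not documented"
) % MEMBER_TYPES

test = re.compile(pattern, re.IGNORECASE)

# Made-up input line, shaped like the warning the pattern targets.
line = "src/foo.h:42: warning: Member bar (function) of class Foo is not documented"
m = test.match(line)
print(m.group("full_path"), m.group("member_name"),
      m.group("member_type"), m.group("class_name"))
```

The outer parentheses matter: they make the % operator apply to the whole joined literal rather than to the last segment only.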

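【Comments】:

【Answer 2】:

From http://docs.python.org/reference/lexical_analysis.html#string-literal-concatenation:

Multiple adjacent string literals (delimited by whitespace), possibly using different quoting conventions, are allowed, and their meaning is the same as their concatenation. Thus, "hello" 'world' is equivalent to "helloworld". This feature can be used to reduce the number of backslashes needed, to split long strings conveniently across long lines, or even to add comments to parts of strings, for example: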

re.compile("[A-Za-z_]"       # letter or underscore
           "[A-Za-z0-9_]*"   # letter, digit or underscore
          )

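Note that this feature is defined at the syntactical level, but implemented at compile time. The '+' operator must be used to concatenate string expressions at run time. Also note that literal concatenation can use different quoting styles for each component (even mixing raw strings and triple quoted strings).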
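To make the last point concrete, here is a small sketch that mixes a raw double-quoted segment, a plain single-quoted segment, and a raw triple-quoted segment, then checks that the result equals the same pieces joined with '+'. The trailing \Z anchor is an addition for illustration and is not part of the quoted documentation example.

```python
import re

identifier = re.compile(
    r"[A-Za-z_]"      # raw string, double quotes: first character
    '[A-Za-z0-9_]*'   # plain string, single quotes: remaining characters
    r'''\Z'''         # raw triple-quoted string: end of input (added for this sketch)
)

# The same pattern built with '+' at run time.
joined_at_runtime = r"[A-Za-z_]" + "[A-Za-z0-9_]*" + r"\Z"

print(identifier.pattern == joined_at_runtime)
print(identifier.match("foo_1") is not None)
print(identifier.match("1foo") is not None)
```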

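【Comments】:

【Answer 3】:

The Python compiler automatically concatenates adjacent string literals. So one way to do this is to break the regex into multiple strings, one per line, and let the Python compiler put them back together. It doesn't matter how much whitespace sits between the strings, so you can use line breaks and even leading spaces to align the fragments meaningfully.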

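【Comments】:

【Answer 4】:

Either use string concatenation as in naeg's answer, or use re.VERBOSE/re.X, but be aware that this option ignores whitespace and comments. Your regex contains some literal spaces, and those would be ignored, so you need to either escape them or use \s.

For example: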

test = re.compile("""(?P<full_path>.+):\d+: # some comment
    \s+warning:\s+Member\s+(?P<member_name>.+) #another comment
    \s+\((?P<member_type>%s)\)\ of\ (class|group|namespace)\s+
    (?P<class_name>.+)\s+is\ not\ documented""" % (self.__MEMBER_TYPES), re.IGNORECASE | re.X)
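The whitespace rule is easy to trip over, so here is a small sketch of it in isolation. It compiles the phrase "is not documented" under re.VERBOSE three ways: with bare spaces, with escaped or bracketed spaces, and with \s+. The variable names are made up for this illustration.

```python
import re

# Bare spaces are dropped under re.VERBOSE: this is effectively "isnotdocumented".
unescaped = re.compile(r"is not documented", re.VERBOSE)

# A backslash-escaped space and a space inside a character class both survive.
escaped = re.compile(r"is\ not[ ]documented", re.VERBOSE)

# \s+ also works, and the layout spaces around it are ignored.
with_s = re.compile(r"is \s+ not \s+ documented", re.VERBOSE)

text = "is not documented"
print(unescaped.search(text) is not None)
print(unescaped.search("isnotdocumented") is not None)
print(escaped.search(text) is not None)
print(with_s.search(text) is not None)
```

A triple-quoted string does contain the line breaks and indentation, which is why printing it shows them. It is the regex compiler that discards them once re.VERBOSE is set.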

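【Comments】:

I tried this first, but it didn't work. Maybe I made a mistake somewhere, but my initial thought was that Python includes the whitespace. At least when I print something written in that style, the whitespace is printed too.

【Answer 5】:

Personally, I don't use re.VERBOSE, because I don't like escaping blank spaces, and I don't want to write '\s' in place of a blank space when '\s' isn't required.
The more precisely the symbols in a regex pattern describe the character sequences to be caught, the faster the regex object works. I almost never use '\s'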

.

To avoid re.VERBOSE, you can do it the way that has already been described:

test = re.compile(
'(?P<full_path>.+)'
':\d+:\s+warning:\s+Member\s+' # comment
'(?P<member_name>.+)'
'\s+\('
'(?P<member_type>%s)' # comment
'\) of '
'(class|group|namespace)'
#      ^^^^^^ underlining something to point out
'\s+'
'(?P<class_name>.+)'
#      vvv overlining something important too
'\s+is not documented'\
% (self.__MEMBER_TYPES),

re.IGNORECASE)

Pushing the strings to the left leaves plenty of room for writing comments.

.

But this approach isn't so good when the pattern is very long, because it isn't possible to write

test = re.compile(
'(?P<full_path>.+)'
':\d+:\s+warning:\s+Member\s+' # comment
'(?P<member_name>.+)'
'\s+\('
'(?P<member_type>%s)' % (self.__MEMBER_TYPES)  # !!!!!! INCORRECT SYNTAX !!!!!!!
'\) of '
'(class|group|namespace)'
#      ^^^^^^ underlining something to point out
'\s+'
'(?P<class_name>.+)'
#      vvv overlining something important too
'\s+is not documented',

re.IGNORECASE)

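So if the pattern is very long, the number of lines between
the trailing part % (self.__MEMBER_TYPES)
and the string '(?P<member_type>%s)' it applies to
can be large, and we lose the ease of reading the pattern.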

.

That's why I like to use a tuple to write a very long pattern:

pat = ''.join((
'(?P<full_path>.+)',
# you can put a comment here, you see: a very very very long comment
':\d+:\s+warning:\s+Member\s+',
'(?P<member_name>.+)',
'\s+\(',
'(?P<member_type>%s)' % (self.__MEMBER_TYPES), # comment here
'\) of ',
# comment here
'(class|group|namespace)',
#       ^^^^^^ underlining something to point out
'\s+',
'(?P<class_name>.+)',
#      vvv overlining something important too
'\s+is not documented'))

.

This approach also allows the pattern to be defined as a function:

def pat(x):

    return ''.join((\
'(?P<full_path>.+)',
# you can put a comment here, you see: a very very very long comment
':\d+:\s+warning:\s+Member\s+',
'(?P<member_name>.+)',
'\s+\(',
'(?P<member_type>%s)' % x , # comment here
'\) of ',
# comment here
'(class|group|namespace)',
#       ^^^^^^ underlining something to point out
'\s+',
'(?P<class_name>.+)',
#      vvv overlining something important too
'\s+is not documented'))

test = re.compile(pat(self.__MEMBER_TYPES), re.IGNORECASE)
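The same idea as a self-contained sketch that can be run outside a class. The function name build_pattern, the alternation string passed to it, and the sample line are all hypothetical; raw strings are used so the backslashes need no special care.

```python
import re

def build_pattern(member_types):
    """Join the pieces into one pattern; member_types is an alternation such as 'a|b'."""
    return "".join((
        r"(?P<full_path>.+)",
        # the line number and the fixed part of the message
        r":\d+:\s+warning:\s+Member\s+",
        r"(?P<member_name>.+)",
        r"\s+\(",
        r"(?P<member_type>%s)" % member_types,  # formatted right where it is used
        r"\) of ",
        r"(class|group|namespace)",
        r"\s+",
        r"(?P<class_name>.+)",
        r"\s+is not documented",
    ))

test = re.compile(build_pattern("function|variable"), re.IGNORECASE)

# Made-up input line for illustration.
m = test.match("a.h:7: warning: Member x (variable) of namespace ns is not documented")
print(m.groupdict())
```

Because each piece is a separate tuple element, the % formatting sits on the one segment that needs it, which is what the adjacent-literal form cannot express.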

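【Comments】:

【Answer 6】:

Just for completeness, the answer missing here is the one that uses the re.X or re.VERBOSE flag, which the OP eventually pointed out. Besides saving quotes, this method is also portable to other regex implementations such as Perl.

From https://docs.python.org/2/library/re.html#re.X: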

re.X
re.VERBOSE

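This flag allows you to write regular expressions that look nicer and are more readable by allowing you to visually separate logical sections of the pattern and add comments. Whitespace within the pattern is ignored, except when in a character class or when preceded by an unescaped backslash. When a line contains a # that is not in a character class and is not preceded by an unescaped backslash, all characters from the leftmost such # through the end of the line are ignored.

This means that the two following regular expression objects that match a decimal number are functionally equal: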

a = re.compile(r"""\d +  # the integral part
                   \.    # the decimal point
                   \d *  # some fractional digits""", re.X)

b = re.compile(r"\d+\.\d*")
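A quick way to convince yourself of the equivalence is to run both objects over a few inputs. The sketch below also adds a third object, c, that uses the inline (?x) flag mentioned in the question comments instead of passing re.X. The sample strings are arbitrary choices for this illustration.

```python
import re

a = re.compile(r"""\d +  # the integral part
                   \.    # the decimal point
                   \d *  # some fractional digits""", re.X)

b = re.compile(r"\d+\.\d*")

# Inline flag variant: (?x) at the very start plays the role of re.X.
c = re.compile(r"""(?x)
    \d +  # the integral part
    \.    # the decimal point
    \d *  # some fractional digits""")

samples = ["3.14", "10.", ".5", "abc"]
results = {s: [bool(p.fullmatch(s)) for p in (a, b, c)] for s in samples}
for s, flags in results.items():
    print(repr(s), flags)
```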

【Comments】: