【Question title】: Regex: multiple options for named match
【Posted】: 2020-09-30 03:41:40
【Question】:

I have a regex pattern that only partially captures what I want. The input can look like any of these:

"caller command"
"caller command specifier"
"caller command 'two-worded specifier'"
"caller 'two-worded command' specifier"
"caller 'two-worded command' 'two-worded specifier'"

My current code matches them into named groups and uses the yes/no pattern shown in the documentation for Python's re library.

import re

messages = ["your.majesty hello", "proclamation honor Dom", "your.majesty query 'Weekly Coding Challenge'", "your.majesty 'build test' submissions", "your.majesty 'build test' 'Weekly Coding Challenge'"]
call = "(?P<call>.*?)"
command = "(?P<command>'(.*?)'|(.*?))"
specifier = "(?P<specifier>'(.*?.)'|(.*?))"
# raw f-strings so that \s reaches the regex engine without an invalid-escape warning
duo = rf"{call}\s{command}"
trio = rf"({call}\s{command}\s{specifier})"

regex_duo = re.compile(duo, flags=re.DOTALL)
regex_trio = re.compile(trio)

for msg in messages:
    match = regex_trio.match(msg)
    if match is None:
        match = regex_duo.match(msg)
    print(match)

This outputs:

<re.Match object; span=(0, 13), match='your.majesty '>
<re.Match object; span=(0, 19), match='proclamation honor '>
<re.Match object; span=(0, 44), match="your.majesty query 'Weekly Coding Challenge'">
<re.Match object; span=(0, 26), match="your.majesty 'build test' ">
<re.Match object; span=(0, 51), match="your.majesty 'build test' 'Weekly Coding Challeng>

whereas I want:

<re.Match object; span=(0, ...), match='your.majesty hello'>
<re.Match object; span=(0, ...), match='proclamation honor Dom'>
<re.Match object; span=(0, ...), match="your.majesty query 'Weekly Coding Challenge'">
<re.Match object; span=(0, ...), match="your.majesty 'build test' submissions">
<re.Match object; span=(0, ...), match="your.majesty 'build test' 'Weekly Coding Challenge'">

Is there a better way to do this than what I am currently doing?
Why does it truncate so much even though I am using greedy matching?

【Comments】:

Given that every element in your list already matches here, what are you trying to achieve?

Tags: python regex string string-matching

【Solution 1】:

Solution 1: csv.reader (reuse the wheel)

Use io.StringIO to turn the input into something csv.reader can read.

Code:

from io import StringIO
import csv

messages = [
    "your.majesty hello",
    "proclamation honor Dom",
    "your.majesty query 'Weekly Coding Challenge'",
    "your.majesty 'build test' submissions",
    "your.majesty 'build test' 'Weekly Coding Challenge'"
]

# Avoid creating StringIO object multiple times
# for s in messages:
#    reader = csv.reader(StringIO(s), delimiter=" ", quotechar="'")

# load at once
ss = "\n".join(messages)
reader = csv.reader(StringIO(ss), delimiter=" ", quotechar="'")    

for row in reader:  # type(row) is a list
    caller = row[0]
    command = row[1]
    specifier = row[2] if len(row) == 3 else ""
    # check
    print(f"caller = {caller}, command = {command}, specifier = {specifier}")
    # do something with the parsed components here

Output:

caller = your.majesty, command = hello, specifier = 
caller = proclamation, command = honor, specifier = Dom
caller = your.majesty, command = query, specifier = Weekly Coding Challenge
caller = your.majesty, command = build test, specifier = submissions
caller = your.majesty, command = build test, specifier = Weekly Coding Challenge

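This solution does not produce an re.Match object; it parses out the three components directly. Later processing should be easier on plain strings than on match groups.

The advantage is this: the existing csv loader already handles quoted, space-delimited input correctly. So don't reinvent the wheel; reuse it. The code also becomes much easier to maintain.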
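For parsing one message at a time, the same idea can be wrapped in a small helper. This is a minimal sketch, not part of the original answer: the names `Parsed` and `parse_message` are hypothetical, and it assumes every message has exactly two or three fields. It relies on the fact that csv.reader accepts any iterable of strings, so a one-element list can stand in for the StringIO object.

```python
import csv
from typing import NamedTuple


class Parsed(NamedTuple):  # hypothetical container, for illustration only
    caller: str
    command: str
    specifier: str


def parse_message(message: str) -> Parsed:
    """Split one message into caller, command and optional specifier."""
    # csv.reader takes any iterable of lines; a one-element list is enough.
    row = next(csv.reader([message], delimiter=" ", quotechar="'"))
    if not 2 <= len(row) <= 3:  # assumption: two or three fields only
        raise ValueError(f"expected 2 or 3 fields, got {len(row)}: {message!r}")
    caller, command, *rest = row
    return Parsed(caller, command, rest[0] if rest else "")


print(parse_message("your.majesty 'build test' 'Weekly Coding Challenge'"))
```

Raising on an unexpected field count is one possible design choice; returning None instead would work just as well if malformed messages are routine.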

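Using pandas.read_csv

Note: pandas.read_csv() can also be used to produce a pandas.DataFrame directly. The same syntax applies, except that the column names must be assigned manually. A potentially missing column (the last one) is handled gracefully.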

import pandas as pd

pd.read_csv(StringIO(ss), delimiter=" ", quotechar="'", names=["caller", "command", "specifier"])
Out[38]: 
         caller     command                specifier
0  your.majesty       hello                      NaN
1  proclamation       honor                      Dom
2  your.majesty       query  Weekly Coding Challenge
3  your.majesty  build test              submissions
4  your.majesty  build test  Weekly Coding Challenge

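Solution 2: an improved regex (more general)

As for the regex approach: yes, it can also be improved a lot. I think it is worth going into detail here, because many (perhaps most) parsing tasks cannot be solved by an existing library.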

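Summary:

Use a raw triple-quoted string + re.VERBOSE so that the pattern can be documented in detail. (PyCharm's regex syntax highlighting is very pleasant.)
Be more precise about what the pattern matches. In general, avoid .* unless the matched characters really are arbitrary.
Use the ? quantifier to indicate optional presence.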
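The advice about avoiding .* also explains the truncation in the question: a lazy `.*?` at the end of a pattern is satisfied by the empty string, and `match()` does not require the pattern to reach the end of the input. A minimal sketch, using a simplified two-group version of the question's pattern (the names `lazy` and `precise` are illustrative, not from the original code):

```python
import re

msg = "your.majesty hello"

# Lazy groups as in the question: the trailing (.*?) can match nothing at all.
lazy = re.compile(r"(?P<call>.*?)\s(?P<command>.*?)")
m = lazy.match(msg)
print(repr(m.group()))            # 'your.majesty '
print(repr(m.group("command")))   # ''

# fullmatch() forces the pattern to consume the whole string,
# so the lazy group has to expand up to the end.
m_full = lazy.fullmatch(msg)
print(repr(m_full.group("command")))  # 'hello'

# A precise character class needs no such help: \S+ takes the whole word.
precise = re.compile(r"(?P<call>\S+)\s(?P<command>\S+)")
m_precise = precise.match(msg)
print(repr(m_precise.group()))    # 'your.majesty hello'
```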

Code:

import re

regex_uni = re.compile(r"""
    (?P<call>\S+) 
    \             # a space character
    (?P<command>  # group 2:
        (?:         # 1st option (non-capturing group):
           '          # begins with SQ
           [^']+      # followed by one or more consecutive non-SQ chars
           '          # ends with SQ
        )
        |         # or
        \S+         # 2nd option: consecutive non-space chars (assuming no SQ)
    )        
    \ ?  # optional space character
    (?P<specifier>       # group 3:   
        (?:'[^']+')|\S+    # same as group 2
    )?                   # but the existence is optional
    """, re.VERBOSE
)

for msg in messages:
    match = regex_uni.match(msg)
    if match is not None:
        print(f"* input = {match.group()}")
        print(f"    call = {match.group('call')}")
        print(f"    command = {match.group('command')}")
        print(f"    specifier = {match.group('specifier')}")

Output:

* input = your.majesty hello
    call = your.majesty
    command = hello
    specifier = None
* input = proclamation honor Dom
    call = proclamation
    command = honor
    specifier = Dom
* input = your.majesty query 'Weekly Coding Challenge'
    call = your.majesty
    command = query
    specifier = 'Weekly Coding Challenge'
* input = your.majesty 'build test' submissions
    call = your.majesty
    command = 'build test'
    specifier = submissions
* input = your.majesty 'build test' 'Weekly Coding Challenge'
    call = your.majesty
    command = 'build test'
    specifier = 'Weekly Coding Challenge'
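Note that the regex groups keep the surrounding single quotes, while the csv.reader approach drops them. If the unquoted text is wanted, the quotes can be stripped after matching. The following is a sketch under stated assumptions, not the answer's own code: `regex_strict`, `unquote` and `parse` are hypothetical names, the pattern is a compacted variation in which the separating space sits inside the optional specifier part, and `fullmatch()` is used so that trailing leftovers are rejected rather than ignored.

```python
import re

# Variation on the pattern above: quoted phrase or single word for both
# command and specifier; the specifier (with its leading space) is optional.
regex_strict = re.compile(r"""
    (?P<call>\S+)
    \                                  # a space character
    (?P<command>'[^']+'|\S+)           # quoted phrase or single word
    (?:\ (?P<specifier>'[^']+'|\S+))?  # optional: space, then phrase or word
    """, re.VERBOSE)


def unquote(token):
    """Strip one pair of surrounding single quotes, if present."""
    if token and len(token) >= 2 and token[0] == token[-1] == "'":
        return token[1:-1]
    return token


def parse(msg):
    """Return (call, command, specifier) or None if msg does not fit."""
    match = regex_strict.fullmatch(msg)
    if match is None:
        return None
    return tuple(unquote(match.group(name))
                 for name in ("call", "command", "specifier"))


print(parse("your.majesty 'build test' 'Weekly Coding Challenge'"))
print(parse("your.majesty hello"))
```

A missing specifier comes back as None here, matching the behaviour of `match.group('specifier')` in the answer's output.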

【Comments】: