Regex: extracting an arbitrary number of subpatterns (answers)

【Question title】: Regex extract arbitrary number of subpatterns
【Posted】: 2022-01-11 23:56:27
【Question】:

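My sentences have the structure "Name has digit1 word1, digit2 word2, ..., and digitN wordN", where the number of "digit word" subpatterns varies from sentence to sentence and is not known in advance. The last subpattern is preceded by "and". For example: "Alice has 1 apple, 2 bananas, ..., and 6 oranges."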

How can I extract these digits and words with a regex in Python? I would like output like this:

Name,

Digit	Word
digit1	word1
digit2	word2
...	...
digitN	wordN

I tried the following:

s = 'Alice has 1 apple, 2 bananas, and 3 oranges.'
import re
matches = re.finditer(r'([Aa-z]+) has (\d) ([a-z]+)( and)*', s)
for match in matches:
  print(match.groups())

But this only gives me ('Alice', '1', 'apple', None); '2', 'bananas', '3' and 'oranges' are missing.

【Comments on the question】:

Hey, I made the changes you suggested. Please take a look!

Tags: python regex repeat

【Solution 1】:

Use the PyPI regex module.

See the Python code:

import regex
s = 'Alice has 1 apple, 2 bananas, and 3 oranges.'
matches = regex.finditer(r'(?P<word1>[A-Za-z]+) has(?:(?:\s+|,\s+|,?\s+and\s+)?(?P<number>\d+)\s+(?P<word2>[a-z]+))*', s)
for match in matches:
  print(match.capturesdict())

Result: {'word1': ['Alice'], 'number': ['1', '2', '3'], 'word2': ['apple', 'bananas', 'oranges']}
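
If the goal is the Digit/Word table from the question, the parallel lists in that dictionary can be paired up with `zip`. A minimal sketch: the result above is pasted in as a literal so the snippet runs without the `regex` package, and the names `captures` and `rows` are only illustrative.

```python
# The dictionary printed by match.capturesdict() above, pasted as a literal.
captures = {
    'word1': ['Alice'],
    'number': ['1', '2', '3'],
    'word2': ['apple', 'bananas', 'oranges'],
}

name = captures['word1'][0]
# Pair the i-th number with the i-th word.
rows = list(zip(captures['number'], captures['word2']))

print(name)
for number, word in rows:
    print(f'{number}\t{word}')
```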

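【Comments】:

Good to know about this regex library! Is there a way to also get the start and end position of each matched number and word?
@LPat Easy.
This is great! But are the non-capturing groups necessary? This seems to work as well, unless I'm missing something?
@LPat Force of habit: they are not needed here, but they can help in other situations.
To add another layer of complexity, I hope you can help me: if there is a "big/small" word before the fruit name, e.g. "2 big apples", how do I exclude "big/small" and extract only "apples"?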
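
Neither follow-up is worked out in the thread, so here is one possible sketch using only the standard `re` module. It assumes the size words are exactly "big" and "small"; the sample sentence and the names `item` and `items` are made up for illustration. An optional non-capturing group skips the size word, and `m.span(n)` gives the start and end offsets of each group.

```python
import re

s = 'Alice has 1 big apple, 2 bananas, and 3 small oranges.'

# (?:(?:big|small)\s+)? consumes an optional size word without capturing it.
item = re.compile(r'(\d+)\s+(?:(?:big|small)\s+)?([a-z]+)')

items = []
for m in item.finditer(s):
    # span(n) returns the (start, end) offsets of group n within s.
    items.append((m.group(1), m.span(1), m.group(2), m.span(2)))

for number, number_span, word, word_span in items:
    print(number, number_span, word, word_span)
```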

【Solution 2】:

If you want to match everything in a single regex, you need something like this:

([^\s]+) has (?:(?:,\s+)?(?:and\s+)?(\d+)\s+([^\s,]+)){1,}

Regex Demo

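However, I'm not sure whether Python can handle repeated groups. At least, I haven't found a way to pull the repeated groups out of the Python match object.

Here is how I would suggest solving the problem instead: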
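
A quick check with the standard `re` module suggests the pattern itself does match the whole sentence, but when a capturing group repeats, only its last repetition is kept. A small sketch with the single-regex pattern above:

```python
import re

s = 'Alice has 1 apple, 2 bananas, and 3 oranges.'
pattern = r'([^\s]+) has (?:(?:,\s+)?(?:and\s+)?(\d+)\s+([^\s,]+)){1,}'

m = re.match(pattern, s)
# Groups 2 and 3 sit inside the repeated group, so only the final
# repetition ("3 oranges.") is still available on the match object.
print(m.groups())  # ('Alice', '3', 'oranges.')
```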

import re

s = 'Alice has 1 apple, 2 bananas, and 3 oranges.'

matches = re.match(r'^([^\s]+)', s)
print(f'Name: {matches.group(0)}')

matches = re.findall(r'(?:(?:,\s+)?(?:and\s+)?(\d+)\s+([^\s,]+))', s)

for match in matches:
    print(f'{match[0]} - {match[1]}')

Sample output

Name: Alice
1 - apple
2 - bananas
3 - oranges.

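
Note that `[^\s,]+` also picks up the sentence-final period (`oranges.` in the output above). One possible tweak, assuming the item words consist of plain ASCII letters, is to match the word with `[A-Za-z]+` instead. This is a variation on the code above, not the answerer's original:

```python
import re

s = 'Alice has 1 apple, 2 bananas, and 3 oranges.'

# Name: everything before " has".
name_match = re.match(r'(\S+) has\b', s)
name = name_match.group(1) if name_match else None

# [A-Za-z]+ stops at punctuation, so the last word is 'oranges', not 'oranges.'
pairs = re.findall(r'(\d+)\s+([A-Za-z]+)', s)

print(f'Name: {name}')
for number, word in pairs:
    print(f'{number} - {word}')
```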

Regex explanation

^([^\s]+) - There are a few different ways to do this, but it simply grabs everything up to the first space in the string.

(?:           - Non-capturing group
 (?:,\s+)?    - Optionally allow the string to have a `,` followed by spaces
 (?:and\s+)?  - Optionally allow the string to contain the word `and` followed by spaces
 (\d+)        - Must have a number
 \s+          - Spaces between number and description
 ([^\s,]+)    - Grab the next set of characters and stop when you find a space or comma. This should be the word (e.g. apple)
)

The second regex just makes sure you can extract the various forms of "1 apple". So it essentially matches the following patterns:

1 apple
, 1 apple
, and 1 apple
and 1 apple
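
The four forms listed above can be checked directly. A small sketch using `fullmatch`, so that each form has to be consumed in its entirety by the second regex:

```python
import re

item = re.compile(r'(?:(?:,\s+)?(?:and\s+)?(\d+)\s+([^\s,]+))')
forms = ['1 apple', ', 1 apple', ', and 1 apple', 'and 1 apple']

# fullmatch succeeds only if the whole string fits the pattern.
results = [item.fullmatch(form).groups() for form in forms]

for form, groups in zip(forms, results):
    print(repr(form), '->', groups)
```

Every form yields the same pair, ('1', 'apple'); the optional prefixes are matched but not captured.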

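In the long run, a parser is better suited to problems like this. You will get more variation in the sentences, and parsing them with simple regexes gets very difficult very quickly.

【Comments】:

Great write-up! I learned a lot from it. It's a pity that re doesn't allow extracting the name and the repeated groups at the same time. I'm also impressed that you wrote this within an hour of my posting the question. Much appreciated!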
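
To give a rough idea of what the parser route might look like, here is a very small hand-written sketch. It tokenizes the sentence and then walks the tokens. The function name `parse_sentence` and the error messages are made up for illustration, and it assumes the exact "Name has N word, ..., and N word" shape from the question.

```python
import re

def parse_sentence(sentence):
    """Parse 'Name has N word, N word, and N word.' into (name, [(count, word), ...])."""
    # Tokens are runs of digits or runs of letters; punctuation and spaces are skipped.
    tokens = re.findall(r'\d+|[A-Za-z]+', sentence)
    if len(tokens) < 2 or tokens[1] != 'has':
        raise ValueError(f'expected "<name> has ...", got: {sentence!r}')
    name, rest = tokens[0], tokens[2:]

    items = []
    i = 0
    while i < len(rest):
        if rest[i] == 'and':  # connective before the last item
            i += 1
            continue
        if not rest[i].isdigit() or i + 1 >= len(rest):
            raise ValueError(f'expected "<number> <word>" at token {rest[i]!r}')
        items.append((int(rest[i]), rest[i + 1]))
        i += 2
    return name, items

print(parse_sentence('Alice has 1 apple, 2 bananas, and 3 oranges.'))
```

Unlike a single regex, each step here can raise a specific error, and new rules (adjectives, units, other connectives) can be added as extra branches in the loop.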