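Match same number of repetitions of character as repetitions of captured group: answers

【Question title】: Match same number of repetitions of character as repetitions of captured group
【Posted】: 2016-12-27 10:27:14
【Question description】:

I would like to use Python and regular expressions to clean up some input that was logged from my keyboard, especially where the backspace key was used to fix a mistake.

Example 1: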

[in]:  'Helloo<BckSp> world'
[out]: 'Hello world'

This can be done with

re.sub(r'.<BckSp>', '', 'Helloo<BckSp> world')

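Example 2:
However, when there are several backspaces in a row, I don't know how to delete exactly the same number of characters before them: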

[in]:  'Helllo<BckSp><BckSp>o world'
[out]: 'Hello world'

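(Here I want to remove the 'l' and the 'o' before the two backspaces.)

I could simply apply re.sub(r'[^>]<BckSp>', '', line) several times until no <BckSp> is left, but I would like to find a more elegant/faster solution.

Does anyone know how to do this?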

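【Question discussion】:

I don't think you can count on regex for this; looping over your regex as you suggested is the best approach.
Is using regex a requirement (i.e. you are learning regex), or is it just the solution you came up with?
Yes, I'm trying to use regex in order to learn, since I'm not familiar with it yet.
Keep in mind that while there may be a regex-only solution without a loop, regex is not the tool of choice here; in this case you are better off with a simpler, more understandable solution.
Thanks for the advice, I'll keep that in mind and will probably not use regex in this case then :)

Tags: python regex backreference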

【Solution 1】:

It looks like Python's re module does not support recursive regex. If you can use another language, you could try this:

.(?R)?<BckSp>

See: https://regex101.com/r/OirPNn/1

【Discussion】:

Well, you can also install the PyPI regex module and use this approach in Python.

【Solution 2】:

This isn't very efficient, but you can do it with the re module:

(?:[^<](?=[^<]*((?=(\1?))\2<BckSp>)))+\1

demo

This way you don't have to count; the pattern only uses repetition.

(?: 
    [^<] # a character to remove
    (?=  # lookahead to reach the corresponding <BckSp>
        [^<]* # skip characters until the first <BckSp>
        (  # capture group 1: contains the <BckSp>s
            (?=(\1?))\2 # emulate an atomic group in place of \1?+
                        # The idea is to add the <BckSp>s already matched in the
                        # previous repetitions if any to be sure that the following
                        # <BckSp> isn't already associated with a character
            <BckSp> # corresponding <BckSp>
        )
    )
)+ # each time the group is repeated, the capture group 1 is growing with a new <BckSp>

\1 # matches all the consecutive <BckSp> and ensures that there's no more character
   # between the last character to remove and the first <BckSp>

You can do the same with the regex module, but this time you don't need to emulate the possessive quantifier:

(?:[^<](?=[^<]*(\1?+<BckSp>)))+\1

demo

But with the regex module, you can also use recursion (as @Fallenhero noticed):

[^<](?R)?<BckSp>

demo

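【Discussion】:

Can't upvote this without any explanation beyond the demos.

【Solution 3】:

Since Python re supports neither recursion/subroutine calls nor (before Python 3.11) atomic groups/possessive quantifiers, you can remove each character that is followed by a backspace in a loop: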

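import re

s = "Helllo\b\bo world"
# "\b" in a non-raw string is the backspace character (\x08), not a word boundary
r = re.compile("^\b+|[^\b]\b")
# Repeat until there is nothing left for a backspace to delete
while r.search(s):
    s = r.sub("", s)
print(s)  # Hello world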

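See the Python demo

The "^\b+|[^\b]\b" pattern finds one or more backspace characters at the start of the string (with ^\b+), and [^\b]\b finds every non-overlapping occurrence of a character other than a backspace that is followed by a backspace.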

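To make the effect of each pass visible, here is a minimal sketch that records the string after every substitution. It reuses the pattern from the answer; the helper name `trace_passes` is hypothetical, and `re.subn` is used only to learn whether a pass changed anything.

```python
import re

# Same pattern as in the answer: leading backspaces, or one non-backspace
# character followed by one backspace ("\b" here is the character \x08).
PATTERN = re.compile("^\b+|[^\b]\b")


def trace_passes(s):
    """Return the input followed by the string after each pass (hypothetical helper)."""
    steps = [s]
    while True:
        s, count = PATTERN.subn("", s)
        if count == 0:  # nothing was removed, so we are done
            break
        steps.append(s)
    return steps


for step in trace_passes("Helllo\b\bo world"):
    print(repr(step))
```

For this input the first pass removes only 'o' plus one backspace, because the second backspace is not preceded by an unconsumed character; the second pass then removes 'l' plus the remaining backspace.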
If the backspace is represented by some entity/tag, such as a literal <BckSp>, use the same approach:

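import re

s = "Helllo<BckSp><BckSp>o world"
# (?<!<BckSp>) keeps "." from matching the ">" that ends a previous marker,
# which would otherwise corrupt runs of three or more markers
r = re.compile("^(?:<BckSp>)+|.(?<!<BckSp>)<BckSp>", flags=re.S)
while r.search(s):
    s = r.sub("", s)
print(s)  # Hello world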

See another Python demo
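The loop can be wrapped in a function that accepts any marker string. This is only a sketch: `strip_marker` and its `marker` parameter are hypothetical names, `re.escape` is used so that markers containing regex metacharacters are treated literally, and the negative lookbehind is an addition of this sketch that stops `.` from matching the last character of a preceding marker when three or more markers appear in a row.

```python
import re


def strip_marker(s, marker="<BckSp>"):
    """Apply a multi-character backspace marker by repeated substitution (sketch)."""
    m = re.escape(marker)  # treat the marker as literal text inside the pattern
    # Either leading markers (nothing to delete), or one character that is not
    # the tail of a marker, followed by one marker.
    pattern = re.compile("^(?:%s)+|.(?<!%s)%s" % (m, m, m), flags=re.S)
    while True:
        s, count = pattern.subn("", s)
        if count == 0:  # no marker left that can delete anything
            return s


print(strip_marker("Helllo<BckSp><BckSp>o world"))
print(strip_marker("abc<BckSp><BckSp><BckSp>d"))
print(strip_marker("a[BS]b[BS][BS]c", marker="[BS]"))
```

Each pass removes at most one character per run of markers, so long runs need several passes; that is the price of keeping the pattern simple.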

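【Discussion】:

The OP has already considered a loop and is looking for a better solution.

【Solution 4】:

It's a bit verbose, but you can use this lambda function to count the number of <BckSp> occurrences and use slicing to get the final output.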

>>> import re
>>> bk = '<BckSp>'

>>> s = 'Helllo<BckSp><BckSp>o world'
>>> print(re.sub(r'(.*?)((?:' + bk + ')+)', lambda x: x.group(1)[0:len(x.group(1)) - len(x.group(2))//len(bk)], s))
Hello world

>>> s = 'Helloo<BckSp> world'
>>> print(re.sub(r'(.*?)((?:' + bk + ')+)', lambda x: x.group(1)[0:len(x.group(1)) - len(x.group(2))//len(bk)], s))
Hello world

>>> s = 'Helloo<BckSp> worl<BckSp>d'
>>> print(re.sub(r'(.*?)((?:' + bk + ')+)', lambda x: x.group(1)[0:len(x.group(1)) - len(x.group(2))//len(bk)], s))
Hello word

>>> s = 'Helllo<BckSp><BckSp>o world<BckSp><BckSp>k'
>>> print(re.sub(r'(.*?)((?:' + bk + ')+)', lambda x: x.group(1)[0:len(x.group(1)) - len(x.group(2))//len(bk)], s))
Hello work
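The same callback idea may be easier to read as a named function. The sketch below assumes Python 3; `apply_run` and `BK` are hypothetical names, and the `max(0, ...)` guard is an addition that keeps the slice from going negative when a run contains more markers than there are characters in front of it.

```python
import re

BK = "<BckSp>"  # marker used in the examples above


def apply_run(match):
    """Drop as many trailing characters of group 1 as there are markers in group 2."""
    text, run = match.group(1), match.group(2)
    n = len(run) // len(BK)       # number of consecutive markers in this run
    keep = max(0, len(text) - n)  # guard: never slice with a negative end index
    return text[:keep]


pattern = re.compile(r"(.*?)((?:" + re.escape(BK) + r")+)", flags=re.S)

print(pattern.sub(apply_run, "Helllo<BckSp><BckSp>o world<BckSp><BckSp>k"))
print(pattern.sub(apply_run, "abc<BckSp><BckSp><BckSp><BckSp><BckSp>d"))
```

One limitation of this approach remains: each match only sees the text typed since the previous run of markers, so a run cannot delete past an earlier run. For example, 'abc<BckSp>d<BckSp><BckSp>' yields 'ab' here, whereas replaying the keystrokes one by one would leave 'a'.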

【Discussion】:

【Solution 5】:

If the marker is a single character, you can use a stack, which gives you the result in a single pass:

s = "Helllo\b\bo world"
res = []

for c in s:
    if c == '\b':
        if res:
            del res[-1]
    else:
        res.append(c)

print(''.join(res)) # Hello world

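If the marker is actually '<BckSp>' or some other string longer than one character, you can use replace to substitute '\b' for it and then apply the solution above. This only works if you know that '\b' does not occur in the input. If you cannot designate a substitute character, you can use split and process the parts: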

s = 'Helllo<BckSp><BckSp>o world'
res = []

for part in s.split('<BckSp>'):
    if res:
        del res[-1]
    res.extend(part)

print(''.join(res)) # Hello world
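The replace-then-stack variant described above can be sketched as follows. The names `apply_backspaces`, `marker` and `sentinel` are hypothetical, and the explicit check for the sentinel is an assumption about how one might enforce the "does not occur in the input" condition.

```python
def apply_backspaces(s, marker="<BckSp>", sentinel="\b"):
    """Swap a multi-character marker for a one-character sentinel, then use a stack."""
    if sentinel in s:
        # The replacement trick is only safe if the sentinel never occurs in the input.
        raise ValueError("sentinel already occurs in the input")
    res = []
    for c in s.replace(marker, sentinel):
        if c == sentinel:
            if res:
                res.pop()  # a backspace with nothing before it is ignored
        else:
            res.append(c)
    return "".join(res)


print(apply_backspaces("Helllo<BckSp><BckSp>o world"))
print(apply_backspaces("<BckSp>abc<BckSp><BckSp><BckSp><BckSp>d"))
```

Unlike the loop over a regex, this makes one pass over the input regardless of how many markers appear in a row.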

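【Discussion】:

Nice approach. Would you have a workaround if the marker is <BckSp>? Maybe replacing it with \b would be the simplest...
@LouisM If you know of a character that does not occur in the input, replacing is the simplest option. For cases where no single character can be designated as a substitute, I have added an alternative solution.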