【Question title】: Variable-length masking with preg_replace
【Posted】: 2014-01-29 18:53:24
【Question】:

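I'm using preg_replace_callback() to mask every character between single quotes (quotes included) in a string. If possible, though, I'd like to use only preg_replace(), and I haven't been able to figure out how. Any help would be appreciated.

Here is what I have with preg_replace_callback(), which produces the correct output: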

function maskCallback( $matches ) {
    return str_repeat( '-', strlen( $matches[0] ) );
}
function maskString( $str ) {
    return preg_replace_callback( "('.*?')", 'maskCallback', $str );
}

$str = "TEST 'replace''me' ok 'me too'";
echo $str,"\n";
echo maskString( $str ),"\n";

The output is:

TEST 'replace''me' ok 'me too'
TEST ------------- ok --------

I tried using:

preg_replace( "/('.*?')/", '-', $str );

but each match collapses into a single dash, e.g.:

TEST -- ok -

Everything else I tried didn't work either. (I'm obviously not a regex expert.) Is this possible? If so, how?

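【Comments】:

Without a callback you can't get the strlen of the match. You would need to match each character individually. Needless to say, a callback is much better here. Why don't you want to use a callback?
To simplify as much as possible. Thanks.
I feel this can be done with a regex and that the accepted solution is wrong... but I'll have to come back to it when I have more time.
If it simplifies things, I realized that replacing the single quotes is optional for my particular problem. A solution that replaces all characters between pairs of single quotes, but not the quotes themselves, would also work. Replacing the quotes as well was simply easier with preg_replace_callback().
Looking at the two regex solutions below, I really hope you keep your callback solution. It's much more readable.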

Tags: php regex preg-replace

【Solution 1】:

Yes, you can do it (assuming the quotes are balanced). Example:

$str = "TEST 'replace''me' ok 'me too'";
$pattern = "~[^'](?=[^']*(?:'[^']*'[^']*)*+'[^']*\z)|'~";    
$result = preg_replace($pattern, '-', $str);

The idea is: you can replace a character if it is a quote, or if it is followed by an odd number of quotes.
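To make the parity idea concrete, here is a minimal sketch that runs the pattern above on the sample string and on a deliberately unbalanced one. The variable names and the unbalanced input are illustrative assumptions, not part of the original answer.

```php
<?php
// Replace a quote, or any non-quote character followed by an odd number of quotes.
$pattern = "~[^'](?=[^']*(?:'[^']*'[^']*)*+'[^']*\\z)|'~";

// Balanced quotes: only the quoted segments (quotes included) are masked.
$balanced = preg_replace($pattern, '-', "TEST 'replace''me' ok 'me too'");
echo $balanced, "\n"; // TEST ------------- ok --------

// Unbalanced quotes (assumed input): every character before the lone quote is
// followed by an odd number of quotes, so it is masked as well.
$unbalanced = preg_replace($pattern, '-', "a 'b");
echo $unbalanced, "\n"; // ---b
```

The second call shows why the answer assumes balanced quotes: the pattern only counts quotes to the right of each character, so a stray quote flips the parity for everything before it.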

Without replacing the quotes:

$pattern = "~(?:(?!\A)\G|(?:(?!\G)|\A)'\K)[^']~";
$result = preg_replace($pattern, '-', $str);

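The pattern matches a character only when it is contiguous to the previous match (in other words, when it immediately follows the last match), or when it is preceded by a quote that is not contiguous to the previous match.

\G is the position after the last match (initially, the start of the string).

Pattern details: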

~             # pattern delimiter

(?: # non capturing group: describe the two possibilities
    # before the target character

    (?!\A)\G  # at the position in the string after the last match
              # the negative lookahead ensures that this is not the start
              # of the string

  |           # OR

    (?:       # (to ensure that the quote is not a closing quote)
        (?!\G)   # not contiguous to a previous match
      |          # OR
        \A       # at the start of the string
    )
    '         # the opening quote

    \K        # remove all preceding characters from the match result
              # (only one quote here)
)

[^']          # a character that is not a quote

~

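Note that since the pattern never matches the closing quote, the non-quote characters that follow it can't be matched, because there is no contiguous previous match.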
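As a rough check of that behaviour, the sketch below lists the offsets of every character the quote-preserving pattern matches in the sample string. The use of preg_match_all() with PREG_OFFSET_CAPTURE and the variable names are my own choices for illustration.

```php
<?php
$str = "TEST 'replace''me' ok 'me too'";
$pattern = "~(?:(?!\\A)\\G|(?:(?!\\G)|\\A)'\\K)[^']~";

// Collect the byte offset of each single-character match.
preg_match_all($pattern, $str, $m, PREG_OFFSET_CAPTURE);
$offsets = array_column($m[0], 1);
echo implode(',', $offsets), "\n";
// 6,7,8,9,10,11,12,15,16,23,24,25,26,27,28

// The same pattern used for masking: quotes and unquoted text are untouched.
$masked = preg_replace($pattern, '-', $str);
echo $masked, "\n"; // TEST '-------''--' ok '------'
```

The gaps in the offset list (13-14, 17-22) are the closing quotes, the opening quotes, and the unquoted " ok " run, none of which the pattern can match.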

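EDIT:

The (*SKIP)(*FAIL) way:

Instead of testing that a single quote is not a closing quote with (?:(?!\G)|\A)' as in the previous pattern, you can break the match contiguity on closing quotes with the backtracking control verbs (*SKIP) and (*FAIL) (which can be shortened to (*F)).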

$pattern = "~(?:(?!\A)\G|')(?:'(*SKIP)(*F)|\K[^'])~";
$result = preg_replace($pattern, '-', $str);

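Since the pattern fails on every closing quote, the following characters are not matched until the next opening quote.

Written this way, the pattern may be more efficient: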

$pattern = "~(?:\G(?!\A)(?:'(*SKIP)(*F))?|'\K)[^']~";

(You can also use (*PRUNE) instead of (*SKIP).)
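A quick way to convince yourself that the three quote-preserving patterns from this answer behave the same on the sample string is to run them side by side. This is only a sketch; the array keys are hypothetical labels, and it says nothing about relative performance.

```php
<?php
$str = "TEST 'replace''me' ok 'me too'";

// The three quote-preserving patterns from this answer (labels are illustrative).
$patterns = [
    'lookaround'  => "~(?:(?!\\A)\\G|(?:(?!\\G)|\\A)'\\K)[^']~",
    'skip_fail'   => "~(?:(?!\\A)\\G|')(?:'(*SKIP)(*F)|\\K[^'])~",
    'skip_fail_2' => "~(?:\\G(?!\\A)(?:'(*SKIP)(*F))?|'\\K)[^']~",
];

$results = [];
foreach ($patterns as $name => $pattern) {
    $results[$name] = preg_replace($pattern, '-', $str);
    echo str_pad($name, 12), ' ', $results[$name], "\n";
}
// Each line should end with: TEST '-------''--' ok '------'
```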

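【Comments】:

Thanks. This does solve the problem. I realized afterwards that replacing the single quotes isn't really needed; it was just easier to do with preg_replace_callback(). Can your solution be simplified if I don't need to replace the surrounding single quotes? Thanks.
@Alan: I added a version that leaves the quotes alone; that pattern is indeed shorter and more efficient than the first one (fewer tests).
Thanks for the input, everyone! Going with this (the shortest regex) solution.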

【Solution 2】:

Short answer: it's possible!!!

Use the following pattern:

'                                     # Match a single quote
(?=                                   # Positive lookahead, this basically makes sure there is an odd number of single quotes ahead in this line
   (?:(?:[^'\r\n]*'){2})*             # Match anything except single quote or newlines zero or more times followed by a single quote, repeat this twice and repeat this whole process zero or more times (basically a pair of single quotes)
   (?:[^'\r\n]*'[^'\r\n]*(?:\r?\n|$)) # You guessed, this is to match a single quote until the end of line
)
|                                     # or
\G(?<!^)                              # Preceding contiguous match (not beginning of line)
[^']                                  # Match anything that's not a single quote
(?=                                   # Same as above
   (?:(?:[^'\r\n]*'){2})*             # Same as above
   (?:[^'\r\n]*'[^'\r\n]*(?:\r?\n|$)) # Same as above
)
|
\G(?<!^)                              # Preceding contiguous match (not beginning of line)
'                                     # Match a single quote

Make sure to use the m modifier.

Online demo.

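Long answer: it's a pain :)

Only if you and your whole team love regex might you consider using this pattern, and even then keep in mind that it is insane and hard for beginners to grasp. Readability (almost) always comes first.

Here is a breakdown of how I arrived at this regex: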

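1) First we need to know what we actually want to replace: every character between two single quotes (quotes included) should become a hyphen.
2) If we're going to use preg_replace(), our pattern has to match one character at a time.
3) So the first step is obvious: '.
4) We'll use \G, which matches at the start of the string or at the position where the previous match ended. Take the simple example ~a|\Gb~: it matches a, or b if it is at the start of the string, or b if it immediately follows the previous match. See this demo.
5) We don't want anything to do with the start of the string, so we'll use \G(?<!^).
6) Now we need to match anything that's not a single quote: ~'|\G(?<!^)[^']~.
7) Now the real pain begins: how do we know the pattern above won't match c in 'ab'c? Well, it will, so we need to count the single quotes...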
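The behaviour of \G described in steps 4 and 5 can be tried out with a small sketch like the one below. The test strings are assumptions chosen for illustration; preg_match_all() is used because each new search starts where the previous match ended.

```php
<?php
// \G anchors at the start of the subject or right after the previous match.
preg_match_all('~a|\Gb~', 'abbcb', $m1);
$first = implode(',', $m1[0]);
echo $first, "\n"; // a,b,b  (the last b follows "c", not a match, so it is skipped)

// At the very start of the string, \G lets b match immediately.
preg_match_all('~a|\Gb~', 'bab', $m2);
$second = implode(',', $m2[0]);
echo $second, "\n"; // b,a,b

// Adding (?<!^) rules out the start of the string, as in step 5.
preg_match_all('~a|\G(?<!^)b~', 'bab', $m3);
$third = implode(',', $m3[0]);
echo $third, "\n"; // a,b
```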

Let's recap:

a 'bcd' efg 'hij'
  ^ It will match this first
   ^^^ Then it will match these individually with \G(?<!^)[^']
      ^ It will match since we're matching single quotes without checking anything
        ^^^^^ And it will continue to match ...

What we want can be expressed in these 3 rules:

a 'bcd' efg 'hij'
1 ^ Match a single quote only if there is an odd number of single quotes ahead
2  ^^^ Match individually those characters only if there is an odd number of single quotes ahead
3     ^ Match a single quote only if there was a match before this character

8) We can check for an odd number of single quotes if we know how to match an even number:

(?:              # non-capturing group
   (?:           # non-capturing group
      [^'\r\n]*  # Match anything that's not a single quote or newline, zero or more times
      '          # Match a single quote
   ){2}          # Repeat 2 times (We'll be matching 2 single quotes)
)*               # Repeat all this zero or more times. So we match 0, 2, 4, 6 ... single quotes

9) Now the odd number is easy, we just need to add:

(?:
   [^'\r\n]*     # Match anything that's not a single quote or newline, zero or more times
   '             # Match a single quote
   [^'\r\n]*     # Match anything that's not a single quote or newline, zero or more times
   (?:\r?\n|$)   # End of line
)

10) Merge the above into a single lookahead:

(?=
   (?:              # non-capturing group
      (?:           # non-capturing group
         [^'\r\n]*  # Match anything that's not a single quote or newline, zero or more times
         '          # Match a single quote
      ){2}          # Repeat 2 times (We'll be matching 2 single quotes)
   )*               # Repeat all this zero or more times. So we match 0, 2, 4, 6 ... single quotes
   (?:
      [^'\r\n]*     # Match anything that's not a single quote or newline, zero or more times
      '             # Match a single quote
      [^'\r\n]*     # Match anything that's not a single quote or newline, zero or more times
      (?:\r?\n|$)   # End of line
   )
)

11) Now we need to combine all 3 rules defined earlier:

~                   # Pattern delimiter
#################################### Rule 1 ####################################
'                   # A single quote
(?=                 # Lookahead to make sure there is an odd number of single quotes ahead
   (?:              # non-capturing group
      (?:           # non-capturing group
         [^'\r\n]*  # Match anything that's not a single quote or newline, zero or more times
         '          # Match a single quote
      ){2}          # Repeat 2 times (We'll be matching 2 single quotes)
   )*               # Repeat all this zero or more times. So we match 0, 2, 4, 6 ... single quotes
   (?:
      [^'\r\n]*     # Match anything that's not a single quote or newline, zero or more times
      '             # Match a single quote
      [^'\r\n]*     # Match anything that's not a single quote or newline, zero or more times
      (?:\r?\n|$)   # End of line
   )
)

|                   # Or

#################################### Rule 2 ####################################
\G(?<!^)            # Preceding contiguous match (not beginning of line)
[^']                # Match anything that's not a single quote
(?=                 # Lookahead to make sure there is an odd number of single quotes ahead
   (?:              # non-capturing group
      (?:           # non-capturing group
         [^'\r\n]*  # Match anything that's not a single quote or newline, zero or more times
         '          # Match a single quote
      ){2}          # Repeat 2 times (We'll be matching 2 single quotes)
   )*               # Repeat all this zero or more times. So we match 0, 2, 4, 6 ... single quotes
   (?:
      [^'\r\n]*     # Match anything that's not a single quote or newline, zero or more times
      '             # Match a single quote
      [^'\r\n]*     # Match anything that's not a single quote or newline, zero or more times
      (?:\r?\n|$)   # End of line
   )
)

|                   # Or

#################################### Rule 3 ####################################
\G(?<!^)            # Preceding contiguous match (not beginning of line)
'                   # Match a single quote
~x
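For reference, here is one way the three rules might be wired into preg_replace(). It is a sketch rather than the answer's exact code: the lookahead is stored once in a hypothetical variable ($oddQuotesAhead) and concatenated, the comments are dropped, and only the m modifier is used.

```php
<?php
// Lookahead: an odd number of single quotes remains on the current line.
// A nowdoc keeps the quotes, backslashes and $ literal.
$oddQuotesAhead = <<<'RE'
(?=(?:(?:[^'\r\n]*'){2})*[^'\r\n]*'[^'\r\n]*(?:\r?\n|$))
RE;

$pattern = "~'" . $oddQuotesAhead              // Rule 1: opening quote
    . "|\\G(?<!^)[^']" . $oddQuotesAhead       // Rule 2: characters inside the quotes
    . "|\\G(?<!^)'~m";                         // Rule 3: closing quote

$str = "TEST 'replace''me' ok 'me too'";
$result = preg_replace($pattern, '-', $str);
echo $result, "\n"; // TEST ------------- ok --------
```

Factoring the lookahead into a variable avoids repeating it by hand, though it does not make the pattern any cheaper to run.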

Online regex demo. Online PHP demo

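【Comments】:

Wow, insane. In my particular problem I won't have an odd number of single quotes, so this is a bit more complex than my problem requires, but I appreciate the thorough solution. Maybe it will help others who do run into an odd number of single quotes. Thanks!
@Alan Well, I said "odd" because the first one we match makes it "even". A bit confusing, yeah...

【Solution 3】:

Well, just for fun (and I really don't recommend something like this, because I try to avoid unnecessary lookarounds), here is a regex that uses the 'back to the future' technique: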

(?<=^|\s)'(?!\s)|(?!^)(?<!'(?=\s))\G.

regex101 demo

OK, it breaks down into two parts:

1. Matching the opening single quote

(?<=^|\s)'(?!\s)

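The rules I think should be established here are:

There should be either ^ or \s before the opening quote (hence (?<=^|\s)).
There is no \s after the opening quote (hence (?!\s)).

2. Matching the content inside the quotes and the closing quote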

(?!^)\G(?<!'(?=\s)).

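The rules I think should be established here are:

The character can be any character (hence .).
The match is 1 character long and immediately follows the previous match (hence (?!^)\G).
It should not be preceded by a single quote that is itself followed by a space (hence (?<!'(?=\s)); this is the 'back to the future' part). This effectively refuses to match a \s that is preceded by a ', which marks the end of the characters between single quotes. In other words, a closing quote is identified as a single quote followed by \s.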
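As a hedged illustration, the pattern can be dropped into preg_replace() like this. The ~ delimiters, the variable names, and the second input (borrowed from the 'Sally's dog' remark in the comments) are assumptions added for the example.

```php
<?php
// Part 1: opening quote | Part 2: any character contiguous to the previous match,
// unless the previous character was a quote followed by whitespace.
$pattern = "~(?<=^|\\s)'(?!\\s)|(?!^)(?<!'(?=\\s))\\G.~";

$masked = preg_replace($pattern, '-', "TEST 'replace''me' ok 'me too'");
echo $masked, "\n"; // TEST ------------- ok --------

// An apostrophe inside the quoted text is not followed by whitespace,
// so it does not end the masked run.
$apostrophe = preg_replace($pattern, '-', "'Sally's dog'");
echo $apostrophe, "\n"; // -------------
```

Unlike the quote-counting approaches, this one relies on whitespace around the quotes, so it handles an embedded apostrophe but would miss an opening quote that directly follows other text.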


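【Comments】:

Where is the Dolorean?
@CasimiretHippolyte Uh, I don't get the reference :s
After some searching: it's DeLorean (not Dolorean).
@CasimiretHippolyte It's in the back. The technique is called "back to the future" by an article that zx81 mentioned.
@Kobi Indeed, there has to be some way to decide what is a closing quote and what isn't, and this is the only way I could think of. Using backrefs doesn't work either, because captures aren't "carried over" from the previous match (if it were one continuous match, backrefs would be an option). Also, assuming 'Sally's dog' should be replaced, the regex above does so correctly. So it's a gray area IMO, and it definitely needs more specification.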