Regex: split a string on a character, except inside inner strings (answers)

[Question title]: Regex split string on a char with exception for inner-string
[Posted]: 2023-03-05 21:30:01
[Question description]:

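I have a string like aa | bb | "cc | dd" | 'ee | ff', and I'm looking for a way to split it to get all the values separated by the | character, except for any | that appears inside a quoted string.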

The idea is to get something like [aa, bb, "cc | dd", 'ee | ff']

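I have already found an answer to a similar question here: https://stackoverflow.com/a/11457952/11260467

But I can't find a way to adapt it to a case with multiple delimiters. Is there anyone here who is less dumb than me when it comes to regular expressions?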

[Question comments]:

What do you mean by multiple delimiters?
I mean that the string should not be split if the | is found between two ' or two "
And an idea with preg_split()

Tags: php regex string split

[Solution 1]:

This is easily done with the (*SKIP)(*FAIL) mechanism that PCRE provides:

(['"]).*?\1(*SKIP)(*FAIL)|\s*\|\s*

In PHP this could be:

<?php

$string = "aa | bb | \"cc | dd\" | 'ee | ff'";

$pattern = '~([\'"]).*?\1(*SKIP)(*FAIL)|\s*\|\s*~';

$splitted = preg_split($pattern, $string);
print_r($splitted);
?>

which would yield

Array
(
    [0] => aa
    [1] => bb
    [2] => "cc | dd"
    [3] => 'ee | ff'
)

See a demo on regex101.com and on ideone.com.
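The same idea can be wrapped in a small reusable function. This is only a sketch: the function name splitOutsideQuotes and its $delimiter parameter are hypothetical and not part of the answer, and it assumes quotes are never escaped inside a quoted substring.

```php
<?php
// Hypothetical helper (not from the answer): split $input on $delimiter,
// ignoring delimiters inside single- or double-quoted substrings.
function splitOutsideQuotes(string $input, string $delimiter = '|'): array
{
    // Escape the delimiter so characters such as | or . are matched literally.
    $delim = preg_quote($delimiter, '~');

    // Alternative 1 matches a quoted substring, then (*SKIP)(*FAIL) rejects it
    // and resumes scanning after it; alternative 2 matches the delimiter
    // together with any surrounding whitespace.
    $pattern = '~([\'"]).*?\1(*SKIP)(*FAIL)|\s*' . $delim . '\s*~';

    $parts = preg_split($pattern, $input);
    if ($parts === false) {
        throw new RuntimeException('preg_split failed: ' . preg_last_error_msg());
    }
    return $parts;
}

print_r(splitOutsideQuotes("aa | bb | \"cc | dd\" | 'ee | ff'"));
print_r(splitOutsideQuotes("a, 'b, c', d", ','));
```

Passing the delimiter through preg_quote() keeps the pattern valid when the separator is itself a regex metacharacter. preg_last_error_msg() requires PHP 8.0 or later.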

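[Comments]:

[Solution 2]:

This gets easier if you match the parts instead of splitting. Quantifiers are greedy by default, so they consume as many characters as possible, and alternatives are tried from left to right. This lets you put the more specific patterns for quoted strings before the pattern for unquoted tokens: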

$subject = '[ aa | bb | "cc | dd" | \'ee | ff\' ]';

$pattern = <<<'PATTERN'
(
    (?:[|[]|^) # after | or [ or string start
    \s*
    (?<token> # name the match
        "[^"]*" # string in double quotes
        |
        '[^']*'  # string in single quotes
        |
        [^\s|]+ # neither whitespace nor |
    )
    \s*
)x
PATTERN;

preg_match_all($pattern, $subject, $matches);
var_dump($matches['token']);

Output:

array(4) {
  [0]=>
  string(2) "aa"
  [1]=>
  string(2) "bb"
  [2]=>
  string(9) ""cc | dd""
  [3]=>
  string(9) "'ee | ff'"
}

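Hints:

<<<'PATTERN' is NOWDOC syntax (a heredoc with a single-quoted identifier); it cuts down on escaping
I use () as the pattern delimiters; they double as group 0
Named groups make the code more readable
The x modifier allows the pattern to be indented and commented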
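To illustrate the first hint, here is a minimal comparison of the same regex written as a single-quoted string and with the `<<<'PATTERN'` form (a single-quoted identifier, which the PHP manual calls nowdoc). The variable names and the shortened pattern are chosen for illustration only.

```php
<?php
// In a single-quoted string, every ' inside the regex must be escaped.
$quoted = '~"[^"]*"|\'[^\']*\'~';

// With a single-quoted identifier, the body is taken literally:
// no escaping and no variable interpolation.
$nowdoc = <<<'PATTERN'
~"[^"]*"|'[^']*'~
PATTERN;

var_dump($quoted === $nowdoc); // bool(true)
```

Both variables hold the identical pattern string. The second form simply stays readable as the number of quote characters grows.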

[Comments]:

[Solution 3]:

Use

$string = "aa | bb | \"cc | dd\" | 'ee | ff'";
preg_match_all("~(?|\"([^\"]*)\"|'([^']*)'|([^|'\"]+))(?:\s*\|\s*|\z)~", $string, $matches);
print_r(array_map(function($x) {return trim($x);}, $matches[1]));

See the PHP proof.

Result:

Array
(
    [0] => aa
    [1] => bb
    [2] => cc | dd
    [3] => ee | ff
)

Explanation

--------------------------------------------------------------------------------
  (?|                      Branch reset group, does not capture:
--------------------------------------------------------------------------------
    \"                       '"'
--------------------------------------------------------------------------------
    (                        group and capture to \1:
--------------------------------------------------------------------------------
      [^\"]*                   any character except: '\"' (0 or more
                               times (matching the most amount
                               possible))
--------------------------------------------------------------------------------
    )                        end of \1
--------------------------------------------------------------------------------
    \"                       '"'
--------------------------------------------------------------------------------
   |                        OR
--------------------------------------------------------------------------------
    '                        '\''
--------------------------------------------------------------------------------
    (                        group and capture to \1:
--------------------------------------------------------------------------------
      [^']*                    any character except: ''' (0 or more
                               times (matching the most amount
                               possible))
--------------------------------------------------------------------------------
    )                        end of \1
--------------------------------------------------------------------------------
    '                        '\''
--------------------------------------------------------------------------------
   |                        OR
--------------------------------------------------------------------------------
    (                        group and capture to \1:
--------------------------------------------------------------------------------
      [^|'\"]+                 any character except: '|', ''', '\"'
                               (1 or more times (matching the most
                               amount possible))
--------------------------------------------------------------------------------
    )                        end of \1
--------------------------------------------------------------------------------
  )                        end of grouping
--------------------------------------------------------------------------------
  (?:                      group, but do not capture:
--------------------------------------------------------------------------------
    \s*                      whitespace (\n, \r, \t, \f, and " ") (0
                             or more times (matching the most amount
                             possible))
--------------------------------------------------------------------------------
    \|                       '|'
--------------------------------------------------------------------------------
    \s*                      whitespace (\n, \r, \t, \f, and " ") (0
                             or more times (matching the most amount
                             possible))
--------------------------------------------------------------------------------
   |                        OR
--------------------------------------------------------------------------------
    \z                       the end of the string
--------------------------------------------------------------------------------
  )                        end of grouping
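The branch reset group is what lets the code read every token from $matches[1]. The following sketch contrasts it with an ordinary non-capturing group; the shortened patterns and variable names are illustrative assumptions, not taken from the answer.

```php
<?php
$subject = "'ee | ff'";

// With branch reset (?|...), each alternative numbers its group from 1 again,
// so whichever alternative matches, the content lands in group 1.
preg_match('~(?|"([^"]*)"|\'([^\']*)\')~', $subject, $withReset);

// With a plain non-capturing group (?:...), the two alternatives are groups 1 and 2.
preg_match('~(?:"([^"]*)"|\'([^\']*)\')~', $subject, $withoutReset);

var_dump($withReset[1]);    // string(7) "ee | ff"
var_dump($withoutReset[1]); // string(0) ""
var_dump($withoutReset[2]); // string(7) "ee | ff"
```

Without the reset, the caller would have to check several group indexes to find the one that actually participated in the match.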

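[Comments]:

Your approach is the way to go, IMO. But to avoid having to trim the results, I would change the third branch to that.

[Solution 4]:

Interestingly, there are many ways to construct a regex for this problem. Here is another one, similar to @Jan's answer.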

(['"]).*?\1\K| *\| *

PCRE Demo

(['"]) # match a single or double quote and save to capture group 1
.*?    # match zero or more characters lazily
\1     # match the content of capture group 1
\K     # reset the starting point of the reported match and discard
       # any previously-consumed characters from the reported match
|      # or
\ *    # match zero or more spaces
\|     # match a pipe character
\ *    # match zero or more spaces

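Note that the part before the alternation bar ("or") serves only to move the engine's internal string pointer to just past the closing quote of a quoted substring.

[Comments]: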
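The answer gives only the regex, so here is one possible way to use it from PHP. The ~ delimiters and the PREG_SPLIT_NO_EMPTY flag are assumptions added for this sketch: the \K branch produces zero-length matches, and without the flag preg_split() can leave empty strings in the result.

```php
<?php
$string = "aa | bb | \"cc | dd\" | 'ee | ff'";

// Pattern from the answer, wrapped in ~ delimiters.
$pattern = '~([\'"]).*?\1\K| *\| *~';

// Assumption: PREG_SPLIT_NO_EMPTY drops the empty pieces that the
// zero-length \K matches would otherwise introduce.
$parts = preg_split($pattern, $string, -1, PREG_SPLIT_NO_EMPTY);

print_r($parts);
```

Compared with (*SKIP)(*FAIL), this variant uses \K, which is also available in some other regex flavors, at the cost of needing the extra flag when splitting.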