Split string on commas, ignoring commas inside parentheses, brackets, braces, and quotes

【Question title】: Split string on commas ignoring commas, brackets, braces in parenthesis, quotes
【Posted】: 2018-04-14 21:01:11
【Question】:

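I am trying to split a comma-separated list. I want to use a regular expression that ignores commas inside parentheses, brackets, braces, and quotes. More precisely, I am trying to do this with PostgreSQL's POSIX regexp_split_to_array.

My knowledge of regular expressions is limited. By searching Stack Overflow I was able to get a partial solution: it splits the string correctly as long as the string contains no nested parentheses, brackets, or braces. Here is the regex: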

,(?![^()]*+\))(?![^{}]*+})(?![^\[\]]*+\])(?=(?:[^"]|"[^"]*")*$)

Test case:

0, (1,2), (1,2,(1,2)) [1,2,3,[1,2]], [1,2,3], "text, text (test)", {a1:1, a2:3, a3:{a1=1, s2=2}, a4:"asasad, sadsas, asasdasd"}

Here is the demo

The problem is that when parentheses are nested, as in (1,2,(1,2)), the first two commas are matched.
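The failure can be reproduced with a reduced version of the pattern. The sketch below is an illustration in Python, not the PostgreSQL call from the question: it keeps only the parenthesis lookahead and replaces the possessive *+ with a plain * (an assumption made so that it runs on Python's built-in re module).

```python
import re

# Reduced form of the question's pattern: only the parenthesis lookahead,
# with a plain * instead of the possessive *+ (assumption for portability).
PAREN_ONLY = re.compile(r",(?![^()]*\))")

text = "(1,2,(1,2))"

# Offsets of the commas that the pattern treats as split points.
split_points = [m.start() for m in PAREN_ONLY.finditer(text)]
print(split_points)  # [2, 4]
```

The lookahead only checks whether a ) can be reached without crossing another ( or ). For the first two commas it runs into the inner ( first, so it wrongly treats them as lying outside any parentheses.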

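【Comments】:

If it is possible at all, it will be quite difficult, and the resulting regular expression will probably not perform well. Write a function in PL/Perl or another procedural language that does the job.
Regular expressions are not the best tool for matching nested structures. If you still need one, though, look at Regular Expression Recursion or Matching Nested Constructs with Balancing Groups.

Tags: regex postgresql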

【Solution 1】:

Although a regular expression is not the best approach, here is a solution that uses recursive matching:

(?>(?>\([^()]*(?R)?[^()]*\))|(?>\[[^[\]]*(?R)?[^[\]]*\])|(?>{[^{}]*(?R)?[^{}]*})|(?>"[^"]*")|(?>[^(){}[\]", ]+))(?>[ ]*(?R))*

Broken down, it is a group with some content inside, followed by more matches of the same kind separated by optional spaces.

(?>               <---- start matching
   ...            <---- some stuff inside
)                 <---- end matching
(?>
   [ ]*           <---- optional spaces
   (?R)           <---- match the entire thing again
)*                <---- can be repeated

From your example 0, (1,2), (1,2,(1,2)) [1,2,3,[1,2]], [1,2,3],... we want to match:

0
(1,2)
(1,2,(1,2)) [1,2,3,[1,2]]
[1,2,3]
...

For the third match, the inner part matches (1,2,(1,2)) and [1,2,3,[1,2]], which are separated by a space.

The inner part is a series of alternatives:

(?>
   (?>...)|       <---- will match balanced ()
   (?>...)|       <---- will match balanced []
   (?>...)|       <---- will match balanced {}
   (?>...)|       <---- will match "..."
   (?>...)        <---- will match anything else without space or comma
)

Here are the alternatives:

\(                <---- literal (
  [^()]*          <---- any number of chars except ( or )
  (?R)?           <---- match the entire thing optionally
  [^()]*          <---- any number of chars except ( or )
\)                <---- literal )

\[                <---- literal [
  [^[\]]*         <---- any number of chars except [ or ]
  (?R)?           <---- match the entire thing optionally
  [^[\]]*         <---- any number of chars except [ or ]
\]                <---- literal ]

{                 <---- literal {
 [^{}]*           <---- any number of chars except { or }
 (?R)?            <---- match the entire thing optionally
 [^{}]*           <---- any number of chars except { or }
}                 <---- literal }

"                 <---- literal "
 [^"]*            <---- any number of chars except "
"                 <---- literal "

[^(){}[\]", ]+    <---- one or more chars except comma, or space, or these: (){}[]"
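The two non-recursive alternatives can be tried on their own. The following Python sketch is an illustration only: it uses the standard re module, which has no (?R), so the three bracket alternatives are left out, and the sample string is made up for the example.

```python
import re

# Quoted-string alternative and the "anything else" alternative from above.
# The [ inside the character class is escaped here to keep Python's re
# from reading it as the start of a nested set.
SIMPLE_ITEM = re.compile(r'"[^"]*"|[^(){}\[\]", ]+')

sample = '0, "text, text (test)", abc'  # hypothetical input without brackets
items = SIMPLE_ITEM.findall(sample)
print(items)  # ['0', '"text, text (test)"', 'abc']
```

The comma and the parentheses inside the quoted string are kept because the quoted alternative consumes everything up to the closing quote before the last alternative is ever tried.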

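Note that this does not match a comma-separated list, but the items in such a list. Excluding commas and spaces in the last alternative above makes the match stop at a comma or a space (except for the spaces we explicitly allow between repeated matches).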
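One of the comments suggests writing a function in a procedural language instead of using a regex. As a rough sketch of that idea, and not part of the regex above, the following Python function walks the string once, keeps the expected closing delimiters on a stack, and splits only on commas seen at depth zero. The name split_top_level is hypothetical, and the sketch assumes well-formed input without escaped quotes.

```python
from typing import List


def split_top_level(text: str) -> List[str]:
    """Split on commas that are outside (), [], {} and double quotes."""
    closers = {"(": ")", "[": "]", "{": "}"}
    stack = []          # expected closing delimiters, innermost last
    in_quotes = False
    items, start = [], 0
    for i, ch in enumerate(text):
        if in_quotes:
            # Inside quotes only the closing quote matters.
            if ch == '"':
                in_quotes = False
        elif ch == '"':
            in_quotes = True
        elif ch in closers:
            stack.append(closers[ch])
        elif stack and ch == stack[-1]:
            stack.pop()
        elif ch == "," and not stack:
            # Top-level comma: close the current item.
            items.append(text[start:i].strip())
            start = i + 1
    items.append(text[start:].strip())
    return items


sample = ('0, (1,2), (1,2,(1,2)) [1,2,3,[1,2]], [1,2,3], "text, text (test)", '
          '{a1:1, a2:3, a3:{a1=1, s2=2}, a4:"asasad, sadsas, asasdasd"}')
parts = split_top_level(sample)
for part in parts:
    print(part)
```

On the test case from the question this yields six items, with (1,2,(1,2)) [1,2,3,[1,2]] kept together as the third one, which is the same grouping the regex aims for.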

【Comments】: