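I agree with Matt Jacob's answer: you should parse CSV with Text::CSV unless you have a very good reason not to.
If you are going to use a regex to deal with it, I think you'd do better using m// than split. For example, this seems to cover most variants of single-line CSV data, though it does not remove the quotes around quoted fields as Text::CSV would; that requires a separate post-processing step.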
use strict;
use warnings;

sub splitter
{
    my($row) = @_;
    my @fields;
    my $i = 0;
    # Each match captures one field in $1 and consumes the comma after it.
    while ($row =~ m/((?=,)|[^",][^,]*|"([^"]|"")*")(?:,|$)/g)
    {
        print "Found [$1]\n";
        $fields[$i++] = $1;
    }
    # Diagnostic dump of the collected fields.
    for (my $j = 0; $j < @fields; $j++)
    {
        print "$j = [$fields[$j]]\n";
    }
}

my $row;
$row = q'ACC000121,2290,"01009900,01009901,01009902,01009903,01009904",4,5,6';
print "Row 1: $row\n";
splitter($row);
$row = q'ACC000121,",",2290,"01009900,""aux data"",01009902,01009903,01009904",,5"abc",6,""';
print "Row 2: $row\n";
splitter($row);
Obviously, there's a fair amount of diagnostic code in there. The output (from Perl 5.22.0 on Mac OS X 10.11.1) is:
Row 1: ACC000121,2290,"01009900,01009901,01009902,01009903,01009904",4,5,6
Found [ACC000121]
Found [2290]
Found ["01009900,01009901,01009902,01009903,01009904"]
Found [4]
Found [5]
Found [6]
0 = [ACC000121]
1 = [2290]
2 = ["01009900,01009901,01009902,01009903,01009904"]
3 = [4]
4 = [5]
5 = [6]
Row 2: ACC000121,",",2290,"01009900,""aux data"",01009902,01009903,01009904",,5"abc",6,""
Found [ACC000121]
Found [","]
Found [2290]
Found ["01009900,""aux data"",01009902,01009903,01009904"]
Found []
Found [5"abc"]
Found [6]
Found [""]
0 = [ACC000121]
1 = [","]
2 = [2290]
3 = ["01009900,""aux data"",01009902,01009903,01009904"]
4 = []
5 = [5"abc"]
6 = [6]
7 = [""]
In the Perl code, the match is:
m/((?=,)|[^",][^,]*|"([^"]|"")*")(?:,|$)/
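This looks for and captures (in $1) one of three things: an empty field followed by a comma; or a character that is neither a double quote nor a comma, followed by zero or more non-commas; or a double quote, followed by zero or more occurrences of "not a double quote, or two consecutive double quotes", followed by another double quote. It then requires either a comma or the end of the string.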
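To make the three alternatives easier to see, here is a sketch of the same pattern written with the /x modifier, one alternative per line. The names `$field_re` and `split_fields` are mine, purely for illustration, and I've swapped the inner capturing group for a non-capturing one since only $1 is used; otherwise it should behave like the match above.

```perl
use strict;
use warnings;

# The same pattern as in splitter(), laid out with /x.
my $field_re = qr{
    (                         # $1: the field, surrounding quotes included
        (?=,)                 #   empty field: nothing, and a comma comes next
      | [^",][^,]*            #   unquoted: starts with neither a quote nor a comma
      | "(?:[^"]|"")*"        #   quoted: a doubled quote is an embedded quote
    )
    (?:,|$)                   # then a comma, or the end of the row
}x;

# Hypothetical helper: return the raw fields of one single-line CSV row.
sub split_fields {
    my ($row) = @_;
    my @fields;
    push @fields, $1 while $row =~ /$field_re/g;
    return @fields;
}

my @fields = split_fields(q'ACC000121,",",2290,,5"abc",6,""');
print scalar(@fields), " fields: ", join(' | ', map { "[$_]" } @fields), "\n";
```

Like the original, this leaves the quotes in place on quoted fields.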
Handling multi-line fields takes more work. Removing the escaped double quotes takes more work too.
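For the quote removal, the post-processing step could look something like the sketch below. `unquote_field` is a hypothetical helper of mine, and it assumes each field has already been captured whole, with any surrounding quotes still attached, the way the regex above captures them.

```perl
use strict;
use warnings;

# Hypothetical helper: if the field is wrapped in double quotes, drop the
# outer pair and turn each doubled quote ("") into a single one.
# Fields that are not wrapped in quotes are returned unchanged.
sub unquote_field {
    my ($field) = @_;
    if ($field =~ /\A"(.*)"\z/s) {
        $field = $1;
        $field =~ s/""/"/g;
    }
    return $field;
}

# A few raw fields in the form the regex captures them.
my @raw = ('ACC000121', '","', '"01009900,""aux data"",01009902"', '', '5"abc"', '""');
print "[$_]\n" for map { unquote_field($_) } @raw;
```

Note that a field such as 5"abc" is left alone, because it does not start with a quote.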
Using Text::CSV is simpler and a lot less error-prone (and it can handle more variants than this).
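For comparison, a minimal Text::CSV sketch for a single line might look like this. It assumes Text::CSV is installed from CPAN (it is not a core module), which is why the script checks for it first; `parse_csv_line` is just an illustrative name, and I've only used the `new`, `parse`, `fields` and `error_diag` methods.

```perl
use strict;
use warnings;

# Text::CSV comes from CPAN, not core Perl, so check for it before using it.
my $have_text_csv = eval { require Text::CSV; 1 };

# Hypothetical helper: parse one CSV line and return its fields,
# with surrounding quotes removed and doubled quotes unescaped.
sub parse_csv_line {
    my ($line) = @_;
    my $csv = Text::CSV->new({ binary => 1 })
        or die "Cannot use Text::CSV: " . Text::CSV->error_diag;
    $csv->parse($line)
        or die "Parse failed: " . $csv->error_diag;
    return $csv->fields;
}

if ($have_text_csv) {
    my $row = q'ACC000121,2290,"01009900,01009901,01009902,01009903,01009904",4,5,6';
    my @fields = parse_csv_line($row);
    print "$_ = [$fields[$_]]\n" for 0 .. $#fields;
}
else {
    print "Text::CSV is not installed; install it from CPAN to run this sketch\n";
}
```

Here the third field comes back without its surrounding quotes, so no separate post-processing step is needed.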