How to parse a multiple-line, fixed-width file in Perl? Answers

【Question title】: How to parse multiple line, fixed-width file in perl?
【Posted】: 2011-12-16 00:48:35
【Question description】:

I have a file that I need to parse, in the following format. (All delimiters are spaces):

field name 1:            Multiple word value.
field name 2:            Multiple word value along
                         with multiple lines.
field name 3:            Another multiple word
                         and multiple line value.

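I know how to parse a single-line fixed-width file, but I don't know how to handle values that span multiple lines.

【Question discussion】:

Tags: perl parsing fixed-width

【Solution 1】: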

#!/usr/bin/env perl

use strict; use warnings;

my (%fields, $current_field);

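while (my $line = <DATA>) {
    next unless $line =~ /\S/;    # skip blank lines

    # Continuation line: starts with whitespace, so append to the current field
    if ($line =~ /^ \s+ ( \S .+ )/x) {
        if (defined $current_field) {
            $fields{ $current_field } .= " $1";    # join with a single space
        }
    }
    # New field: non-greedy label up to the first colon, then the value
    elsif ($line =~ /^(.+?) : \s+ (.+) \s+/x ) {
        $current_field = $1;
        $fields{ $current_field } = $2;
    }
}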

use Data::Dumper;
print Dumper \%fields;

__DATA__
field name 1:            Multiple word value.
field name 2:            Multiple word value along
                         with multiple lines.
field name 3:            Another multiple word
                         and multiple line value.

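【Discussion】:

Thanks! I changed the first instance of .+ to .+? to make the match non-greedy. That lets me handle values that contain a ":" character.
What if a multi-line value contains a colon?
@TLP See my fix. Also, if the file format specified that values can only start after a certain column, that would make the job easier.
@SinanÜnür Looks good. I still think unpack might be the better tool, though. He did say it is fixed width, so the columns should line up.
@TLP Thanks. The fields within a file may be fixed width, but the starting position of the value field may vary from file to file. In that case one could auto-detect where the value field starts, and so on, but I don't think it's worth it right now.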
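
For anyone who does want the auto-detection mentioned in the last comment, here is a minimal sketch, not something taken from either answer. It assumes the first labelled line in a file shows where values start, and that every other line in that file uses the same column. The helper names detect_value_column and parse_records are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: guess the 0-based column where values start by
# looking at the first "label:   value" line.
sub detect_value_column {
    my (@lines) = @_;
    for my $line (@lines) {
        # label up to a colon, padding, then stop right before the value
        if ($line =~ /^\S.*?:\s+(?=\S)/) {
            return $+[0];    # offset just past the padding
        }
    }
    return;    # no labelled line found
}

# Hypothetical helper: build an unpack template from the detected column
# and collect [label, value] pairs in file order.
sub parse_records {
    my (@lines) = @_;
    my $col = detect_value_column(@lines);
    return [] unless defined $col;
    my $template = "A${col}A*";
    my @records;
    for my $line (@lines) {
        next unless $line =~ /\S/;             # skip blank lines
        my ($field, $text) = unpack $template, $line;
        if (length $field) {                   # labelled line starts a record
            $field =~ s/:$//;
            push @records, [ $field, $text ];
        }
        elsif (@records) {                     # continuation of the last record
            $records[-1][1] .= " $text";
        }
    }
    return \@records;
}

my @lines = map { $_ . "\n" } (
    sprintf("%-20s%s", "name:",  "first value"),
    sprintf("%-20s%s", "",       "continued"),
    sprintf("%-20s%s", "other:", "x: y"),
);
print "$_->[0] => $_->[1]\n" for @{ parse_records(@lines) };
```

Because the label and the value are cut by column rather than by pattern, a colon inside a value (as in the last sample line) needs no special handling.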

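【Solution 2】:

Fixed width says unpack to me. You can parse this with regexes and split, but unpack should be the safer choice, since it is the right tool for fixed-width data.

I set the width of the first field to 12 and the empty space in between to 13, which works for this data. You may need to change that. The template "A12A13A*" means "take 12 ASCII characters, then 13, then ASCII characters of any length". unpack returns a list of those pieces, and the A format also strips trailing whitespace from each one. Also, unpack uses $_ if no string is supplied, which is what we do here.

Note that if the first field is not fixed width up to the colon, as it seems to be in your sample data, you will need to merge the fields in the template, e.g. "A25A*", and then strip the colon.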
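
As a rough illustration of the merged template, the sketch below unpacks one line with "A25A*" and then removes the colon. The helper name split_line is hypothetical, and the width 25 is only what the sample data happens to use.

```perl
use strict;
use warnings;

# Hypothetical helper: unpack one line with a merged label-plus-padding column.
# The width 25 fits the sample data only; other files may need another value.
sub split_line {
    my ($line) = @_;
    my ($field, $text) = unpack "A25A*", $line;
    $field =~ s/:$//;    # the colon is now part of the first column
    return ($field, $text);
}

# Build a sample line with the label padded to 25 characters.
my $sample = sprintf("%-25s%s\n", "field name 1:", "Multiple word value.");
my ($field, $text) = split_line($sample);
print "[$field] [$text]\n";    # prints "[field name 1] [Multiple word value.]"
```

A continuation line has only spaces in the first 25 columns, so it comes back with an empty label, which is how the loop can tell it apart from a new field.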

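I chose an array for storage because I don't know whether your field names are unique. A hash would overwrite fields that share a name. Another benefit of an array is that it preserves the order in which the data appears in the file. If those things don't matter and fast lookup is the priority, use a hash instead.

Code: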

use strict;
use warnings;
use Data::Dumper;

my $last_text;
my @array;
while (<DATA>) {
    next unless /\S/;   # skip blank lines
    # unpack the fields; the A format also strips trailing whitespace
    my ($field, undef, $text) = unpack "A12A13A*";
    if (length $field) {   # If $field is empty, that means we have a multi-line value
        $field =~ s/:$//;                 # strip the colon, if the width includes it
        $last_text = [ $field, $text ];   # store data in anonymous array
        push @array, $last_text;          # and store that array in @array
    } else {        # multi-line values get added to the previous line's data
        $last_text->[1] .= " $text";
    }
}

print Dumper \@array;

__DATA__
field name 1:            Multiple word value.
field name 2:            Multiple word value along
                         with multiple lines.
field name 3:            Another multiple word
                         and multiple line value
                         with a third line

Output:

$VAR1 = [
          [
            'field name 1',
            'Multiple word value.'
          ],
          [
            'field name 2',
            'Multiple word value along with multiple lines.'
          ],
          [
            'field name 3',
            'Another multiple word and multiple line value with a third line'
          ]
        ];
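
If the field names are known to be unique and lookup by name matters more than order, the same loop could fill a hash instead. This is only a sketch under that assumption: the sub name parse_to_hash is made up, and a repeated field name would silently replace the earlier value.

```perl
use strict;
use warnings;

# Sketch: the same unpack loop, keyed by field name.
# Assumes unique field names; file order is not preserved by a hash.
sub parse_to_hash {
    my (@lines) = @_;
    my (%fields, $current);
    for my $line (@lines) {
        next unless $line =~ /\S/;             # skip blank lines
        my ($field, undef, $text) = unpack "A12A13A*", $line;
        if (length $field) {                   # labelled line starts a field
            $field =~ s/:$//;
            $current = $field;
            $fields{$current} = $text;
        }
        elsif (defined $current) {             # continuation line
            $fields{$current} .= " $text";
        }
    }
    return \%fields;
}

my $fields = parse_to_hash(
    sprintf("%-25s%s\n", "field name 2:", "Multiple word value along"),
    sprintf("%-25s%s\n", "",              "with multiple lines."),
);
print $fields->{"field name 2"}, "\n";
# prints "Multiple word value along with multiple lines."
```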

【Discussion】:

【Solution 3】:

You can do it like this:

#!/usr/bin/perl

use strict;
use warnings;

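my @fields;
open(my $fh, "<", "multi.txt") or die "Unable to open file: $!\n";

while (<$fh>) {                  # read line by line instead of loading the whole file
    if (/^\s/ && @fields) {      # continuation line: append to the previous record
        $fields[-1] .= $_;
    } else {                     # otherwise start a new record
        push @fields, $_;
    }
}

close $fh;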

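If a line starts with whitespace, it is appended to the last element of @fields; otherwise it is pushed onto the end of the array.

Alternatively, slurp the whole file and split it using look-arounds: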

#!/usr/bin/perl

use strict;
use warnings;

local $/;    # slurp mode: the next read returns the whole file as one string

open(my $fh, "<", "multi.txt") or die "Unable to open file: $!\n";

my @fields = split/(?<=\n)(?!\s)/, <$fh>;

close $fh;

That is not the recommended approach, though.
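
Either variant leaves each element of @fields as raw text, with the label, the padding, and the line breaks still in it. A possible follow-up step, sketched here with a hypothetical helper named record_to_pair, splits a record at its first colon and collapses the remaining whitespace. It assumes labels themselves never contain a colon.

```perl
use strict;
use warnings;

# Hypothetical helper: turn one raw record (label line plus continuation
# lines, newlines included) into a (label, value) pair.
sub record_to_pair {
    my ($record) = @_;
    # split on the first colon only, so colons inside the value survive
    my ($label, $value) = split /:/, $record, 2;
    $_ //= '' for $label, $value;         # tolerate records without a colon
    s/^\s+|\s+$//g for $label, $value;    # trim both ends
    $value =~ s/\s+/ /g;                  # collapse line breaks and padding
    return ($label, $value);
}

my ($label, $value) = record_to_pair(
    "field name 2:            Multiple word value along\n"
  . "                         with multiple lines.\n"
);
print "$label => $value\n";
# prints "field name 2 => Multiple word value along with multiple lines."
```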

【Discussion】:

【Solution 4】:

You can change the input record separator:

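$/ = "\nfield name";    # each read now stops right after the next "\nfield name"

# FILE is assumed to be a filehandle that has already been opened
while (my $line = <FILE>) {
    chomp $line;        # chomp removes the trailing "\nfield name" separator

    # the field number, the colon, then the value (/s lets . match newlines)
    if ($line =~ /(\d+):\s+(.+)/s) {
        my ($number, $value) = ($1, $2);
        $value =~ s/\s+$//;      # drop the trailing newline of the last record
        $value =~ s/\s+/ /g;     # join continuation lines with single spaces
        print "Record $number is $value\n";
    }
}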

【Discussion】: