How can I extract/parse tabular data from a text file in Perl? Answers

【Question title】: How can I extract/parse tabular data from a text file in Perl?
【Posted】: 2011-04-25 04:39:12
【Question】:

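I'm looking for something like HTML::TableExtract, but for plain-text input rather than HTML: text that contains "tables" laid out with indentation and spacing.

The data might look like this: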

Here is some header text.

Column One       Column Two      Column Three
a                                           b
a                    b                      c


Some more text

Another Table     Another Column
abdbdbdb          aaaa

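【Comments】:

I posted a solution, but it produces six columns. Are you assuming that the column separator must be more than one space?
No, but we can assume that I know the column header strings and that the column data is correctly aligned under the headers.

Tags: perl parsing text-parsing data-extraction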

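【Solution 1】:

Here is a very quick solution, commented with an overview. (My apologies for the length.) Basically, if a "word" starts after the start of column header n, it ends up in column n, unless most of its body trails into column n + 1, in which case it ends up there instead. Tidying it up, extending it to support multiple different tables, and so on are left as an exercise. You could also use something other than the left offset of the column header as the boundary mark, such as its center, or some value determined by the column number.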

#!/usr/bin/perl


use warnings;
use strict;


# Just plug your headers in here...
my @headers = ('Column One', 'Column Two', 'Column Three');

# ...and get your results as an array of arrays of strings.
my @result = ();


# Quote each header so regex metacharacters in it match literally.
my $all_headers = '(' . join(').*(', map { quotemeta($_) } @headers) . ')';
my $found = 0;
my @header_positions;
my $line = '';
my $row = 0;
push @result, [] for (1 .. @headers);


# Get lines from file until a line matching the headers is found.

while (defined($line = <DATA>)) {

    # Get the positions of each header within that line.

    if ($line =~ /$all_headers/) {
        @header_positions = @-[1 .. @headers];
        $found = 1;
        last;
    }

}


$found or die "Table not found! :<\n";


# For each subsequent nonblank line:

while (defined($line = <DATA>)) {
    last if $line =~ /^$/;

    push @{$_}, "" for (@result);
    ++$row;

    # For each word in line:

    while ($line =~ /(\S+)/g) {

        my $word = $1;
        my $position = $-[1];
        my $length = $+[1] - $position;
        my $column = -1;

        # Get column in which word starts.

        while ($column < $#headers &&
            $position >= $header_positions[$column + 1]) {
            ++$column;
        }

        # If word is not fully within that column,
        # and more of it is in the next one, put it in the next one.

        if (!($column == $#headers ||
            $position + $length < $header_positions[$column + 1]) &&
            $header_positions[$column + 1] - $position <
            $position + $length - $header_positions[$column + 1]) {

            my $element = \$result[$column + 1]->[$row];
            $$element .= " $word";

        # Otherwise, put it in the one it started in.

        } else {

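            # A word that starts left of the first header has $column == -1;
            # keep it in the first column instead of wrapping to the last one.
            my $element = \$result[$column < 0 ? 0 : $column]->[$row];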
            $$element .= " $word";

        }

    }

}


# Output! Eight-column tabs work best for this demonstration. :P

foreach my $i (0 .. $#headers) {
    print $headers[$i] . ": ";
    foreach my $c (@{$result[$i]}) {
        print "$c\t";
    }
    print "\n";
}


__DATA__

This line ought to be ignored.

Column One       Column Two      Column Three
These lines are part of the tabular data to be processed.
The data are split based on how much words overlap columns.

This line ought to be ignored also.

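Sample output:

Column One:      These lines are         The data are split
Column Two:      part of the tabular     based on how
Column Three:    data to be processed.   much words overlap columns.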
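
If, as the asker says in the comments, the header strings are known and every cell sits under its header, a stricter variant is to cut each row at the header offsets instead of placing words one by one. The following is only a minimal sketch under that assumption: `slice_by_headers` is a hypothetical helper name, the rows are built with `sprintf` so the offsets (0, 17 and 33) are explicit, and unlike the script above it simply cuts a cell that spills past the next header.

```perl
use strict;
use warnings;

# Hypothetical helper: cut each data row at the offsets where the known
# headers start. Assumes no cell crosses into the next header's offset.
sub slice_by_headers {
    my ($header_line, $headers, @rows) = @_;

    # Locate each header, searching to the right of the previous one.
    my @starts;
    my $from = 0;
    for my $h (@$headers) {
        my $pos = index($header_line, $h, $from);
        die "Header '$h' not found\n" if $pos < 0;
        push @starts, $pos;
        $from = $pos + length($h);
    }

    my @table;
    for my $row (@rows) {
        my @cells;
        for my $i (0 .. $#starts) {
            # The last column runs to the end of the line.
            my $end  = $i < $#starts ? $starts[$i + 1] : length($row);
            my $cell = $starts[$i] < length($row)
                ? substr($row, $starts[$i], $end - $starts[$i])
                : '';                      # row too short: empty cell
            $cell =~ s/^\s+|\s+$//g;       # trim the padding
            push @cells, $cell;
        }
        push @table, \@cells;
    }
    return @table;
}

my @table = slice_by_headers(
    sprintf('%-17s%-16s%s', 'Column One', 'Column Two', 'Column Three'),
    ['Column One', 'Column Two', 'Column Three'],
    sprintf('%-17s%-16s%s', 'a', 'b', 'c'),
    sprintf('%-17s%-16s%s', 'a', '',  'c'),
);
print join('|', @$_), "\n" for @table;
```

Each row comes back as an array reference of trimmed cells, so an empty middle cell stays an empty string rather than shifting the later cells left.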

【Discussion】:

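【Solution 2】:

I'm not aware of any packaged solution, but something not very flexible is fairly simple to do, assuming you can make two passes over the file (what follows is a Perl sketch):

Assumption: the data may contain spaces, and a value containing a space is NOT quoted the way it would be in CSV. If that isn't the case, just use Text::CSV(_XS).
Assumption: no tabs are used for formatting.
The logic defines a "column separator" as any consecutive set of vertical character columns that are 100% filled with spaces.
If by accident every row has a space at offset M that is part of the data, the logic will treat offset M as a column separator, since it cannot know any better. The ONLY way it can know better is if you require column separation to be at least X spaces, where X>1. See the second code fragment for that.

Sample code: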

use strict;
use warnings;

my $file = shift @ARGV;      # Input file path; taking it from the command line is an assumption
defined $file or die "Usage: $0 FILE\n";

my $INFER_FROM_N_LINES = 10; # Infer columns from this # of lines
                             # 0 means from entire file
my $lines_scanned = 0;
my @non_spaces = ();         # $non_spaces[$i] is true if some scanned line has a non-space at offset $i
# First pass - find which character columns in the file have all spaces and which don't
open(my $fh, '<', $file) or die "Cannot open $file: $!\n";
while (my $line = <$fh>) {
    last if $INFER_FROM_N_LINES && $lines_scanned++ == $INFER_FROM_N_LINES;
    chomp $line;
    my @chars = split(//, $line);
    for my $i (0 .. $#chars) {
        $non_spaces[$i] = 1 if $chars[$i] ne " ";
    }
}
close $fh or die;

# Find columns, defined as consecutive "non-spaces" slices.
my (@starts, @ends); # Index at which columns start and end
my $state = " "; # Not inside a column
for (my $i = 0; $i < @non_spaces; $i++) {
    next if $state eq " " && !$non_spaces[$i];
    next if $state eq "c" && $non_spaces[$i];
    if ($state eq " ") { # && $non_spaces[$i] of course => start column
        $state = "c";
        push @starts, $i;
    } else { # meaning $state eq "c" && !$non_spaces[$i] => end column
        $state = " ";
        push @ends, $i-1;
    }
}
if ($state eq "c") { # Last char is NOT a space - produce the last column end
    push @ends, $#non_spaces;
}

# Second pass - now split lines
open($fh, '<', $file) or die "Cannot open $file: $!\n";
my @rows = ();
while (my $line = <$fh>) {
    chomp $line;
    my @columns = ();
    for my $col_num (0 .. $#starts) {
        my $width = $ends[$col_num] - $starts[$col_num] + 1;
        # A line that ends before this column starts yields an empty cell
        $columns[$col_num] = $starts[$col_num] < length($line)
            ? substr($line, $starts[$col_num], $width)
            : "";
    }
    push @rows, \@columns;
}
close $fh or die;
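
The first pass wonders whether the per-character loop could be written more neatly. One possible sketch, not part of the original answer, builds a per-line occupancy mask with `tr` and combines the masks with Perl's string bitwise OR, then reads the columns off the mask with a regex. `occupied_runs` is a hypothetical helper name, and the code assumes the classic string behaviour of `|` (that is, `use feature 'bitwise'` is not in effect).

```perl
use strict;
use warnings;

# Hypothetical helper: return [start, end] pairs for every run of character
# offsets that hold a non-space in at least one of the given lines.
sub occupied_runs {
    my @lines = @_;
    my $mask = '';
    for my $line (@lines) {
        my $m = $line;
        chomp $m;
        $m =~ tr/ /\x01/c;   # every non-space becomes \x01
        $m =~ tr/ /\x00/;    # every space becomes \x00
        $mask |= $m;         # string OR, offset by offset
    }
    my @runs;
    while ($mask =~ /\x01+/g) {
        push @runs, [ $-[0], $+[0] - 1 ];   # match start and inclusive end
    }
    return @runs;
}

# The question's first table, rebuilt with explicit column offsets.
my @lines = (
    sprintf('%-17s%-16s%s', 'Column One', 'Column Two', 'Column Three'),
    sprintf('%-17s%-16s%s', 'a', 'b', 'c'),
);
my @runs = occupied_runs(@lines);
print scalar(@runs), " columns: ",
    join(' ', map { "$_->[0]-$_->[1]" } @runs), "\n";
```

On this input the sketch reports six runs, because the single space inside each two-word header is blank on every line. That is the six-column result mentioned in the comments on the question.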

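Now, if you require column separation to be at least X spaces where X>1, that is also doable, but the parser of column locations needs to be a bit more complex: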

# Find columns, defined as consecutive "non-spaces" slices separated by at least 3 spaces.
my $min_col_separator_is_X_spaces = 3;
my (@starts, @ends); # Index at which columns start and end
my $state = "S"; # inside a separator
NEXT_CHAR: for (my $i = 0; $i < @non_spaces; $i++) {
    if ($state eq "S") { # done with last column, inside a separator
        if ($non_spaces[$i]) { # start a new column
            $state = "c";
            push @starts, $i;
        }
        next;
    }
    # From here on $state eq "c": processing a column
    next if $non_spaces[$i];
    # First space after non-space.
    # Could be beginning of separator? check the next X-1 chars!
    for (my $j = $i+1; $j < @non_spaces
                    && $j < $i+$min_col_separator_is_X_spaces; $j++) {
        if ($non_spaces[$j]) {
            $i = $j; # Gap too narrow, still the same column; no need to re-scan
            next NEXT_CHAR; # OUTER loop
        }
    }
    # If we reach here, X chars starting at $i are spaces! Column ended!
    push @ends, $i-1;
    $state = "S";
    $i += $min_col_separator_is_X_spaces - 1; # Skip the rest of the separator
}
if ($state eq "c") { # Last char is NOT a space - produce the last column end
    push @ends, $#non_spaces;
}
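
Another way to look at the "at least X spaces" rule is as a merge step: take the plain runs of occupied offsets and join neighbours whose gap is narrower than X. The sketch below is a different formulation from the state machine above and is offered only as an illustration. `merge_runs` is a hypothetical helper name, and the input runs are written out by hand for the question's first table, assuming its headers start at offsets 0, 17 and 33.

```perl
use strict;
use warnings;

# Hypothetical helper: merge neighbouring [start, end] runs whose gap is
# narrower than $min_gap blank offsets.
sub merge_runs {
    my ($min_gap, @runs) = @_;
    my @columns;
    for my $run (@runs) {
        if (@columns && $run->[0] - $columns[-1][1] - 1 < $min_gap) {
            $columns[-1][1] = $run->[1];   # gap too narrow: extend the column
        } else {
            push @columns, [@$run];        # wide gap (or first run): new column
        }
    }
    return @columns;
}

# Occupied runs for "Column One", "Column Two", "Column Three" when each
# header word is treated as its own run (assumed offsets, see above).
my @runs    = ([0, 5], [7, 9], [17, 22], [24, 26], [33, 38], [40, 44]);
my @columns = merge_runs(3, @runs);
print join(' ', map { "$_->[0]-$_->[1]" } @columns), "\n";
```

With a minimum separator of 3, the one-space gaps inside the headers are absorbed and three columns remain, while the wider gaps between headers still split.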

【Discussion】: