Retrieve lines between patterns using perl: answers

【Question title】: Retrieve lines between patterns using perl
【Posted】: 2015-12-09 23:06:49
【Question】:

I have a file containing a list like the following:

ID: ID_A
attr1: attribute
attr2: name
attr3: city


ID: ID_B
attr1: attribute2
attr2: name2
attr3: city3
attr4: country

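The file contains about 60,000 such entries. The unique identifier is always on the ID line. Once I find a new ID, I need to be able to retrieve all of the attributes for that ID.

I am trying to do the following: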

if ($line =~ /ID/ .. /ID/)
{
    $job[0]=$line
}

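But this doesn't work, and I would also have to create an array of exactly the right size each time. Any hints on how to proceed would be a great help.

Thanks. JS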

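【Comments on the question】:

What is the expected output?
Not just the output: once you have separated this data, how do you intend to use it?
Are the entries always separated by a blank line?
$/ is your friend.

Tags: perl pattern-matching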

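【Solution 1】:

This is much easier if you use $/, the input record separator, and set it to "\n\n".

But as Dave Cross points out in the comments, setting it to '' is probably better: perl then treats a run of several blank lines as a single separator, and otherwise you get the same result.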

#!/usr/bin/perl
use strict;
use warnings;

use Data::Dumper;

#set record separator to (one or more) blank lines
local $/ = '';

#iterate each chunk of data 
while ( <DATA> ) {
    #g matches repeatedly, and so this'll get alternating values
    #this conveniently is what you need to assign straight to a hash 
    my %record = m/(\w+): (.*)/g; 
    print Dumper \%record;
}

__DATA__
ID: ID_A
attr1: attribute
attr2: name
attr3: city

ID: ID_B
attr1: attribute2
attr2: name2
attr3: city3
attr4: country

Once you have extracted the records and fields, you can push them onto an array of records:

push ( @all_records, \%record );

Which gives:

$VAR1 = [
          {
            'attr2' => 'name',
            'ID' => 'ID_A',
            'attr1' => 'attribute',
            'attr3' => 'city'
          },
          {
            'attr2' => 'name2',
            'ID' => 'ID_B',
            'attr4' => 'country',
            'attr1' => 'attribute2',
            'attr3' => 'city3'
          }
        ];

Or put them into a hash, keyed on the ID:

$all_records{$record{ID}} = \%record;

Which gives:

$VAR1 = {
          'ID_A' => {
                      'ID' => 'ID_A',
                      'attr3' => 'city',
                      'attr1' => 'attribute',
                      'attr2' => 'name'
                    },
          'ID_B' => {
                      'attr2' => 'name2',
                      'attr3' => 'city3',
                      'attr1' => 'attribute2',
                      'attr4' => 'country',
                      'ID' => 'ID_B'
                    }
        };

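Which one you want depends on what you do with the records. If you are just processing and discarding them, you may not need to "keep" them at all. And if you have duplicate IDs, you probably don't want the hash approach, because the IDs must be unique for it to work.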
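If duplicate IDs are a possibility, one option is to keep a list of records per ID rather than a single record. The following is only a minimal sketch under that assumption: the sample input (with ID_A repeated on purpose) and the name %records_by_id are made up for illustration, and the data is read from an in-memory string rather than from your real file.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample input; ID_A appears twice on purpose.
my $input = <<'END';
ID: ID_A
attr1: attribute
attr2: name

ID: ID_B
attr1: attribute2

ID: ID_A
attr1: other
END

my %records_by_id;    # ID => array ref of record hashes

{
    # Paragraph mode: each read returns one blank-line-separated chunk.
    local $/ = '';
    open my $fh, '<', \$input or die "Cannot open in-memory file: $!";
    while ( my $chunk = <$fh> ) {
        # Same key/value trick as above, anchored to whole lines with /m.
        my %record = $chunk =~ m/^(\w+): (.*)$/mg;
        next unless defined $record{ID};    # skip chunks without an ID line
        push @{ $records_by_id{ $record{ID} } }, \%record;
    }
    close $fh;
}

# Report how many records were seen for each ID.
for my $id ( sort keys %records_by_id ) {
    my $count = @{ $records_by_id{$id} };
    print "$id: $count record(s)\n";
}
```

With this layout nothing is overwritten when an ID repeats; any ID whose list holds more than one record is a duplicate you can then decide how to handle.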

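【Comments】:

Setting $/ to the empty string will have the same effect. It will also work if (for some reason) there are multiple blank lines between records.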
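To see the difference between the two settings, here is a small sketch that counts the chunks read from the same text with each separator. The input string (three blank lines between two records) and the helper count_chunks are hypothetical and exist only for this illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical input with three blank lines between the two records.
my $input = "ID: ID_A\nattr1: attribute\n\n\n\nID: ID_B\nattr1: attribute2\n";

# Count the chunks read from $text with the given record separator.
sub count_chunks {
    my ( $separator, $text ) = @_;
    local $/ = $separator;
    open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
    my @chunks = <$fh>;    # list context: read every record at once
    close $fh;
    return scalar @chunks;
}

printf "separator \"\\n\\n\": %d chunks\n", count_chunks( "\n\n", $input );
printf "separator '':     %d chunks\n", count_chunks( '',     $input );
```

With "\n\n" the extra blank lines produce a chunk that contains nothing but newlines, so the loop has to skip it; with '' the whole run of blank lines counts as one separator and only the two real records come back.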

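【Solution 2】:

I would create a hash-of-hashes (since you don't know in advance which attributes you may encounter in the file). The keys of the main hash are the IDs, and the value of each entry is another sub-hash. That sub-hash uses the attribute names as its keys.

This isn't idiomatic perl at all, but it worked in my tests...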

#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
my %master;
my %tmphash;
my $oldid="";
my $id;

# Create a hash-of-hashes
while (<>) {
  if (/^ID: (.*)/) {
    $id=$1;
    # We need to skip the first one to "prime the pump"
    if ($oldid ne "") {
      $master{$oldid}={%tmphash};
    }
    $oldid=$id;
    %tmphash=();
  } else {
    # Until we get to the next ID: add anything we find to tmphash
    if (/^(.*): (.*)/) {
      $tmphash{$1}=$2;
    }
  }
}
# Don't forget the last one (skipped if the input had no ID lines at all)
$master{$oldid}={%tmphash} if $oldid ne "";

print Dumper(\%master);

foreach my $id (sort keys %master) {
    foreach my $attr (keys %{ $master{$id} }) {
        print "$id, $attr: $master{$id}{$attr}\n";
    }
}
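Once the hash-of-hashes exists, getting all attributes for one ID is a plain lookup. The sketch below assumes the structure described above but fills it in by hand instead of parsing a file; the helper name attributes_for is made up for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical hash-of-hashes in the shape described above,
# filled in by hand here instead of being parsed from a file.
my %master = (
    ID_A => { attr1 => 'attribute',  attr2 => 'name',  attr3 => 'city' },
    ID_B => { attr1 => 'attribute2', attr2 => 'name2', attr3 => 'city3', attr4 => 'country' },
);

# Return the attributes of one ID as sorted "name=value" strings,
# or an empty list (undef in scalar context) if the ID is unknown.
sub attributes_for {
    my ( $records, $id ) = @_;
    return unless exists $records->{$id};
    my $attrs = $records->{$id};
    return map { "$_=$attrs->{$_}" } sort keys %$attrs;
}

print join( ', ', attributes_for( \%master, 'ID_B' ) ), "\n";
print "ID_C not found\n" unless attributes_for( \%master, 'ID_C' );
```

Checking with exists first means that asking for an unknown ID does not quietly create an empty entry in the main hash.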

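【Comments】:

【Solution 3】:

Without knowing your expected output format or how you intend to use the data, it's hard to give a decent answer, but this should get you 90% of the way there: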

use strict;
use warnings;

my %data;
my $id;

while (<DATA>) {
    chomp;
    next unless /\S/;
    # Limit of 2 keeps a value that itself contains ':' in one piece
    my ($key, $value) = split(/\s*:\s*/, $_, 2);

    if ($key eq 'ID') {
        $id = $value;
        next;
    }

    $data{$id}{$key} = $value;
}

print "$data{ID_B}{attr2}\n";  # prints name2

__DATA__
ID: ID_A
attr1: attribute
attr2: name
attr3: city

ID: ID_B
attr1: attribute2
attr2: name2
attr3: city3
attr4: country
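One thing to watch when splitting on the colon: if a value can itself contain a colon, a split without an explicit limit keeps only the part before it. A small sketch of the difference follows; the attr5 line and its timestamp-like value are hypothetical, since the sample data has no such values.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical line whose value contains colons itself.
my $line = 'attr5: 23:06:49';

# No explicit limit: the value is cut at its first colon
# and only the piece before that colon lands in $value_a.
my ( $key_a, $value_a ) = split /\s*:\s*/, $line;

# Limit of 2: split stops after the first separator, so the value stays whole.
my ( $key_b, $value_b ) = split /\s*:\s*/, $line, 2;

print "without limit: $key_a => $value_a\n";
print "with limit 2:  $key_b => $value_b\n";
```

If your attribute values never contain a colon, both forms behave the same.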

【Comments】: