【问题标题】:Perl - sorting complex data structurePerl - 对复杂数据结构进行排序
【发布时间】:2020-06-24 23:54:22
【问题描述】:

我有一个旧的 perl 项目,一个事件日志的文本解析器,并收到了一个按事件 ID 对输出进行排序并删除重复事件的请求。因此解析器读取一个文本文件并将每个事件放入一个数组中。数组中的每个字段都包含一个带有多个键 -> 值对的散列。一个键称为序列,它包含事件的编号。我现在想根据每个数组字段的序列值对数组进行排序。其次,我想从数组中删除重复的相同序列号。

这里有一些我如何创建数组和散列的代码,以便您了解数据结构:

open (my $mel, "<", $in_filename) or die "\nFile '$in_filename' does not exist or is not readable.\n";

my $i=0;
my $eventcount = 0;

while (<$mel>) {

        # Separate events by "Date/Time" :
        if (/^$/) {
            next;
        }
        if (/^Date\/Time:\s(.*)$/) {
            if ($eventcount >0) {
                $i++;
            }
            $eventcount++; # eventcount initialized with ‘0’
        }

        # Gathering information of the MEL event :
        if (/^Date\/Time:\s(.*)$/) {$MEL[$i]{date} = $1; next;}
        if (/^Sequence number:\s(\d+)$/) {$MEL[$i]{sequence} = $1; next;}
        if (/^Event type:\s([0-9|a-f|A-F]{1,6})$/) {$MEL[$i]{type} = lc $1; next;}
        if (/^Event category:\s(\w+)$/) {$MEL[$i]{category} = $1; next;}
        if (/^Priority:\s(\w+)/) {$MEL[$i]{priority} = $1; next;}
        if (/^Description:\s(.*)$/) {$MEL[$i]{description} = $1; next;}
        if (/^Event specific codes:\s(.*)$/) {$MEL[$i]{code} = $1; next;}
        if (/^Component location:\s(.*)$/) {$MEL[$i]{location} = $1; next;}
        if (/^Logged by:\s.*(.)$/) {$MEL[$i]{logged_by} = $1; next;}
        if (/^4[dD]\s45\s4[cC]\s48\s(\d\d)/) {$MEL[$i]{version} = hex $1;}

}

文本文件中的事件示例:

Date/Time: 2/3/20, 12:18:20 PM
Sequence number: 200 <==============
Event type: 5023
Event category: Command
Priority: Informational
Event needs attention: false
Event send alert: false
Event visibility: true
Description: Controller return status/function call for requested operation
Event specific codes: b8/1/0
Component type: Controller
Component location: Shelf 99, Bay A
Logged by: Controller in bay A

所以基本上我想根据散列中键的值对包含对散列的引用的数组进行排序。

其次,我想从数组中删除一个字段,此时键的值也存在于不同的数组字段中。

我希望有人能理解我的需要:-)

这可能吗?

【问题讨论】:

  • 哪个字段是“事件ID”?

标签: arrays perl sorting hash unique


【解决方案1】:

您可以使用自定义排序块对数组进行排序:

my @sorted = sort { $a->{sequence} <=> $b->{sequence} } @MEL;

但是使用散列的散列而不是散列的数组要容易得多。

#!/usr/bin/perl
use warnings;
use strict;
use feature qw{ say };

my $in_filename = ... ;
open my $mel, '<', $in_filename or die $!;

my %event;

my ($current, $id);
while (<$mel>) {

    next if /^$/;

    if (m{^Date/Time:\s(.*)$}) {
        if (defined $id) {
            $event{$id} = $current;
        }
        $current = { date => $1 };
    } elsif (/^Sequence number:\s(\d+)$/) {
        $id = $1;
    } elsif (/^Event type:\s([0-9|a-f|A-F]{1,6})$/) {
        $current->{type} = lc $1;
    } elsif (/^Event category:\s(\w+)$/) {
        $current->{category} = $1;
    } elsif (/^Priority:\s(\w+)/) {
        $current->{priority} = $1;
    } elsif (/^Description:\s(.*)$/) {
        $current->{description} = $1;
    } elsif (/^Event specific codes:\s(.*)$/) {
        $current->{code} = $1;
    } elsif (/^Component location:\s(.*)$/) {
        $current->{location} = $1;
    } elsif (/^Logged by:\s.*(.)$/) {
        $current->{logged_by} = $1;
    } elsif (/^4[dD]\s45\s4[cC]\s48\s(\d\d)/) {
        $current->{version} = hex $1;
    }
}

for my $e (sort { $a <=> $b } keys %event) {
    say 'Sequence number:', $e;
    for my $k (sort keys %{ $event{$e} }) {
        say "$k: $event{$e}{$k}";
    }
}

可以通过构建一个大的正则表达式来匹配大部分细节来进一步简化:

my $regex = qr/
               Event\ type:\s(?<type>[0-9|a-f|A-F]{1,6})$
              |Event\ category:\s(?<category>\w+)$
              |Priority:\s(?<priority>\w+)
              |Description:\s(?<description>.*)$
              |Event\ specific\ codes:\s(?<code>.*)$
              |Component\ location:\s(?<location>.*)$
              |Logged\ by:\s.*(?<logged>.)$
              |4[dD]\s45\s4[cC]\s48\s(?<version>\d\d)
/x;

while (<$mel>) {
    next if /^$/;

    if (m{^Date/Time:\s(.*)$}) {
        if (defined $id) {
            $current->{type} = lc $current->{type}
                if exists $current->{type};
            $current->{version} = hex $current->{version}
                if exists $current->{version};
            $event{$id} = $current;
        }
        $current = { date => $1 };
    } elsif (/^Sequence number:\s(\d+)$/) {
        $id = $1;
    } elsif (/^$regex/) {
        $current->{ (keys %+)[0] } = (values %+)[0];
    } else {
        warn "Skipping: $_";
    }
}

【讨论】:

  • 一个不错的正则表达式课程。但是当没有下一个Date/Time 这样做时,我看不出这将如何在%event 中注册最后一条记录。
  • 非常感谢。数组的第一个答案有效。我看到使用散列更好,但我有大约 1000 行代码循环遍历这些值几次。将数组更改为哈希需要我重写所有代码。
  • 知道如何根据序列号删除重复的数组吗?
  • 啊,找到了:我的 %seen; @MEL = grep { ! $seen{$_->{sequence}}++ } @MEL;
  • @SiegfriedHepp:如果您使用第二种解决方案,它会为您处理唯一的。
【解决方案2】:

问题描述不完整。不清楚这些记录是否是同质的(都是相同的类型)。

如果上面的假设是正确的,那么任务就很简单了。

将文件拆分为记录,然后用事件编号作为键填充散列,并记录为值跳过重复项。

然后按key hash排序并输出记录。

use strict;
use warnings;
use feature 'say';

my %events;
my %seen;
my $data = do { local $/; <DATA> };

$data =~ s!\n(Date/Time)!\n\n$1!g;

my @data = split '\n\n', $data;

for my $record (@data) {
    my $event = get_event_n( $record );

    next if $seen{$event};

    $seen{$event}   = 1;
    $events{$event} = $record;
}

say '----- Sorted Events -----';

for my $event (sort keys %events) {
    say $events{$event};
    say '-' x 45;                 # record separator as visual indicator
}

sub get_event_n {
    my $record = shift;
    my $sequence;

    $record =~ /Sequence number:\s+(\d+)/;
    $sequence = $1;

    return $sequence;
}

__DATA__
Date/Time: 2/3/20, 12:19:20 PM
Sequence number: 230
Event type: 5023
Event category: Command
Priority: Informational
Event needs attention: false
Event send alert: false
Event visibility: true
Description: Controller return status/function call for requested operation
Event specific codes: b8/1/0
Component type: Controller
Component location: Shelf 99, Bay A
Logged by: Controller in bay A
Date/Time: 2/3/20, 12:18:20 PM
Sequence number: 200
Event type: 5023
Event category: Command
Priority: Informational
Event needs attention: false
Event send alert: false
Event visibility: true
Description: Controller return status/function call for requested operation
Event specific codes: b8/1/0
Component type: Controller
Component location: Shelf 99, Bay A
Logged by: Controller in bay A
Date/Time: 2/3/20, 12:18:25 PM
Sequence number: 205
Event type: 5023
Event category: Command
Priority: Informational
Event needs attention: false
Event send alert: false
Event visibility: true
Description: Controller return status/function call for requested operation
Event specific codes: b8/1/0
Component type: Controller
Component location: Shelf 99, Bay B
Logged by: Controller in bay B
Date/Time: 2/3/20, 12:18:28 PM
Sequence number: 209
Event type: 5023
Event category: Command
Priority: Informational
Event needs attention: false
Event send alert: false
Event visibility: true
Description: Controller return status/function call for requested operation
Event specific codes: b8/1/0
Component type: Controller
Component location: Shelf 92, Bay B
Logged by: Controller in bay B
Date/Time: 2/3/20, 12:18:25 PM
Sequence number: 205
Event type: 5023
Event category: Command
Priority: Informational
Event needs attention: false
Event send alert: false
Event visibility: true
Description: Controller return status/function call for requested operation
Event specific codes: b8/1/0
Component type: Controller
Component location: Shelf 99, Bay B
Logged by: Controller in bay B

【讨论】:

    【解决方案3】:

    我的答案是基于@choroba 的漂亮正则表达式,但我认为这个更简单:

    my $key = 'sequence';  #or other fields
    my $keep = 'first';    #or 'last' record with identical $key
    
    my $regex = qr{
       Date/Time:              \s* (?<date>.*)
      |Sequence\ number:       \s* (?<sequence>\d+)
      |Event\ type:            \s* (?<type>[0-9|a-f|A-F]{1,6})
      |Event\ category:        \s* (?<category>\w+)
      |Priority:               \s* (?<priority>\w+)
      |Description:            \s* (?<description>.*)
      |Event\ specific\ codes: \s* (?<code>.*)
      |Component\ location:    \s* (?<location>.*)
      |Logged\ by:             \s* (?<logged_by>.*)
      |4[dD]\s45\s4[cC]\s48\s(?<version>\d\d)
    }x;
    
    my @event=();
    while (<>) {
      m{^Date/Time:} and push @event, {};
      m{^$regex}     and @{$event[-1]}{keys %+} = values %+;
    }
    
    #special treatment for type and version: hex and lc
    exists $$_{type}    and $$_{type}    = hex $$_{type}    for @event;
    exists $$_{version} and $$_{version} = lc  $$_{version} for @event;
    
    #mark for deletion
    my %exists; $exists{$$_{$key}}++ and $$_{delete}=1
       for $keep eq 'first' ? @event
         : $keep eq 'last'  ? reverse(@event)
         : die "keep must be first or last";
    
    #delete those marked
    @event = grep !$$_{delete}, @event;
    
    #sort by $key
    @event = sort { $$a{$key} <=> $$b{$key} } @event;
    

    我猜想那个类型应该是hexed,版本应该是lced,而不是像问题中的相反。

    运行方式:

    perl script.pl input_file
    

    【讨论】:

      猜你喜欢
      • 2022-11-15
      • 1970-01-01
      • 1970-01-01
      • 1970-01-01
      • 1970-01-01
      • 1970-01-01
      • 1970-01-01
      • 2016-05-12
      • 1970-01-01
      相关资源
      最近更新 更多