Creating an array of arrays in Perl and deleting from the array: answers

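【Question title】: creating an array of arrays in perl and deleting from the array
【Posted】: 2016-09-15 11:40:55
【Question】:

I am writing this to avoid O(n!) time complexity, but so far I only have pseudocode, because there are some things I am not sure how to implement.

This is the format of the file I want to pass to this script. The data is sorted by the third column, the start position.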

93 Blue19 1 82
87 Green9 1 7912
76 Blue7 2 20690
65 Red4 2 170
...
...
256 Orange50 17515 66740
166 Teal68 72503 123150
228 Green89 72510 114530

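Explanation of the code:

I want to create an array of arrays to find when two pieces of information have overlapping lengths.

Columns 3 and 4 of the input file are the start and stop positions on a single track line. If any row(x) has a position in column 3 that is less than the position in column 4 of any row(y), then x starts before y ends and there is some overlap.

I want to find every row that overlaps with any other row without having to compare every row against every row. Because the rows are sorted, I simply add a string to the inner array of the array that represents one row. If the new row being looked at does not overlap with one of the rows already in the array, then (because the array is sorted by the third column) no later row can overlap with that row either, and it can be removed.

This is what I have in mind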

#!/usr/bin/perl -w

use strict;

my @array

while (<>) {

    my thisLoop = ($id, $name, $begin, $end) = split;
    my @innerArray = split; # make an inner array with the current line, to 
                            # have strings that will be printed after it

    push @array(@innerArray)

    for ( @array ) { # loop through the outer array being made to see if there 
                     # are overlaps with the current item

        if ( $begin > $innerArray[3]) # if there are no overlaps then print 
                                      # this inner array and remove it
                                      # (because it is sorted and everything
                                      # else cannot overlap because it is 
                                      # larger)
            # print @array[4-]
            # remove this item from the array
        else
            # add to array this string
            "$id overlap with innerArray[0] \t innerArray[0]: $innerArray[2], $innerArray[3] "\t" $id :  $begin, $end         
            # otherwise because there is overlap add a statement to the inner
            # array explaining the overlap

The code should produce something like

87 overlap with 93     93: 1 82      87: 1 7912
76 overlap with 93     93: 1 82      76: 2 20690
65 overlap with 93     93: 1 82      65: 2 170
76 overlap with 87     87: 1 7912    76: 2 20690
65 overlap with 87     87: 1 7912    65: 2 170
65 overlap with 76     76: 2 20690   65: 2 170
256 overlap with 76    76: 2 20690   256: 17515 66740
228 overlap with 166   166: 72503 123150   228: 72510 114530

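This is hard to explain, so ask me if you have any questions

【Comments】:

Related: Quickest way to determine range overlap in Perl, which also asks about finding overlaps between sets of ranges. In this case, I think you could compare your set against itself.
"If any row(x) has a position in column 3 that is less than the position in column 4 of any row(y), then x starts before y ends and there is some overlap" Are you sure? If row x starts at 10 and stops at 20, and row y starts at 30 and stops at 40, then 10 is less than 40 but there is no overlap. You may be right if your data is sorted in some way, but what you say is not true in general.
What do you mean by "single track line"?
@ThisSuitIsBlackNot: Great source. Thanks
@B.Monster: How many records do you have to process? How big is your file?

Tags: arrays perl multidimensional-array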

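【Solution 1】:

I used the posted input and output as a guide to what is wanted.

A note on complexity. In principle, every line has to be compared with all the lines that follow it. The number of operations actually performed depends on the data. Since the data is stated to be sorted on the field being compared, the inner loop can be cut short as soon as the overlap stops. There is a comment on the complexity estimate at the end.

This compares each line with the lines that follow it. To do that, all lines are first read into an array. If the data set is very large, it should be read line by line instead, with the procedure turned around so that the line just read is compared with all the previous ones. This is a very basic approach. It may well be better to build an auxiliary data structure first, possibly using a suitable library.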

use warnings;
use strict;

my $file = 'data_overlap.txt';
my @lines = do { 
    open my $fh, '<', $file or die "Can't open $file -- $!";
    <$fh>;
};

# For each element compare all following ones, but cut out 
# as soon as there's no overlap since data is sorted
for my $i (0..$#lines) 
{  
    my @ref_fields = split '\s+', $lines[$i];
    for my $j ($i+1..$#lines) 
    {   
        my @curr_fields = split '\s+', $lines[$j]; 
        if ( $ref_fields[-1] > $curr_fields[-2] ) { 
            print "$curr_fields[0] overlap with $ref_fields[0]\t" .
                "$ref_fields[0]: $ref_fields[-2] $ref_fields[-1]\t" .
                "$curr_fields[0]: $curr_fields[-2] $curr_fields[-1]\n";
        }   
        else { print "\tNo overlap, move on.\n"; last }
    }   
}

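With the input in the file 'data_overlap.txt' this prints

87 overlap with 93    93: 1 82    87: 1 7912
76 overlap with 93    93: 1 82    76: 2 20690
65 overlap with 93    93: 1 82    65: 2 170
    No overlap, move on.
76 overlap with 87    87: 1 7912    76: 2 20690
65 overlap with 87    87: 1 7912    65: 2 170
    No overlap, move on.
65 overlap with 76    76: 2 20690    65: 2 170
256 overlap with 76    76: 2 20690    256: 17515 66740
    No overlap, move on.
    No overlap, move on.
    No overlap, move on.
228 overlap with 166    166: 72503 123150    228: 72510 114530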
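The line-by-line variant mentioned earlier in this answer (compare the line just read against the previous ones) could look roughly like the sketch below. It is only an illustration, not the code that produced the output above: the sub name `overlaps_streaming` and the `@active` list are made up for the example, and it assumes the input is sorted by the start column and that a row stops being relevant once its end is not greater than the current start, which is the same test as in the loop above. It also shows the "array of arrays" part of the question: rows are pushed as array references and removed again with `grep`.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): takes lines sorted by
# start (3rd column) and returns one report string per overlapping pair.
sub overlaps_streaming {
    my @lines = @_;
    my (@active, @report);    # @active holds rows that may still overlap later rows

    for my $line (@lines) {
        my @cur = split ' ', $line;
        next unless @cur >= 4;    # skip blank or malformed lines

        # Delete rows that ended at or before the current start. Because the
        # input is sorted by start, they cannot overlap anything that follows.
        @active = grep { $_->[3] > $cur[2] } @active;

        # Every row still in @active overlaps the current row.
        push @report, "$cur[0] overlap with $_->[0]\t"
                    . "$_->[0]: $_->[2] $_->[3]\t"
                    . "$cur[0]: $cur[2] $cur[3]"
            for @active;

        push @active, \@cur;      # inner array: one reference per row
    }
    return @report;
}

# Assumed sample: the seven complete rows from the question.
my @sample = (
    "93 Blue19 1 82",           "87 Green9 1 7912",
    "76 Blue7 2 20690",         "65 Red4 2 170",
    "256 Orange50 17515 66740", "166 Teal68 72503 123150",
    "228 Green89 72510 114530",
);
print "$_\n" for overlaps_streaming(@sample);
```

This reports the same eight pairs, but grouped by the row just read rather than by the earlier row, so the order of the lines differs. Only `@active` is kept in memory, so in a real script the loop could read from a filehandle directly instead of taking a list.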

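Comments on complexity

Worst case: every element has to be compared with every other one (they all overlap). This means that for each element we need N-1 comparisons, and we have N elements. That is O(N^2) complexity. This complexity is not good for operations that are used often and on potentially large data sets, such as what libraries do. But it is not necessarily bad for a specific problem: the data set would still have to be very large before the run time became prohibitive.

Best case: each element is compared only once (there is no overlap at all). This means N comparisons, and thus O(N) complexity.

Average: let us assume that each element overlaps with a "few" of the following ones, say 3 (three). This means there will be 3N comparisons. That is still O(N) complexity. It holds as long as the number of comparisons does not depend on the length of the list (but is constant), which is a very reasonable typical scenario. This is good.

Thanks to ikegami for bringing this up in the comments, along with the estimate.

Keep in mind that how much the computational complexity of a technique matters depends on what it is used for.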
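The estimates above can be checked by counting comparisons directly. The sketch below is an assumption-laden illustration: `count_comparisons` is a hypothetical helper that runs the same early-exit double loop on synthetic `[start, end]` pairs instead of file lines. Because each element is only compared with the ones after it, the worst case comes out as N(N-1)/2 rather than N(N-1), which is still O(N^2).

```perl
use strict;
use warnings;

# Hypothetical helper: counts the pairwise comparisons made by the
# early-exit double loop on [start, end] pairs sorted by start.
sub count_comparisons {
    my @iv = @_;
    my $count = 0;
    for my $i (0 .. $#iv) {
        for my $j ($i + 1 .. $#iv) {
            $count++;
            last unless $iv[$i][1] > $iv[$j][0];    # no overlap: stop early
        }
    }
    return $count;
}

my $n = 100;
my @all_overlap = map { [ 1, 1000 ] } 1 .. $n;                # worst case
my @disjoint    = map { [ 10 * $_, 10 * $_ + 5 ] } 1 .. $n;   # best case

printf "worst: %d\n", count_comparisons(@all_overlap);   # N(N-1)/2 = 4950
printf "best:  %d\n", count_comparisons(@disjoint);      # N-1 = 99
```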

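【Comments】:

The OP's data is sorted by the start column. That is not very clear in the original post; hopefully it will be improved
@Borodin Oh ... then this "answer" may need to go altogether ... under review. Thanks!
Maybe not. I think we have both ignored the requirement to avoid a particular complexity, and rightly so, as that should be a goal only for experimental software. The OP does not explain any real constraints, and for all we know his file has only seven or so lines of data
@Borodin Right. Looking again, I do not think they can reach their complexity goal. If the data were unsorted, everything would have to be compared with everything. I do not see how one could know when to cut it off. As for the length, given what the input looks like, I think it is a reasonable guess that pulling it all into an array first is feasible.
Analysis: worst case: O(N^2). Average case: O(N), because runs of overlaps are usually short.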

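【Solution 2】:

Given the sample data as input, this produces the output you asked for. It runs in well under a millisecond

Do you have other constraints that you have not explained? Making your code run faster should not be an end in itself. There is nothing inherently wrong with O(n!) time complexity: it is the execution time that you have to consider, and if your code is fast enough then your job is done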

use strict;
use warnings 'all';

my @data = map [ split ], grep /\S/, <DATA>;

for my $i1 ( 0 .. $#data ) {

    my $v1 = $data[$i1];

    for my $i2 ( $i1 .. $#data ) {

        my $v2 = $data[$i2];

        next if $v1 == $v2;

        unless ( $v1->[3] < $v2->[2] or $v1->[2] > $v2->[3] ) {
            my $statement = sprintf "%d overlap with %d", $v2->[0], $v1->[0];
            printf "%-22s %d: %d %-7d %d: %d %-7d\n", $statement, @{$v1}[0, 2, 3], @{$v2}[0, 2, 3];

        }
    }
}

__DATA__
93 Blue19 1 82
87 Green9 1 7912
76 Blue7 2 20690
65 Red4 2 170
256 Orange50 17515 66740
166 Teal68 72503 123150
228 Green89 72510 114530

Output

87 overlap with 93     93: 1 82      87: 1 7912   
76 overlap with 93     93: 1 82      76: 2 20690  
65 overlap with 93     93: 1 82      65: 2 170    
76 overlap with 87     87: 1 7912    76: 2 20690  
65 overlap with 87     87: 1 7912    65: 2 170    
65 overlap with 76     76: 2 20690   65: 2 170    
256 overlap with 76    76: 2 20690   256: 17515 66740  
228 overlap with 166   166: 72503 123150  228: 72510 114530
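The `unless` test in this code is a general range-overlap check that does not depend on the input being sorted. Pulled out into a sub it might look like the sketch below; `ranges_overlap` is a hypothetical name used only for illustration. Note that this test treats ranges that merely touch (one ends exactly where the other starts) as overlapping, whereas the strict `>` comparison in Solution 1 does not. The sample data contains no such pair, so both give the same result here.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the closed ranges [s1, e1] and [s2, e2]
# share at least one position. The arguments may come in either order.
sub ranges_overlap {
    my ($s1, $e1, $s2, $e2) = @_;
    return !( $e1 < $s2 or $s1 > $e2 );
}

print ranges_overlap(1, 82, 1, 7912) ? "overlap\n" : "no overlap\n";   # rows 93 and 87
print ranges_overlap(10, 20, 30, 40) ? "overlap\n" : "no overlap\n";   # disjoint ranges
print ranges_overlap(1, 82, 82, 170) ? "overlap\n" : "no overlap\n";   # touching endpoints
```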

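【Comments】:

The file has tens of thousands of lines, so I do not think O(n!) would finish
I am not sure how you work out that it is O(n!). I would have thought it was O(n*n). It is very wrong to reject a simple solution without even testing it because you "think [it] won't finish"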