Is it okay to keep huge data in a Perl data structure? (Answers)

【Question title】: Is it okay to keep huge data in a Perl data structure?
【Posted】: 2017-05-26 15:22:33
【Question description】:

I receive some CSVs from clients. The average size of these CSV files is 20 MB.

The format is:

Cutomer1,Product1,cat1,many,other,info
Cutomer1,Product2,cat1,many,other,info
Cutomer1,Product2,cat2,many,other,info
Cutomer1,Product3,cat1,many,other,info
Cutomer1,Product3,cat7,many,other,info
Cutomer2,Product5,cat1,many,other,info
Cutomer2,Product5,cat1,many,other,info
Cutomer2,Product5,cat4,many,other,info
Cutomer3,Product7,cat,many,other,info

My current approach: I store all these records temporarily in a table and then query the table:

where customer='customer1' and product='product1'
where customer='customer1' and product='product2'
where customer='customer2' and product='product1'

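Problem: inserting into the database and then selecting takes too much time. A lot of stuff is happening, and it takes 10-12 minutes to process one CSV. I am currently using SQLite, and it is quite fast. But I think I would save more time if I removed the insertion and selection altogether.

I was wondering if it is okay to store this complete CSV in some complex Perl data structure?

The machines generally have 500 MB+ of free RAM.

【Discussion】: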

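How many queries do you need to run? Using SQLite here is a good approach, because you can also have it index things for you. If it is always the same query, building your own index in a Perl data structure is easy, but if there are multiple indexes you will need more memory. But if you don't index, your searches will be slow. DBI has a CSV driver, but that is probably slow as well.
10k queries per CSV. A lot of stuff is happening, and it takes 10-12 minutes to process one CSV. I am currently using SQLite, and it is quite fast. But I think I would save more time if I removed the insertion and selection altogether. I have never heard of "index in Perl data structure". Can you give a link?
You misunderstood me. What you need is a search index. What I meant is that you need to build your own search index of that kind and put it into a Perl data structure. See my answer. But the most important question is how many different queries you run.
For bulk insertion into a database, you usually want to use a feature that inserts directly from a CSV file. But if you get rid of the database afterwards anyway, you are better off avoiding the database altogether and just using a hash of hashes.
@simbabque, SQLite doesn't seem to have an SQL command for doing this, but using the sqlite3 tool to create the initial database from the CSV should be faster than doing it from Perl.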

Tags: perl

【Solution 1】:

If the query you have shown is the only kind of query you want to perform, this is straightforward.

my $orders; # I guess
while (my $row = <DATA> ) {
    chomp $row;
    my @fields = split /,/, $row;

    push @{ $orders->{$fields[0]}->{$fields[1]} }, \@fields; # or as a hashref, but that's larger
}

print join "\n", @{ $orders->{Cutomer1}->{Product1}->[0] }; # typo in cuStomer

__DATA__
Cutomer1,Product1,cat1,many,other,info
Cutomer1,Product2,cat1,many,other,info
Cutomer1,Product2,cat2,many,other,info
Cutomer1,Product3,cat1,many,other,info
Cutomer1,Product3,cat7,many,other,info
Cutomer2,Product5,cat1,many,other,info
Cutomer2,Product5,cat1,many,other,info
Cutomer2,Product5,cat4,many,other,info
Cutomer3,Product7,cat,many,other,info

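You simply build the index into a hash reference that is several levels deep. The first level is keyed by customer. It contains another hashref keyed by product, which holds the list of rows that match this index. You can then decide whether you want each row as a plain array ref, or whether you want to put a hash ref with keys there. I went with an array ref because it consumes less memory.

You can easily query it later. I included that above. Here is the output.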

Cutomer1
Product1
cat1
many
other
info
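The hashref variant mentioned above, where each row is stored with named keys instead of positions, might look like the following sketch. The column names in `@columns` and the helper `build_named_index` are assumptions made for illustration, since the CSV itself carries no header line. Each row costs more memory this way.

```perl
use strict;
use warnings;

# Column names are assumed for illustration; adjust them to the real CSV layout.
my @columns = qw(customer product category many other info);

# Build customer -> product -> [ row hashrefs ] from CSV lines.
sub build_named_index {
    my @lines = @_;
    my %orders;
    for my $line (@lines) {
        chomp $line;
        my %record;
        @record{@columns} = split /,/, $line;    # hash slice: one key per column
        push @{ $orders{ $record{customer} }{ $record{product} } }, \%record;
    }
    return \%orders;
}

my $orders = build_named_index(
    'Cutomer1,Product1,cat1,many,other,info',
    'Cutomer1,Product2,cat1,many,other,info',
    'Cutomer1,Product2,cat2,many,other,info',
);

# Rows are now addressed by name instead of by position.
my $first = $orders->{Cutomer1}{Product2}[0];
print "Category: $first->{category}\n";
```

The lookup by customer and product is unchanged; only the innermost element differs.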

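If you don't want to remember the indexes but have to write a lot of different queries, you could make variables (or even constants) that represent the magic numbers.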

use constant {
    CUSTOMER => 0,
    PRODUCT  => 1,
    CATEGORY => 2,
    MANY     => 3,
    OTHER    => 4,
    INFO     => 5,
};

# build $orders ...

my $res = $orders->{Cutomer1}->{Product2}->[0];

print "Category: " . $res->[CATEGORY];

The output is:

Category: cat1

To sort the results, you can use Perl's sort. If you need to sort by two columns, there are answers on SO that explain how to do that.

for my $res ( 
    sort { $a->[OTHER] cmp $b->[OTHER] } 
    @{ $orders->{Cutomer2}->{Product5} } 
) {
    # do stuff with $res ...
}
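For the two-column case, one common idiom is to chain the comparisons with `or`, so the second column is only consulted when the first compares equal. The sketch below assumes the same column positions as the constants above; the helper name `sort_by_category_then_other` and the varied values in the `other` column are made up for illustration.

```perl
use strict;
use warnings;
use constant { CATEGORY => 2, OTHER => 4 };

# Sort rows by category first, then by the "other" column as a tie-breaker.
# Must be called in list context.
sub sort_by_category_then_other {
    my ($rows) = @_;
    return sort {
        $a->[CATEGORY] cmp $b->[CATEGORY]
            or
        $a->[OTHER] cmp $b->[OTHER]
    } @{$rows};
}

# Hypothetical rows; the "other" values differ so the tie-breaker is visible.
my @rows = (
    [qw(Cutomer2 Product5 cat4 many zeta info)],
    [qw(Cutomer2 Product5 cat1 many beta info)],
    [qw(Cutomer2 Product5 cat1 many alpha info)],
);

print join( ',', @{$_} ), "\n" for sort_by_category_then_other( \@rows );
```

For numeric columns, `<=>` would replace `cmp` in the corresponding comparison.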

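However, you can only search by customer and product like this.

If there is more than one type of query, this gets expensive. If you also want to group the rows by category only, you would either have to iterate over all of them every single time you look one up, or build a second index. Doing that is harder than waiting a few more seconds, so you probably don't want to do it.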
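To make the cost of a second index concrete, here is a minimal sketch that fills two hashes in the same pass. The helper `build_indexes` and the name `%by_category` are hypothetical. Both indexes point at the same row arrays, so the rows are not duplicated, but every additional index still adds its own hash and per-key arrays.

```perl
use strict;
use warnings;
use constant { CUSTOMER => 0, PRODUCT => 1, CATEGORY => 2 };

# Build two indexes over the same row arrays. The rows themselves are shared
# by reference, so the extra cost is the second hash, not a copy of the data.
sub build_indexes {
    my @lines = @_;
    my ( %by_customer_product, %by_category );
    for my $line (@lines) {
        chomp $line;
        my $fields = [ split /,/, $line ];
        push @{ $by_customer_product{ $fields->[CUSTOMER] }{ $fields->[PRODUCT] } }, $fields;
        push @{ $by_category{ $fields->[CATEGORY] } }, $fields;
    }
    return ( \%by_customer_product, \%by_category );
}

my ( $orders, $by_category ) = build_indexes(
    'Cutomer1,Product1,cat1,many,other,info',
    'Cutomer1,Product2,cat2,many,other,info',
    'Cutomer2,Product5,cat1,many,other,info',
);

printf "%d rows in cat1\n", scalar @{ $by_category->{cat1} };
```

Without `%by_category`, the same question would require walking every customer and every product in `$orders`.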

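I was wondering if it is okay to store this complete CSV in some complex Perl data structure?

For this specific purpose, absolutely. 20 MB is not a lot.

I created a test file of 20004881 bytes and 447848 lines with the following code, which is not perfect but gets the job done.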

use strict;
use warnings;
use feature 'say';
use File::stat;

open my $fh, '>', 'test.csv' or die $!;
while ( stat('test.csv')->size < 20_000_000 ) {
    my $customer = 'Customer' . int rand 10_000;
    my $product  = 'Product' . int rand 500;
    my $category = 'cat' . int rand 7;
    say $fh join ',', $customer, $product, $category, qw(many other info);
}

Here is an excerpt of the file:

$ head -n 20 test.csv
Customer2339,Product176,cat0,many,other,info
Customer2611,Product330,cat2,many,other,info
Customer1346,Product422,cat4,many,other,info
Customer1586,Product109,cat5,many,other,info
Customer1891,Product96,cat5,many,other,info
Customer5338,Product34,cat6,many,other,info
Customer4325,Product467,cat6,many,other,info
Customer4192,Product239,cat0,many,other,info
Customer6179,Product373,cat2,many,other,info
Customer5180,Product302,cat3,many,other,info
Customer8613,Product218,cat1,many,other,info
Customer5196,Product71,cat5,many,other,info
Customer1663,Product393,cat4,many,other,info
Customer6578,Product336,cat0,many,other,info
Customer7616,Product136,cat4,many,other,info
Customer8804,Product279,cat5,many,other,info
Customer5731,Product339,cat6,many,other,info
Customer6865,Product317,cat2,many,other,info
Customer3278,Product137,cat5,many,other,info
Customer582,Product263,cat6,many,other,info

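Now let's run the above program with this input file and look at the memory consumption and some statistics on the size of the data structure.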

use strict;
use warnings;
use feature 'say';
use Devel::Size 'total_size';

use constant {
    CUSTOMER => 0,
    PRODUCT  => 1,
    CATEGORY => 2,
    MANY     => 3,
    OTHER    => 4,
    INFO     => 5,
};

open my $fh, '<', 'test.csv' or die $!;

my $orders;
while ( my $row = <$fh> ) {
    chomp $row;
    my @fields = split /,/, $row;

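    # Note: plain assignment keeps only the last row seen for each
    # customer/product pair; the earlier examples use push to keep all rows.
    $orders->{ $fields[0] }->{ $fields[1] } = \@fields;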
}

say 'total size of $orders: ' . total_size($orders);

Here it is:

total size of $orders: 185470864

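So the variable consumes about 185 megabytes. That is a lot more than the 20 MB CSV file, but we have an easily searchable index. Using htop I found that the actual process consumes 287 MB. My machine has 16 GB of memory, so I don't care about that. The program takes about 3.6 seconds to run, but I have an SSD and a fairly new Core i7 machine.

But it will not eat all your memory if you have 500 MB to spare. The SQLite approach will likely consume less memory, but you would have to benchmark the speed of this approach against the SQLite approach to decide which one is faster.

I used the method described in this answer to read the file into an SQLite database¹. I needed to add a header line to the file first, but that is trivial.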

$ sqlite3 test.db
SQLite version 3.11.0 2016-02-15 17:29:24
Enter ".help" for usage hints.
sqlite> .mode csv test
sqlite> .import test.csv test

Since I couldn't measure this properly, let's say it felt like about 2 seconds. Then I added an index for the specific query.

sqlite> CREATE INDEX foo ON test ( customer, product );

That felt like it took another second. Now I could query.

sqlite> SELECT * FROM test WHERE customer='Customer23' AND product='Product1';
Customer23,Product1,cat2,many,other,info

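The result appeared instantly (this is not scientific!). Since we did not measure how long retrieval from the Perl data structure takes, we cannot compare them, but it feels like it all takes about the same time.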
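The missing measurement on the Perl side could be taken roughly as follows. This is only a sketch under stated assumptions: it builds a small synthetic index of the same customer, product, rows shape rather than loading test.csv, the sizes are arbitrary, and `time_lookups` is a hypothetical helper. The 10,000 lookups mirror the "10k queries per CSV" figure from the comments. Timing uses the core Time::HiRes module.

```perl
use strict;
use warnings;
use Time::HiRes qw(time);

# Synthetic index in the customer -> product -> [rows] shape used above.
# The sizes are arbitrary and only serve to make the timing non-trivial.
my %orders;
for my $c ( 0 .. 999 ) {
    for my $p ( 0 .. 49 ) {
        push @{ $orders{"Customer$c"}{"Product$p"} },
            [ "Customer$c", "Product$p", 'cat' . ( $p % 7 ), qw(many other info) ];
    }
}

# Run $count random lookups and return (elapsed seconds, number of hits).
sub time_lookups {
    my ( $index, $count ) = @_;
    my $hits  = 0;
    my $start = time();
    for ( 1 .. $count ) {
        my $customer = 'Customer' . int( rand 1_000 );
        my $product  = 'Product' . int( rand 50 );

        # exists() at each level avoids autovivifying keys during the check
        $hits++
            if exists( $index->{$customer} )
            && exists( $index->{$customer}{$product} );
    }
    return ( time() - $start, $hits );
}

my ( $elapsed, $hits ) = time_lookups( \%orders, 10_000 );
printf "%d lookups, %d hits, %.4f seconds\n", 10_000, $hits, $elapsed;
```

A fair comparison would run the same set of keys through a prepared SELECT via DBI and time that loop the same way.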

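However, the SQLite file size is only 38839296 bytes, which is about 39 MB. That is bigger than the CSV file, but not by a lot. It seems the sqlite3 process only consumes about 30 kB of memory, which I find odd given the index.

In conclusion, SQLite seems a bit more convenient and eats less memory. There is nothing wrong with doing this in Perl, and it might be the same speed, but using SQL for this type of query feels more natural, so I would go with that.

If I may be so bold, I would assume that you did not set an index on your table when you did this in SQLite, and that is what made it take longer. The number of rows we have here is not that large, even for SQLite. Properly indexed, it's a piece of cake.

If you don't actually know what an index does, think of a phone book. It has an index of first letters on the sides of the pages. To find John Doe, you grab D and then look through that section. Now imagine there was no such thing. You would need to poke around at random a lot more. And then try to find the person with the phone number 123-555-1234. That is what your database does if there is no index.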

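¹) If you want to script this, you can also pipe or read the commands into the sqlite3 utility to create the database, then use Perl's DBI to do the querying. For example, sqlite3 foo.db <<<'.tables\ .tables' (where the backslash \ represents a literal line break) prints the list of tables twice, so importing like this will work as well.

【Discussion】: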

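Hey, I liked your answer, but my main question didn't get answered: I was wondering if it is okay to store this complete CSV in some complex Perl data structure? I have never created a data structure of this size.
@GrSrv Oh. Right, I did not answer that. Well, you are okay to do whatever you want as long as your machine doesn't crash. The question you should be asking is whether it is useful to do so. ;) I'll add more to the answer.
@GrSrv I have finished updating the answer. There is a benchmark, some thoughts on the feasibility of the approach, and a comparison to SQLite. Have fun. :)
Wow! This is simply awesome. Thank you very much :-). I am using an index in SQLite. I had to ask this question because currently I can process about 150 CSVs a day, and I am looking for areas I can optimize to increase speed. In your example you import the CSV directly into SQLite, which is not an option for me. I have to use a CSV module and insert into SQLite line by line, which takes quite a long time.
@GrSrv Why is that the case? Are you doing it with DBI in Perl? If it makes things faster, then I would say there is nothing wrong with shelling out and calling the sqlite3 tool. Just document properly why you are doing it. If you really cannot and want DBI, there should be faster ways than going line by line. For example, do blocks of 1000 rows per execute. Or do it all at once in a transaction. In any case, put the index in first so it gets built as you go, which should be faster overall than adding it later as I did in my test.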
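The transaction suggestion from the last comment could be sketched with DBI roughly like this. It assumes the DBD::SQLite driver is installed and uses an in-memory database so the example stands alone. The table name `test` and index name `foo` follow the answer; the column names beyond `customer` and `product`, and the helper `insert_rows`, are assumptions. Whether creating the index before or after the load is faster is worth measuring for the real data.

```perl
use strict;
use warnings;
use DBI;

# In-memory database for the sketch; a real script would use a file name.
my $dbh = DBI->connect( 'dbi:SQLite:dbname=:memory:', '', '',
    { RaiseError => 1, AutoCommit => 1 } );

$dbh->do('CREATE TABLE test (customer, product, category, many, other, info)');

# Index created up front, as the comment suggests.
$dbh->do('CREATE INDEX foo ON test (customer, product)');

# Insert all rows inside a single transaction with one prepared statement,
# instead of paying for an implicit commit after every row.
sub insert_rows {
    my ( $dbh, @lines ) = @_;
    my $sth = $dbh->prepare('INSERT INTO test VALUES (?, ?, ?, ?, ?, ?)');
    $dbh->begin_work;
    for my $line (@lines) {
        chomp $line;
        $sth->execute( split /,/, $line, -1 );    # -1 keeps trailing empty fields
    }
    $dbh->commit;
    return scalar @lines;
}

insert_rows(
    $dbh,
    'Cutomer1,Product1,cat1,many,other,info',
    'Cutomer1,Product2,cat1,many,other,info',
    'Cutomer1,Product2,cat2,many,other,info',
);

my $rows = $dbh->selectall_arrayref(
    'SELECT * FROM test WHERE customer = ? AND product = ?',
    undef, 'Cutomer1', 'Product2' );
print join( ',', @{$_} ), "\n" for @{$rows};
```

The plain `split` stands in for a real CSV parser here; input with quoted or embedded commas would need the CSV module the asker already uses.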