What is 4/16 in hashes? Answers

【Question title】: What is 4/16 in hashes?
【Posted】: 2015-07-28 06:19:46
【Question description】:

if (%hash){
     print "That was a true value!\n";
}
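That will be true if (and only if) the hash has at least one key/value pair.

The actual result is an internal debugging string useful to the people who maintain Perl. It looks something like "4/16", but the value is guaranteed to be true when the hash is nonempty, and false when it is empty. --Camel book

What is this 4/16? Can anyone show me a small program from which I can see that the result is 4/16?

【Question comments】:

I think "4/16" is just an example of a fraction. The book should have said "it is a fraction, such as 4/16 or 1/8."

Tags: perl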

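【Solution 1】:

From perldoc perldata:

If you evaluate a hash in scalar context, it returns false if the hash is empty. If there are any key/value pairs, it returns true; more precisely, the value returned is a string consisting of the number of used buckets and the number of allocated buckets, separated by a slash. This is pretty much useful only to find out whether Perl's internal hashing algorithm is performing poorly on your data set. For example, you stick 10,000 things in a hash, but evaluating it in scalar context reveals "1/16", which means only one out of sixteen buckets has been touched, and presumably contains all 10,000 of your items.

So 4/16 would be the used/allocated bucket count, and something like the following will display such a value: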

%hash = (1, 2);
print scalar(%hash); #prints 1/8 here
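
Note that on Perl 5.26 and later a hash in scalar context returns the number of keys instead of this string, and the old value is available from Hash::Util::bucket_ratio. Here is a minimal sketch that tries to work on both old and new perls. The helper name ratio_of is hypothetical, and the "1/8" in the comment is only what I would expect for a single key:

```perl
use strict;
use warnings;
use Hash::Util ();

# Hypothetical helper: return the "used/allocated" bucket string.
sub ratio_of {
    my ($hash_ref) = @_;

    # bucket_ratio() only exists in newer Hash::Util releases (Perl 5.26+);
    # older perls still return the ratio from scalar(%hash) itself.
    my $bucket_ratio = Hash::Util->can('bucket_ratio');
    return $bucket_ratio ? $bucket_ratio->($hash_ref) : scalar(%$hash_ref);
}

my %hash = (1, 2);
print "bucket ratio: ", ratio_of(\%hash), "\n";    # e.g. 1/8
print "key count:    ", scalar(keys %hash), "\n";
```

Calling the function through the code reference sidesteps its prototype, so the same call works whether or not the function is present.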

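【Comments】:

To find the number of entries in the hash, use scalar( keys %hash ).

【Solution 2】:

A hash is an array of linked lists. A hash function converts the key into a number, which is used as the index of the array element (the "bucket") in which the value is stored. More than one key can hash to the same index (a "collision"), a situation handled by the linked lists.

The denominator of the fraction is the total number of buckets.

The numerator of the fraction is the number of buckets that contain one or more elements.

For hashes with the same number of elements, the higher the numerator, the better. The one that returns 6/8 has fewer collisions than the one that returns 4/8.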
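
To make the 6/8 versus 4/8 comparison concrete, here is a small sketch that models a table as a plain list of buckets. The two layouts and the helper name summarise are made up for illustration; this is not how Perl stores its hashes internally:

```perl
use strict;
use warnings;

# Summarise a toy table given as a list of buckets (each an arrayref of keys).
sub summarise {
    my (@buckets) = @_;
    my $keys = 0;
    my $used = 0;
    for my $bucket (@buckets) {
        $keys += @$bucket;
        $used++ if @$bucket;
    }

    # Every key beyond the first in a bucket collided with an earlier key.
    return {
        ratio      => $used . "/" . scalar(@buckets),
        collisions => $keys - $used,
    };
}

# Two hypothetical layouts of the same six keys across eight buckets.
my $spread  = summarise([qw(a)], [qw(b)], [qw(c)], [qw(d)], [qw(e)], [qw(f)], [], []);
my $crowded = summarise([qw(a b)], [qw(c d)], [qw(e)], [qw(f)], [], [], [], []);

printf "%s -> %d collisions\n", $_->{ratio}, $_->{collisions} for $spread, $crowded;
```

With the same six keys, the 6/8 layout has no collisions and the 4/8 layout has two.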

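【Comments】:

This is much more concise than me writing a terrible hash table implementation to demonstrate what is going on. Awesome.
Note that the sentence about fewer collisions is wrong. This comment space is too small, so I provided a better, more accurate answer.
@rurban, did you miss my qualifier "for hashes with the same number of elements"? One of the hashes has two fewer used buckets than the other for the same number of elements. The one that returns 4/8 must have at least two more collisions than the one that returns 6/8.

【Solution 3】:

This is a slightly modified version of an email I sent to the Perl Beginners mailing list in answer to the same question.

Saying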

my $hash_info = %hash;

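will give you either 0 (if the hash is empty) or the ratio of used to total buckets. This information is almost, but not completely, useless to you. To understand what it means, you must first understand how hashing works.

Let's implement a hash using Perl 5. The first thing we need is a hashing function. A hashing function turns a string into a, hopefully, unique number. Examples of really strong hashing functions are MD5 and SHA1, but they tend to be too slow for common use, so people tend to use weaker functions (i.e. ones that produce less unique output) for hash tables. Perl 5 uses Bob Jenkins' one-at-a-time algorithm, which is a nice tradeoff between uniqueness and speed. For our example, I will use a very weak hashing function: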

#!/usr/bin/perl

use strict;
use warnings;

sub weak_hash {
       my $key  = shift;
       my $hash = 1;
       #multiply every character in the string's ASCII/Unicode value together
       for my $character (split //, $key) {
               $hash *= ord $character;
       }
       return $hash;
}

for my $string (qw/cat dog hat/) {
       print "$string hashes to ", weak_hash($string), "\n";
}

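Because hashing functions tend to return numbers from a larger range than we want, you generally use modulo to reduce the range of numbers they return: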

#!/usr/bin/perl

use strict;
use warnings;

sub weak_hash {
       my $key  = shift;
       my $hash = 1;
       #multiply every character in the string's ASCII/Unicode value together
       for my $character (split //, $key) {
               $hash *= ord $character;
       }
       return $hash;
}

for my $string (qw/cat dog hat/) {
       # the % operator is constraining the number
       # weak_hash returns to 0 - 10
       print "$string hashes to ", weak_hash($string) % 11, "\n";
}

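Now that we have a hashing function, we need somewhere to save the key and the value. This is called a hash table. The hash table is often an array whose elements are called buckets (these are the buckets that the ratio is talking about). A bucket holds all of the key/value pairs that hash to the same number: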

#!/usr/bin/perl

use strict;
use warnings;

sub weak_hash {
       my $key  = shift;
       my $hash = 1;
       for my $character (split //, $key) {
               $hash *= ord $character;
       }
       return $hash;
}

sub create {
       my ($size) = @_;

       my @hash_table;

       #set the size of the array
       $#hash_table = $size - 1;

       return \@hash_table;
}


sub store {
       my ($hash_table, $key, $value) = @_;

       #create an index into $hash_table
       #constrain it to the size of the hash_table
       my $hash_table_size = @$hash_table;
       my $index           = weak_hash($key) % $hash_table_size;

       #push the key/value pair onto the bucket at the index
       push @{$hash_table->[$index]}, {
               key   => $key,
               value => $value
       };

       return $value;
}

sub retrieve {
       my ($hash_table, $key) = @_;

       #create an index into $hash_table
       #constrain it to the size of the hash_table
       my $hash_table_size = @$hash_table;
       my $index           = weak_hash($key) % $hash_table_size;

       #get the bucket for this key/value pair
       my $bucket = $hash_table->[$index];

       #find the key/value pair in the bucket
       for my $pair (@$bucket) {
               return $pair->{value} if $pair->{key} eq $key;
       }

       #if key isn't in the bucket:
       return undef;
}

sub list_keys {
       my ($hash_table) = @_;

       my @keys;

       for my $bucket (@$hash_table) {
               for my $pair (@$bucket) {
                       push @keys, $pair->{key};
               }
       }

       return @keys;
}

sub print_hash_table {
       my ($hash_table) = @_;

       for my $i (0 .. $#$hash_table) {
               print "in bucket $i:\n";
               for my $pair (@{$hash_table->[$i]}) {
                       print "$pair->{key} => $pair->{value}\n";
               }
       }
}

my $hash_table = create(3);

my $i = 0;
for my $key (qw/a b c d g j/) {
       store($hash_table, $key, $i++);
}
print_hash_table($hash_table);

print "the a key holds: ", retrieve($hash_table, "a"), "\n";

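As we can see from this example, it is possible for one bucket to have many more key/value pairs than the others. This is a bad situation to be in: it makes the hash slow for that bucket. This is one of the uses of the ratio of used to total buckets that hashes return in scalar context. If the hash says that only a few buckets are being used, but there are lots of keys in the hash, then you know you have a problem.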
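
Here is a rough sketch of how the toy table could report such a ratio itself. It reuses the weak_hash function from above. The helper bucket_ratio_of and the choice of keys are my own additions for illustration. The keys are anagrams, which all collide because multiplication does not care about character order:

```perl
use strict;
use warnings;

# Same deliberately weak hash function as in the examples above.
sub weak_hash {
    my $key  = shift;
    my $hash = 1;
    $hash *= ord $_ for split //, $key;
    return $hash;
}

# Hypothetical helper: build the "used/total" string for a toy table in
# which each bucket is either undef or an arrayref of key/value pairs.
sub bucket_ratio_of {
    my ($hash_table) = @_;
    my $used = grep { defined $_ && @$_ } @$hash_table;
    return $used . "/" . scalar(@$hash_table);
}

my @hash_table = (undef) x 8;
my @keys       = qw/abc acb bac bca cab cba/;

for my $key (@keys) {
    my $index = weak_hash($key) % @hash_table;
    push @{ $hash_table[$index] }, { key => $key, value => 1 };
}

print scalar(@keys), " keys, bucket ratio ", bucket_ratio_of(\@hash_table), "\n";
```

Six keys but a ratio of 1/8 is exactly the warning sign described above: every lookup has to walk the same bucket.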

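To learn more about hashes, ask questions here about what I have said, or read about them.

【Comments】:

【Solution 4】:

Adding another answer because the first one is already too long.

Another way to see what "4/16" means is to use the Hash::Esoteric module (warning: alpha-quality code). I wrote it to get a better view of what is going on inside hashes, so that I could try to understand a performance problem that large hashes seem to have. The keys_by_bucket function from Hash::Esoteric returns all of the keys in a hash, but rather than returning them as a list the way keys does, it returns them as an AoA in which the top level represents the buckets and each inner arrayref holds the keys in that bucket.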

#!/usr/bin/env perl

use strict;
use warnings;

use Hash::Esoteric qw/keys_by_bucket/;

my %hash = map { $_ => undef } "a" .. "g";
my $buckets = keys_by_bucket \%hash;

my $used = 0;
for my $i (0 .. $#$buckets) {
    if (@{$buckets->[$i]}) {
        $used++;
    }
    print "bucket $i\n";
    for my $key (@{$buckets->[$i]}) {
        print "\t$key\n";
    }
}

print "scalar %hash: ", scalar %hash, "\n",
      "used/total buckets: $used/", scalar @$buckets, "\n";

The code above prints something like this (the actual data will depend on your version of Perl):

bucket 0
    e
bucket 1
    c
bucket 2
    a
bucket 3
    g
    b
bucket 4
bucket 5
    d
bucket 6
    f
bucket 7
scalar %hash: 6/8
used/total buckets: 6/8

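【Comments】:

Yes, I misunderstood what HvMAX returns (I thought it was the number of buckets, but it is actually the index of the last bucket). The version on GitHub should be correct now.

【Solution 5】:

The fraction is the fill rate of the hash: used buckets versus allocated buckets. It is also sometimes called the load factor.

To actually get "4/16" you need a trick. 4 keys will lead to 8 buckets, so you need at least 9 keys, and then you delete 5 of them.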

$ perl -le'%h=(0..16); print scalar %h; delete $h{$_} for 0..8; print scalar %h'
9/16
4/16

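Note that your numbers will vary, because the seed is randomized and you cannot predict the exact collisions.

The fill rate is the critical piece of hash information for deciding when to rehash. Perl 5 rehashes at a fill rate of 100%; see the DO_HSPLIT macro in hv.c. It thus trades memory for read-only speed. A normal fill rate would be between 80% and 95%. You always leave holes to avoid some collisions. A lower fill rate leads to faster access (fewer collisions), but to more rehashes.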
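
To watch the rehashing happen, one option is to record the bucket count as keys are added. This is only a sketch, assuming a Perl whose Hash::Util exports bucket_info. The key names and the limit of 64 are arbitrary, and the points at which the table grows depend on the Perl version, so I show no expected output:

```perl
use strict;
use warnings;
use Hash::Util qw(bucket_info);

my %hash;
my $last_buckets = 0;
my @growth;    # [key count, bucket count], recorded whenever the table grows

for my $n (1 .. 64) {
    $hash{"key$n"} = 1;

    # The second value returned by bucket_info is the number of allocated buckets.
    my (undef, $buckets) = bucket_info(\%hash);
    next if $buckets == $last_buckets;

    push @growth, [$n, $buckets];
    $last_buckets = $buckets;
}

print "after $_->[0] keys: $_->[1] buckets\n" for @growth;
```

Each line of output marks a point at which the table was resized. The bucket count should double every time.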

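You cannot see the number of collisions directly from the fraction. You also need keys %hash, to compare it with the numerator of the fraction, the number of used buckets.

So one part of the collision quality is keys / used buckets: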

my ($used, $max) = split '/', scalar(%hash);
my $keys_per_used_bucket = keys(%hash) / $used;

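But in reality you need to know the sum of the lengths of all the linked lists in the buckets. You can get at this quality measure via Hash::Util::bucket_info:

($keys, $buckets, $used, @length_count) = Hash::Util::bucket_info(\%hash);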
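
As a sketch of how those return values might be used (again assuming a Perl whose Hash::Util provides bucket_info; the 100 numeric keys are just sample data), the collision count and the longest chain can be derived like this:

```perl
use strict;
use warnings;
use Hash::Util qw(bucket_info);

my %hash = map { $_ => 1 } 1 .. 100;

# $length_counts[$n] is the number of buckets that hold exactly $n keys.
my ($keys, $buckets, $used, @length_counts) = bucket_info(\%hash);

# Keys that had to share a bucket with an earlier key.
my $collisions = $keys - $used;

# Longest chain: the highest chain length that at least one bucket has.
my ($longest) = grep { $length_counts[$_] } reverse 0 .. $#length_counts;

printf "%d keys in %d of %d buckets, %d collisions, longest chain %d\n",
    $keys, $used, $buckets, $collisions, $longest;
```

The numbers differ from run to run because of the randomized hash seed.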

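While hash access is normally O(1), with long chains it is only O(n/2), especially for overlong buckets. At https://github.com/rurban/perl-hash-stats I provide statistics on the collision quality of various hash functions for the perl5 core test suite data. I have not yet tested the trade-offs of different fill rates, because I am completely rewriting the current hash tables.

Update: For perl5 a better fill rate than 100% would be 90%, as recently tested. But this depends on the hash function used. I used a fast and bad one: FNV1A. With better, slower hash functions you can use higher fill rates. The current default OOAT_HARD is both bad and slow, so it should be avoided.

【Comments】:

【Solution 6】:

(%hash) evaluates the hash in scalar context.

Here is an empty hash: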

command_line_prompt> perl -le '%hash=(); print scalar %hash;'

The result is 0.

Here is a non-empty hash:

command_line_prompt> perl -le '%hash=(foo=>'bar'); print scalar %hash;'

The result is the string "1/8".

【Comments】:

This is not an answer. You are really just restating the question.