What hash algorithm does .NET use? What about Java? (Answers)

【Question title】: What hash algorithm does .net utilise? What about java?
【Posted】: 2009-05-21 18:43:53
【Question description】:

Regarding HashTable (and the later derivatives of this class), does anyone know what hashing algorithm .NET and Java use?

Are List and Dictionary both direct descendants of Hashtable?

【Question discussion】:

Tags: c# java hashtable

【Solution 1】:

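The hash function is not built into the hash table; the hash table calls a method on the key object to compute the hash. So the hash function varies depending on the type of the key object.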
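To make that concrete, here is a minimal Java sketch. `PointKey` and its `hashCalls` counter are hypothetical names made up for this illustration; the only point is that `HashMap` asks the key for its hash rather than computing one itself.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical key type: the map never hashes it itself, it calls hashCode().
final class PointKey {
    static int hashCalls = 0; // illustration only: counts calls made to hashCode()

    final int x;
    final int y;

    PointKey(int x, int y) {
        this.x = x;
        this.y = y;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof PointKey)) {
            return false;
        }
        PointKey other = (PointKey) o;
        return x == other.x && y == other.y;
    }

    @Override
    public int hashCode() {
        hashCalls++;
        return 31 * x + y; // the key type decides how it is hashed
    }

    public static void main(String[] args) {
        Map<PointKey, String> map = new HashMap<>();
        map.put(new PointKey(1, 2), "a");
        // A different but equal instance finds the same entry.
        System.out.println(map.get(new PointKey(1, 2)));
        // Non-zero: the map asked the keys for their hash codes.
        System.out.println(hashCalls > 0);
    }
}
```

Swapping in a different key class changes the hash function without touching the map.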

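In Java, a List is not a hash table (that is, it does not extend the Map interface). You could implement a List backed internally by a hash table (a sparse list, where the list index is the key into the hash table), but no such implementation is part of the standard Java library.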

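【Discussion】:

Java's hash tables do include a SECONDARY hash function, though, of course, to guard against hash functions whose randomness is not in the low-order bits.

【Solution 2】: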

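I know nothing about .NET, but I'll try to speak for Java.

In Java, the hash code is ultimately a combination of the code returned by a given object's hashCode() function and a secondary hash function inside the HashMap/ConcurrentHashMap class (interestingly, the two use different functions). Note that Hashtable and Dictionary (the predecessors of HashMap and AbstractMap) are obsolete classes. A list is really just "something else".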

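For example, the String class builds its hash code by repeatedly multiplying the current code by 31 and adding the next character. For more information, see my article on how the String hash function works. Numbers generally use "themselves" as the hash code; other classes, e.g. a Rectangle with a combination of fields, usually use a variant of the String technique: multiply by a small prime and add, but adding in the various field values. (Choosing a prime means you are unlikely to get "accidental interactions" between certain values and the hash code width, since they don't divide by anything.)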
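The multiply-by-31-and-add scheme can be checked directly against `String.hashCode()`. A minimal sketch; `StringHashSketch` is a name made up for this illustration:

```java
final class StringHashSketch {
    // Recomputes the hash as described above: h = 31 * h + nextChar.
    static int hash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + s.charAt(i); // int overflow simply wraps around
        }
        return h;
    }

    public static void main(String[] args) {
        String s = "hello";
        System.out.println(hash(s) == s.hashCode()); // prints "true"
    }
}
```

The same loop shape, with field values in place of characters, gives the Rectangle-style hash mentioned above.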

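Since the size of the hash table, i.e. the number of "buckets" it has, is a power of two, the bucket number is derived from the hash code essentially by lopping off the top bits until the hash code is in range. The secondary hash function guards against hash functions in which all or most of the randomness lies in those top bits, by "spreading the bits around" so that some of the randomness ends up in the lower bits and doesn't get lopped off. String hash codes would actually work fairly well without this mixing, but user-created hash codes may not. Note that if two different hash codes resolve to the same bucket number, Java's HashMap implementations use the "chaining" technique, i.e. they create a linked list of entries in each bucket. It is therefore important for hash codes to have good randomness, so that items don't cluster into a particular range of buckets. (However, even with a perfect hash function, you would still expect some chaining to occur by the law of averages.)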
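A small sketch of the two steps, assuming a table of 16 buckets. The `spread` step shown is the one found in OpenJDK 8 and later `HashMap` (`h ^ (h >>> 16)`); older versions and `ConcurrentHashMap` mix the bits differently, so treat it as one example of a secondary hash rather than the only one. The class and method names are made up for this illustration.

```java
final class BucketIndexSketch {
    // Secondary hash: fold the high 16 bits into the low 16 bits.
    static int spread(int h) {
        return h ^ (h >>> 16);
    }

    // Keep only the low bits; tableSize must be a power of two.
    static int bucketIndex(int hash, int tableSize) {
        return hash & (tableSize - 1);
    }

    public static void main(String[] args) {
        int a = 0x10000; // these two hash codes differ only in their high bits
        int b = 0x20000;

        // Without mixing, both land in bucket 0.
        System.out.println(bucketIndex(a, 16) + " " + bucketIndex(b, 16)); // prints "0 0"

        // With mixing, the high-bit difference reaches the low bits.
        System.out.println(bucketIndex(spread(a), 16) + " " + bucketIndex(spread(b), 16)); // prints "1 2"
    }
}
```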

Hash code implementations shouldn't be a mystery. You can look at the hashCode() source of any class you choose.

【Discussion】:

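You should add that hash collisions are handled by a linked list of the entries that share the same hash code.
IIRC, HashMap is now sized as a power of two, but it didn't use to be. Hashtable traditionally started with a prime number. A prime size is (only) useful if your hash code is poor (and you don't rehash).
Tom -- yes, that's true, and in fact I recall this carried on into the pre-1.4 versions of HashMap. But as you say, many would argue that if you find you need a prime number of buckets, that's basically God's way of telling you that your hash function isn't good enough.
Michael -- I'll add something to mention this. N.B.: it is actually having the same bucket index that leads to chaining. If hashCode() of two objects returns the same value, then one will overwrite the other.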

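【Solution 3】:

The HASHING algorithm is the algorithm used to determine the hash code of an item in a HashTable.

The HASHTABLE algorithm (which I think is what this person is asking about) is the algorithm a HashTable uses to organise its elements given their hash codes.

Java happens to use a chained hash table algorithm.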
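For readers unfamiliar with the term, a chained table keeps a list of entries per bucket and walks that list on lookup. The following is a minimal sketch of the idea only, not the JDK implementation: `ChainedTableSketch` is a made-up name, the bucket count is fixed, there is no resizing, and null keys are not supported.

```java
import java.util.LinkedList;

final class ChainedTableSketch<K, V> {
    private static final class Entry<K, V> {
        final K key;
        V value;

        Entry(K key, V value) {
            this.key = key;
            this.value = value;
        }
    }

    private final LinkedList<Entry<K, V>>[] buckets;
    private int size;

    @SuppressWarnings("unchecked")
    ChainedTableSketch(int bucketCount) {
        buckets = new LinkedList[bucketCount];
    }

    // Clear the sign bit so the remainder is never negative.
    private int indexFor(Object key) {
        return (key.hashCode() & 0x7FFFFFFF) % buckets.length;
    }

    // Returns the previous value for the key, or null if there was none.
    V put(K key, V value) {
        int i = indexFor(key);
        if (buckets[i] == null) {
            buckets[i] = new LinkedList<>();
        }
        for (Entry<K, V> e : buckets[i]) {
            if (e.key.equals(key)) {
                V old = e.value;
                e.value = value;
                return old;
            }
        }
        buckets[i].add(new Entry<>(key, value)); // collision: the chain grows
        size++;
        return null;
    }

    V get(Object key) {
        LinkedList<Entry<K, V>> chain = buckets[indexFor(key)];
        if (chain == null) {
            return null;
        }
        for (Entry<K, V> e : chain) {
            if (e.key.equals(key)) {
                return e.value;
            }
        }
        return null;
    }

    int size() {
        return size;
    }
}
```

Constructing it with a single bucket forces every key into the same chain, which is an easy way to see that lookups still succeed when everything collides.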

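【Discussion】:

That holds for HashMap and Hashtable. IdentityHashMap (and ThreadLocal) use a probing algorithm. (Although I don't think any of this is specified.)

【Solution 4】:

While looking for the same answer myself, I found this in .NET's reference source at http://referencesource.microsoft.com.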

     /*
      Implementation Notes:
      The generic Dictionary was copied from Hashtable's source - any bug 
      fixes here probably need to be made to the generic Dictionary as well.
      This Hashtable uses double hashing.  There are hashsize buckets in the 
      table, and each bucket can contain 0 or 1 element.  We use a bit to mark
      whether there's been a collision when we inserted multiple elements
      (ie, an inserted item was hashed at least a second time and we probed 
      this bucket, but it was already in use).  Using the collision bit, we
      can terminate lookups & removes for elements that aren't in the hash
      table more quickly.  We steal the most significant bit from the hash code
      to store the collision bit.

      Our hash function is of the following form:

      h(key, n) = h1(key) + n*h2(key)

      where n is the number of times we've hit a collided bucket and rehashed
      (on this particular lookup).  Here are our hash functions:

      h1(key) = GetHash(key);  // default implementation calls key.GetHashCode();
      h2(key) = 1 + (((h1(key) >> 5) + 1) % (hashsize - 1));

      The h1 can return any number.  h2 must return a number between 1 and
      hashsize - 1 that is relatively prime to hashsize (not a problem if 
      hashsize is prime).  (Knuth's Art of Computer Programming, Vol. 3, p. 528-9)
      If this is true, then we are guaranteed to visit every bucket in exactly
      hashsize probes, since the least common multiple of hashsize and h2(key)
      will be hashsize * h2(key).  (This is the first number where adding h2 to
      h1 mod hashsize will be 0 and we will search the same bucket twice).

      We previously used a different h2(key, n) that was not constant.  That is a 
      horrifically bad idea, unless you can prove that series will never produce
      any identical numbers that overlap when you mod them by hashsize, for all
      subranges from i to i+hashsize, for all i.  It's not worth investigating,
      since there was no clear benefit from using that hash function, and it was
      broken.

      For efficiency reasons, we've implemented this by storing h1 and h2 in a 
      temporary, and setting a variable called seed equal to h1.  We do a probe,
      and if we collided, we simply add h2 to seed each time through the loop.

      A good test for h2() is to subclass Hashtable, provide your own implementation
      of GetHash() that returns a constant, then add many items to the hash table.
      Make sure Count equals the number of items you inserted.

      Note that when we remove an item from the hash table, we set the key
      equal to buckets, if there was a collision in this bucket.  Otherwise
      we'd either wipe out the collision bit, or we'd still have an item in
      the hash table.

       -- 
    */
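The probe sequence described in that comment can be sketched as follows. This is an illustration in Java, not the .NET source: `DoubleHashProbeSketch` and its method names are made up, the formula for `h2` is taken from the quoted comment, and clearing the sign bit of the raw hash is an assumption based on the comment's remark about stealing the most significant bit.

```java
final class DoubleHashProbeSketch {
    // h2(key) = 1 + (((h1(key) >> 5) + 1) % (hashsize - 1)), per the quoted comment.
    static int increment(int h1, int hashSize) {
        return 1 + (((h1 >> 5) + 1) % (hashSize - 1));
    }

    // Bucket indices in the order they would be probed: (h1 + n * h2) mod hashSize.
    static int[] probeSequence(int rawHash, int hashSize) {
        int h1 = rawHash & 0x7FFFFFFF; // assumption: drop the sign bit
        int h2 = increment(h1, hashSize);
        int[] sequence = new int[hashSize];
        long seed = h1; // long, so repeated addition cannot overflow
        for (int n = 0; n < hashSize; n++) {
            sequence[n] = (int) (seed % hashSize);
            seed += h2; // "we simply add h2 to seed each time through the loop"
        }
        return sequence;
    }

    public static void main(String[] args) {
        // With a prime table size, every bucket is visited exactly once.
        for (int bucket : probeSequence(42, 11)) {
            System.out.print(bucket + " ");
        }
        System.out.println();
    }
}
```

For a raw hash of 42 and 11 buckets, `h2` is 3 and the probes run 9, 1, 4, 7, 10, 2, 5, 8, 0, 3, 6, which covers all 11 buckets, as the comment's relatively-prime argument predicts.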

【Discussion】:

This explains the internals in detail, so there is no need to guess.
The specific link is here: referencesource.microsoft.com/#mscorlib/system/collections/…

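【Solution 5】:

Anything in .NET that claims to be a HashTable or something similar does not implement its own hashing algorithm: it always calls the GetHashCode() method of the object being hashed.

There is a lot of confusion about what this method does or is supposed to do, especially when it comes to user-defined or otherwise custom classes that override the base Object implementation.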

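【Discussion】:

【Solution 6】:

For .NET, you can use Reflector to look at the various algorithms. The generic and non-generic hash tables are different, and of course each class defines its own hash code formula.

【Discussion】: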

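【Solution 7】:

The .NET Dictionary<TKey, TValue> class uses an IEqualityComparer<TKey> to compute hash codes for keys and to compare keys during hash lookups. If you don't supply an IEqualityComparer<TKey> when constructing the Dictionary<TKey, TValue> instance (it is an optional constructor parameter), it creates a default one for you, which by default uses the object.GetHashCode and object.Equals methods.

As for how the standard GetHashCode implementation works, I'm not sure it is documented. For a specific type, you can read the method's source in Reflector, or try checking the Rotor source to see whether it is there.

【Discussion】: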