After reading the two articles below:
let's recall the concept of a hash table, how hash functions are constructed, and what determines lookup efficiency.
Concepts
With sequential search, binary search, binary search trees, and B-tree search, lookup efficiency depends on the number of comparisons performed during the search. The ideal case is to locate the target element directly, without any comparisons at all. Such direct location is a mapping from a given key to the record's storage address. This mapping is generally called a hash function, and a table built on this idea is called a hash table.
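To make the contrast concrete, here is a small illustrative sketch (in Python rather than C#, and with made-up data): a sequential search pays one comparison per record visited, while a hash-style lookup computes the storage position directly from the key.

```python
# Sequential search: cost grows with the number of key comparisons.
def sequential_search(records, key):
    comparisons = 0
    for k, v in records:
        comparisons += 1
        if k == key:
            return v, comparisons
    return None, comparisons

# Hash-style lookup: compute the storage position directly from the key.
# The mapping key -> position is exactly the hash function.
def hash_lookup(table, size, key):
    index = hash(key) % size
    return table[index]

records = [(k, k * k) for k in range(1000)]
value, comparisons = sequential_search(records, 999)
print(comparisons)  # 1000 comparisons to reach the last record

table = [None] * 1024
table[hash(999) % 1024] = 999 * 999
print(hash_lookup(table, 1024, 999))  # located with no key comparisons at all
```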
What makes a good hash function? Simplicity and uniformity. Simplicity means the function is cheap to compute, so hashing is fast. Uniformity means the hash values are evenly distributed over the table, so collisions are rare.
Common methods for constructing a hash function include: direct addressing, digit analysis, mid-square, division (modulo), and random number methods. (See Data Structures by Yan Weimin.)
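As a rough sketch of two of these methods (Python, with arbitrarily chosen parameters): the division method takes the key modulo a prime no larger than the table size, and the mid-square method squares the key and extracts the middle digits.

```python
# Division (modulo) method: h(key) = key % p, where p is a prime <= table size.
def division_hash(key, p=13):
    return key % p

# Mid-square method: square the key and take the middle digits of the result.
def mid_square_hash(key, digits=2):
    squared = str(key * key)
    mid = len(squared) // 2
    start = max(mid - digits // 2, 0)
    return int(squared[start:start + digits])

print(division_hash(1234))    # 1234 % 13 = 12
print(mid_square_hash(1234))  # 1234^2 = 1522756 -> middle digits "22"
```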
Because a hash function is a compressive mapping (many possible keys are squeezed into fewer addresses), collisions are unavoidable. A hash table design must therefore also include a strategy for resolving collisions.
Common collision-resolution methods include: open addressing, rehashing, separate chaining, and a common overflow area. (See Data Structures by Yan Weimin.)
C#'s Dictionary
So what hash algorithm does C#'s Dictionary use? Borrowing the code snippet from Lao Zhao's article again, here is the comment from the Hashtable source:
/*
Implementation Notes:
The generic Dictionary was copied from Hashtable's source - any bug
fixes here probably need to be made to the generic Dictionary as well.
This Hashtable uses double hashing. There are hashsize buckets in the
table, and each bucket can contain 0 or 1 element. We use a bit to mark
whether there's been a collision when we inserted multiple elements
(ie, an inserted item was hashed at least a second time and we probed
this bucket, but it was already in use). Using the collision bit, we
can terminate lookups & removes for elements that aren't in the hash
table more quickly. We steal the most significant bit from the hash code
to store the collision bit.
Our hash function is of the following form:
h(key, n) = h1(key) + n*h2(key)
where n is the number of times we've hit a collided bucket and rehashed
(on this particular lookup). Here are our hash functions:
h1(key) = GetHash(key); // default implementation calls key.GetHashCode();
h2(key) = 1 + (((h1(key) >> 5) + 1) % (hashsize - 1));
The h1 can return any number. h2 must return a number between 1 and
hashsize - 1 that is relatively prime to hashsize (not a problem if
hashsize is prime). (Knuth's Art of Computer Programming, Vol. 3, p. 528-9)
If this is true, then we are guaranteed to visit every bucket in exactly
hashsize probes, since the least common multiple of hashsize and h2(key)
will be hashsize * h2(key). (This is the first number where adding h2 to
h1 mod hashsize will be 0 and we will search the same bucket twice).
We previously used a different h2(key, n) that was not constant. That is a
horrifically bad idea, unless you can prove that series will never produce
any identical numbers that overlap when you mod them by hashsize, for all
subranges from i to i+hashsize, for all i. It's not worth investigating,
since there was no clear benefit from using that hash function, and it was
broken.
For efficiency reasons, we've implemented this by storing h1 and h2 in a
temporary, and setting a variable called seed equal to h1. We do a probe,
and if we collided, we simply add h2 to seed each time through the loop.
A good test for h2() is to subclass Hashtable, provide your own implementation
of GetHash() that returns a constant, then add many items to the hash table.
Make sure Count equals the number of items you inserted.
Note that when we remove an item from the hash table, we set the key
equal to buckets, if there was a collision in this bucket. Otherwise
we'd either wipe out the collision bit, or we'd still have an item in
the hash table.
--
*/
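The probe sequence the comment describes, h(key, n) = h1(key) + n*h2(key), can be sketched as follows (a simplified Python illustration of the idea, not the actual Dictionary implementation: no collision bit, and hashsize is assumed prime):

```python
# h2 mirrors the formula from the comment above:
# h2(key) = 1 + (((h1(key) >> 5) + 1) % (hashsize - 1))
def h2(h1_val, hashsize):
    return 1 + (((h1_val >> 5) + 1) % (hashsize - 1))

def probe_sequence(key, hashsize):
    h1_val = hash(key) & 0x7FFFFFFF  # keep it non-negative, as if the MSB were stolen
    step = h2(h1_val, hashsize)
    # h(key, n) = h1(key) + n * h2(key), taken mod hashsize
    return [(h1_val + n * step) % hashsize for n in range(hashsize)]

# Because h2 returns a value in [1, hashsize - 1] and hashsize is prime,
# the step is relatively prime to hashsize, so the sequence visits every
# bucket exactly once before it would repeat.
seq = probe_sequence(42, 11)
print(sorted(seq))  # all 11 buckets appear exactly once
```

This is exactly why the comment insists that h2 be relatively prime to hashsize: with a prime table size, any step in [1, hashsize - 1] qualifies, and a full cycle of hashsize probes is guaranteed to cover the whole table.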