C++: How to count the number of collisions while using a hash function? (Answers)

【Question title】: C++ How to count number of collisions while using a hash function?
【Posted】: 2017-04-09 15:52:04
【Question】:

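I was assigned this lab, in which I need to create a hash function and count the number of collisions that occur when hashing a file of up to 30,000 elements. Here is my code so far: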

#include <iostream>
#include <fstream>
#include <string>
using namespace std;

long hashcode(string s){
  long seed = 31; 
  long hash = 0;
  for(int i = 0; i < s.length(); i++){
    hash = (hash * seed) + s[i];
  }
  return hash % 10007;
};

int main(int argc, char* argv[]){
  int count = 0;
  int collisions = 0;
  fstream input(argv[1]);
  string x;
  int array[30000];

  //File stream
  while(!input.eof()){
    input>>x;
    array[count] = hashcode(x);
    count++;
    for(int i = 0; i<count; i++){
        if(array[i]==hashcode(x)){
            collisions++;
        }
    }
  }
  cout<<"Total Input is " <<count-1<<endl;
  cout<<"Collision # is "<<collisions<<endl;
}

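I'm just not sure how to count the collisions. I tried storing every hash value in an array and then searching that array, but that produced about 12,000 collisions when there were only 10,000 elements. Any advice on how to count the collisions, or on whether my hash function could be improved, would be much appreciated. Thanks.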

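【Question discussion】:

while (!eof()) is wrong.
@chris This is code my professor pre-wrote for us.
See stackoverflow.com/questions/8317508/hash-function-for-a-string for a hash function for strings. Hashes are normally used to index into a hash table, so your logic is a bit odd.
@RichardChambers That post is one of the ones I used to build my hash function. My professor doesn't want them put into a hash table; he only wants them hashed and the collisions counted.
That doesn't change the fact that it's unreliable. If the input fails, it loops forever. Also consider the case where the input ends with whitespace. As you can see, that leads to bad behavior as well.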
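
The objection to `while (!input.eof())` can be reproduced without a file. The following is a minimal sketch that uses a `std::istringstream` in place of the input file; the helper names `count_with_eof_loop` and `count_with_extraction_loop` are made up for illustration and do not appear in the question.

```cpp
#include <sstream>
#include <string>

// Counts loop iterations using the eof()-controlled loop from the question.
int count_with_eof_loop(const std::string& text) {
    std::istringstream input(text);
    std::string x;
    int count = 0;
    while (!input.eof()) {
        input >> x;   // may fail; x then keeps its previous value
        ++count;      // ...but the iteration is counted anyway
    }
    return count;
}

// Counts only the words that were actually extracted.
int count_with_extraction_loop(const std::string& text) {
    std::istringstream input(text);
    std::string x;
    int count = 0;
    while (input >> x) {  // the loop body runs only after a successful read
        ++count;
    }
    return count;
}
```

For the input `"a b c\n"` the first function reports 4 iterations, because end of file is only detected by the failed fourth read, while the second reports 3. In the question's program that extra iteration hashes the last word a second time.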

Tags: c++ hash

【Solution 1】:

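The problem is that you are over-counting collisions. (Suppose your list contains four identical elements and nothing else, then step through your algorithm and see how many collisions it counts.)

Instead, create a set of hash codes, and each time you compute a hash code, check whether it is already in the set. If it is, increment the collision total. If it is not, add it to the set.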
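
To make the over-counting concrete, the sketch below replays both strategies on precomputed hash values. `count_like_question` mirrors the nested loop from the question, and `count_with_set` follows the set-based idea described here. Both function names, the `std::vector<long>` input, and the choice of `std::set` are illustrative assumptions, not part of the original answer.

```cpp
#include <cstddef>
#include <set>
#include <vector>

// Mirrors the question's loop: the new element is stored first, so it
// matches itself, and every earlier equal element is counted again.
int count_like_question(const std::vector<long>& hashes) {
    std::vector<long> seen;
    int collisions = 0;
    for (long h : hashes) {
        seen.push_back(h);
        for (std::size_t i = 0; i < seen.size(); ++i) {
            if (seen[i] == h) {
                ++collisions;
            }
        }
    }
    return collisions;
}

// Set-based count: a hash is a collision only if it is already in the set.
int count_with_set(const std::vector<long>& hashes) {
    std::set<long> seen;
    int collisions = 0;
    for (long h : hashes) {
        if (seen.count(h) != 0) {
            ++collisions;
        } else {
            seen.insert(h);
        }
    }
    return collisions;
}
```

For four identical hashes such as `{7, 7, 7, 7}`, the first function returns 10 (1 + 2 + 3 + 4), while the set-based version returns 3.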

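Edit:

As a quick patch to your algorithm, I did the following: increment count after the loop, and break out of the for loop once a collision is found. This still isn't very efficient, since we loop over all the earlier results (a set data structure would be faster), but it should at least be correct.

I also adjusted it so that hashcode(x) is not computed over and over: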

int main(int argc, char* argv[]){
  int count = 0;
  int collisions = 0;
  fstream input(argv[1]);
  string x;
  int array[30000];

  // File stream: loop on the extraction itself, so a failed read at end of
  // file is not processed as an extra (repeated) word.
  while(input>>x){
    array[count] = hashcode(x);
    for(int i = 0; i<count; i++){
        if(array[i]==array[count]){
            collisions++;
            // Once we've found one collision, we don't want to count all of them.
            break;
        }
    }
    // We don't want to check our hashcode against the value we just added
    // so we should only increment count here.
    count++;
  }
  cout<<"Total Input is " <<count<<endl;
  cout<<"Collision # is "<<collisions<<endl;
}

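【Discussion】:

If I were to add them all to an array, how would I search for each particular hash code after the while loop?
I used the term set rather than array (they mean different things in programming), and I had intended the lookup to be done inside the while loop. Ignore all that for now: I have edited my solution to patch your attempt rather than offer the best solution.
I tried using break and got 10,001 collisions, but then I saw your edit and changed it so that count is incremented after the loop, and now I only get 2,093 collisions, which is much better. My professor's sample output shows about 4,700 collisions. I see that my hash function sometimes produces negative numbers; is that allowed? I think I'll look into the set data structure, since the program does take quite a while to run this way. Thanks a lot, though.
If you want to speed up the program in its current form, avoid recomputing hashcode(x) inside the inner loop. You can store it in a variable and compute it only once.
Wow, thanks again, that made it 10 times faster.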

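【Solution 2】:

Answer added in the interest of education. This is probably your professor's next lesson.

Almost certainly the most efficient way to detect hash collisions is to use a hash set (a.k.a. unordered_set):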

#include <iostream>
#include <unordered_set>
#include <fstream>
#include <string>

// your hash algorithm
long hashcode(std::string const &s) {
    long seed = 31;
    long hash = 0;
    for (int i = 0; i < s.length(); i++) {
        hash = (hash * seed) + s[i];
    }
    return hash % 10007;
};

int main(int argc, char **argv) {
    std::ifstream is{argv[1]};
    std::unordered_set<long> seen_before;
    seen_before.reserve(10007);
    std::string buffer;
    int collisions = 0, count = 0;
    while (is >> buffer) {
        ++count;
        auto hash = hashcode(buffer);
        auto i = seen_before.find(hash);
        if (i == seen_before.end()) {
            seen_before.emplace_hint(i, hash);
        }
        else {
            ++collisions;
        }
    }
    std::cout << "Total Input is " << count << std::endl;
    std::cout << "Collision # is " << collisions << std::endl;
}

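【Discussion】:

It would be faster to fill an array with the hash results, sort it, and then count the duplicate elements by walking the sorted array.
@Andrey Why? That doesn't seem any faster than this.
By the way @Richard, could we just use emplace instead of emplace_hint and look at the bool in the return value to decide whether to increment the counter? (cplusplus.com/reference/unordered_set/unordered_set/emplace) (Also, I'd like to understand why you did it this way. I know very little C++, so if my suggestion has drawbacks I'd love to hear about them.)
@AndreyTurkin Filling the array is O(N), sorting is O(N log N), and iterating over the sorted array is O(N). Insertion and lookup in an unordered_set are constant time, and emplace_hint is constant time in the worst case.
@muzzlator Yes, that would work. I just don't like the semantics of insert.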
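
Both alternatives raised in this thread can be sketched briefly. `count_with_emplace` uses the `bool` in the pair returned by `std::unordered_set::emplace`, as suggested above, and `count_by_sorting` sorts the hash values and counts every element equal to its predecessor. The function names and the `std::vector<long>` input of precomputed hashes are assumptions made for illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <unordered_set>
#include <vector>

// emplace returns {iterator, bool}; the bool is false when the value
// was already present, which is exactly the collision case.
int count_with_emplace(const std::vector<long>& hashes) {
    std::unordered_set<long> seen;
    int collisions = 0;
    for (long h : hashes) {
        if (!seen.emplace(h).second) {
            ++collisions;
        }
    }
    return collisions;
}

// Sort a copy, then count each element that repeats the one before it.
int count_by_sorting(std::vector<long> hashes) {
    std::sort(hashes.begin(), hashes.end());
    int collisions = 0;
    for (std::size_t i = 1; i < hashes.size(); ++i) {
        if (hashes[i] == hashes[i - 1]) {
            ++collisions;
        }
    }
    return collisions;
}
```

The two functions return the same count for any input, since each counts every occurrence of a hash value beyond its first; they differ only in cost, as discussed in the comments above.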

【Solution 3】:

For an explanation of hash tables, see How does a hash table work?

#include <iostream>
#include <fstream>
#include <string>
using namespace std;

// Generate a hash code that is in the range of our hash table.
// The range we are using is zero to 10,006 (a table of 10,007 slots) so that
// our table is large enough and the prime number size reduces the probability
// of collisions from different strings hashing to the same value.
unsigned long hashcode(string s){
    unsigned long seed = 31;
    unsigned long hash = 0;
    for (int i = 0; i < s.length(); i++){
        hash = (hash * seed) + s[i];
    }
    // we want to generate a hash code that is the size of our table.
    // so we mod the calculated hash to ensure that it is in the proper range
    // of our hash table entries. 10007 is a prime number which provides
    // better characteristics than a non-prime number table size.
    return hash % 10007; 
};

int main(int argc, char * argv[]){
    int count = 0;
    int collisions = 0;
    fstream input(argv[1]);
    string x;
    int array[30000] = { 0 };

    //File stream
    while (input >> x){ // get the next string to hash; stop when a read fails
        count++;        // count the number of strings hashed.
        // hash the string and use the hash as an index into our hash table.
        // the hash table is only used to keep a count of how many times a particular
        // hash has been generated. So the table entries are ints that start with zero.
        // If the value is greater than zero then we have a collision.
        // So we use postfix increment to check the existing value while incrementing
        // the hash table entry.
        if ((array[hashcode(x)]++) > 0)
            collisions++;
    }
    cout << "Total Input is " << count << endl;
    cout << "Collision # is " << collisions << endl;

    return 0;
}

【Discussion】: