Can hashing to calculate frequencies be improved? Answers

【Question title】: Hashing to Calculate Frequencies can be improved?
【Posted】: 2012-04-21 17:50:53
【Question】:

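I'm currently building a hash table to calculate word frequencies, and I chose it based on the running time of the data structure: O(1) insertion, O(n) worst-case lookup, and so on.

I asked a few people about the difference between std::map and a hash table, and the answer I got was:

"std::map stores its elements in a binary tree, which gives O(log n), whereas with a hash table that you implement yourself it will be O(n)."

So I decided to implement the hash table as an array of linked lists (separate chaining). In the code below each node holds two values: the key (the word) and the value (its frequency). It works like this: when a word is added and the bucket at its index is empty, the word is inserted directly as the first element of the linked list with a frequency of 0. If the word is already in the list (finding it unfortunately takes O(n) time), its frequency is incremented by 1. If it is not found, it is simply added to the beginning of the list.

I know there are a lot of flaws in the implementation, so I would like to ask the experienced people here: how can this implementation be improved to calculate frequencies efficiently?

The code I've written so far: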

#include <iostream>
#include <string>
#include <stdio.h>

using namespace std;

struct Node {
    string word;
    int frequency;
    Node *next;
};

class linkedList
{
private:
    friend class hashTable;
    Node *firstPtr;
    Node *lastPtr;
    int size;
public:
    linkedList()
    {
        firstPtr=lastPtr=NULL;
        size=0;
    }
    void insert(string word,int frequency)
    {
        Node* newNode=new Node;
        newNode->word=word;
        newNode->frequency=frequency;
        newNode->next=NULL; //terminate the list; otherwise print() walks an uninitialized pointer

        if(firstPtr==NULL)
            firstPtr=lastPtr=newNode;
        else {
            newNode->next=firstPtr;
            firstPtr=newNode;
        }

        size++;
    }
    int sizeOfList()
    {
        return size;
    }
    void print()
    {
        if(firstPtr!=NULL)
        {
            Node *temp=firstPtr;
            while(temp!=NULL)
            {
                cout<<temp->word<<" "<<temp->frequency<<endl;
                temp=temp->next;
            }
        }
        else
            printf("%s","List is empty");
    }
};

class hashTable
{
private:
    linkedList* arr;
    int index,sizeOfTable;
public:
    hashTable(int size) //Forced initializer
    {
        sizeOfTable=size;
        arr=new linkedList[sizeOfTable];
    }
    int hash(string key)
    {
        int hashVal=0;

        for(int i=0;i<key.length();i++)
            hashVal=37*hashVal+key[i];

        hashVal=hashVal%sizeOfTable;
        if(hashVal<0)
            hashVal+=sizeOfTable;

        return hashVal;
    }
    void insert(string key)
    {
        index=hash(key);
        if(arr[index].sizeOfList()<1)
            arr[index].insert(key, 0);
        else {
            //Search for the index throughout the linked list.
            //If found, increment its value +1
            //else if not found, add the node to the beginning
        }
    }



};
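
For reference, the search-then-increment step that the comments in `insert` describe could look roughly like the sketch below. It is not the asker's code: `ChainNode`, `ChainedCounter`, and `frequency()` are hypothetical names, it uses C++11, and it starts a new word at a count of 1 (rather than 0) so the stored value equals the number of occurrences. It assumes at least one bucket.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical node type for one chain entry (not from the question).
struct ChainNode {
    std::string word;
    int frequency;
    ChainNode* next;
};

// Hypothetical separate-chaining counter; assumes buckets >= 1.
class ChainedCounter {
public:
    explicit ChainedCounter(std::size_t buckets) : heads_(buckets, nullptr) {}

    ~ChainedCounter() {
        // Free every node of every chain.
        for (ChainNode* head : heads_) {
            while (head != nullptr) {
                ChainNode* next = head->next;
                delete head;
                head = next;
            }
        }
    }

    ChainedCounter(const ChainedCounter&) = delete;
    ChainedCounter& operator=(const ChainedCounter&) = delete;

    // Count one occurrence of key: walk the chain, increment on a match,
    // otherwise prepend a new node with a count of 1.
    void insert(const std::string& key) {
        ChainNode*& head = heads_[hash(key)];
        for (ChainNode* n = head; n != nullptr; n = n->next) {
            if (n->word == key) {
                ++n->frequency;
                return;
            }
        }
        head = new ChainNode{key, 1, head};
    }

    // Return the stored count, or 0 if the word was never inserted.
    int frequency(const std::string& key) const {
        for (const ChainNode* n = heads_[hash(key)]; n != nullptr; n = n->next) {
            if (n->word == key) return n->frequency;
        }
        return 0;
    }

private:
    // Same multiply-by-37 scheme as the question, but on an unsigned type
    // so overflow wraps around instead of going negative.
    std::size_t hash(const std::string& key) const {
        std::size_t h = 0;
        for (unsigned char c : key) h = 37 * h + c;
        return h % heads_.size();
    }

    std::vector<ChainNode*> heads_;
};
```

Lookup cost is proportional to the length of one chain, so it stays close to constant only while the number of buckets is comparable to the number of distinct words.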

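【Comments】:

#include <tr1/unordered_map> if you're on C++03, #include <unordered_map> if you're on C++11...
@dionadar Does 'unordered_map' support collisions?
If you need a multimap rather than a map, consider using unordered_multimap (declared in <unordered_map>) :)
An array of linked lists? Too slow. Use an array of arrays. And yes, std::unordered_map does support collisions.
@KonradRudolph Thanks a lot for your enlightening answer. But when it comes to finding the k most frequent words, is 'unordered_map' a reasonable choice? A trie has been suggested before, but I think it is too complicated to implement for this kind of problem.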

Tags: c++ performance algorithm data-structures hashtable

【Solution 1】:

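Do you care about the worst case? If not, use a std::unordered_map (it handles collisions, and you don't need a multimap) or a trie/critbit tree (depending on the keys, it may be more compact than a hash table, which can lead to better cache behavior). If you do, use a std::set (or a std::map, if you want to store the count with the key) or a trie.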
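
As a minimal sketch of the std::unordered_map option, assuming C++11 and words that are already split into a vector (`countWords` is a hypothetical helper name, not part of any library):

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical helper: count how often each word occurs.
// operator[] value-initializes a missing entry to 0 before the increment.
std::unordered_map<std::string, int> countWords(const std::vector<std::string>& words) {
    std::unordered_map<std::string, int> freq;
    for (const std::string& w : words) {
        ++freq[w];
    }
    return freq;
}
```

Collisions are resolved inside the container, so each distinct word ends up with exactly one entry.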

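If you want, for example, online top-k statistics, keep a priority queue in addition to the dictionary. Each dictionary value holds the number of occurrences and a flag saying whether the word is currently in the queue. The queue duplicates the top k frequency/word pairs, but keyed by frequency. Whenever you scan another word, check whether it is both (1) not already in the queue and (2) more frequent than the least element in the queue. If so, extract the least queue element and insert the word you just scanned.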
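
One way to sketch this scheme in C++11 is shown below. It is an illustration under stated assumptions, not the answerer's code: `TopKCounter`, `add`, and `top` are hypothetical names, and an ordered std::set of (count, word) pairs stands in for the priority queue, because a word that is already queued needs its key updated when its count grows, which std::priority_queue does not support. Ties with the least queued element do not cause a swap.

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical online top-k counter.
class TopKCounter {
    // Dictionary value: occurrence count plus "is this word in the queue?".
    struct Entry {
        int count = 0;
        bool inQueue = false;
    };

public:
    explicit TopKCounter(std::size_t k) : k_(k) {}

    // Scan one more occurrence of word.
    void add(const std::string& word) {
        Entry& e = dict_[word];
        if (e.inQueue) {
            // Already queued: re-key the pair with the new count.
            queue_.erase(std::make_pair(e.count, word));
            ++e.count;
            queue_.insert(std::make_pair(e.count, word));
            return;
        }
        ++e.count;
        if (queue_.size() < k_) {
            // Queue not full yet: just add the word.
            queue_.insert(std::make_pair(e.count, word));
            e.inQueue = true;
        } else if (!queue_.empty() && e.count > queue_.begin()->first) {
            // More frequent than the least queued element: swap them.
            const std::pair<int, std::string> least = *queue_.begin();
            queue_.erase(queue_.begin());
            dict_.at(least.second).inQueue = false;
            queue_.insert(std::make_pair(e.count, word));
            e.inQueue = true;
        }
    }

    // Current top-k as (count, word) pairs, most frequent first.
    std::vector<std::pair<int, std::string>> top() const {
        return std::vector<std::pair<int, std::string>>(queue_.rbegin(), queue_.rend());
    }

private:
    std::size_t k_;
    std::unordered_map<std::string, Entry> dict_;
    std::set<std::pair<int, std::string>> queue_;  // begin() is the least element
};
```

Each scanned word costs one average O(1) dictionary access plus O(log k) work on the queue, independent of the number of distinct words.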

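You can implement your own data structures if you like, but the programmers who work on STL implementations tend to be pretty sharp. I would first make sure that this is really where the bottleneck is.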

【Comments】:

【Solution 2】:

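1- Lookup in std::map and std::set takes O(log n) time. For std::unordered_map and std::unordered_set, lookup is O(1) on average and O(n) in the worst case. However, the constant cost of hashing can be large, and for small n it can exceed log(n). I always take this fact into account.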

2- If you want to use std::unordered_map, you need to make sure that std::hash is defined for your key type. If it is not, you have to define it yourself.
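
std::string already has a std::hash specialization, so plain word counting needs nothing extra. For a user-defined key, a sketch of such a definition might look like this; `WordPair` is a hypothetical type invented for the example, and the way the two hashes are mixed is just one common choice, not a requirement:

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>

// Hypothetical key type: a pair of adjacent words.
struct WordPair {
    std::string first;
    std::string second;
};

// unordered_map also needs equality to tell colliding keys apart.
inline bool operator==(const WordPair& a, const WordPair& b) {
    return a.first == b.first && a.second == b.second;
}

namespace std {
// Specializing std::hash for a user-defined type is permitted.
template <>
struct hash<WordPair> {
    std::size_t operator()(const WordPair& p) const {
        const std::size_t h1 = std::hash<std::string>()(p.first);
        const std::size_t h2 = std::hash<std::string>()(p.second);
        // Mix the two hashes so that (a, b) and (b, a) usually differ.
        return h1 ^ (h2 + 0x9e3779b9 + (h1 << 6) + (h1 >> 2));
    }
};
}  // namespace std
```

With this in place, `std::unordered_map<WordPair, int>` can be used the same way as a map keyed by std::string.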

【Comments】: