Finding Frequency of numbers in a given group of numbers: answers

【Question title】: Finding Frequency of numbers in a given group of numbers
【Posted】: 2010-09-13 19:39:39
【Question】:

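Suppose we have a vector/array in C++ and we want to find which of its N elements is repeated most often, and output that highest count. Which algorithm is best suited for this job?

Example: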

int a = { 2, 456, 34, 3456, 2, 435, 2, 456, 2}

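The output is 4 because 2 occurs 4 times, which is the highest number of occurrences of any element.

【Discussion】:

I am using an STL map to populate the frequencies and then sorting it with sort(map.begin(),map.end()). Are there any further speed gains to be had?
If the question is "which number", the answer should be 2, not 4 ;-).
This sounds like a homework question.
Speed is not a homework concern! If you think it through, it is more about competitions.
Shouldn't that be "int a[] = ..."?

Tags: c++ c algorithm puzzle frequency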

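【Solution 1】:

Sort the array, then make one quick pass to count each number. The complexity of this algorithm is O(N*logN).

Alternatively, create a hash table that uses the numbers as keys, and store a counter for each key. You can then count all elements in a single pass; however, the complexity of the algorithm now depends on the complexity of your hashing function.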
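The sort-then-count approach can be sketched as follows. This is a minimal illustration, not the answerer's code: the function name `most_frequent_sorted` and the choice to return {0, 0} for empty input are assumptions made for the example.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper: returns {value, count} of the most frequent element,
// or {0, 0} when the input is empty (an assumption made for this sketch).
// The vector is taken by value so the caller's data stays unsorted.
std::pair<int, std::size_t> most_frequent_sorted(std::vector<int> v)
{
    std::pair<int, std::size_t> best(0, 0);
    std::sort(v.begin(), v.end());                 // O(N log N)

    std::size_t i = 0;
    while (i < v.size()) {
        std::size_t j = i;
        while (j < v.size() && v[j] == v[i]) ++j;  // walk to the end of this run of equal values
        if (j - i > best.second)
            best = std::make_pair(v[i], j - i);    // longest run seen so far
        i = j;
    }
    return best;
}
```

Only two counters are live at any time (the current run and the best run), so the extra memory is whatever the sort itself needs. On ties, this sketch keeps the smallest value, because runs are visited in ascending order.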

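【Discussion】:

Yes, that is what I was thinking.
Uh, yes. It's 3 AM and I have a three-week-old baby, if that counts as an excuse. :-)
No need for excuses. After all, SO is a collaborative effort :)
Because once you have sorted, you don't need a separate counter for every number; you can keep just one counter for the current number and one for the number with the highest count so far.
OK, I was assuming the array is small enough; after all, it is in memory to begin with.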

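【Solution 2】:

Optimized for space:

Quicksort (for example), then iterate over the items, tracking only the largest count. O(N log N) at best.

Optimized for speed:

Iterate over all elements, keeping a separate count for each. This algorithm is always O(n).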
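A minimal sketch of the speed-oriented variant, assuming a hash map is used to hold the separate counts (the answer does not name a container). The name `most_frequent_hashed` is hypothetical, and the O(n) bound holds on average, not in the worst case.

```cpp
#include <cstddef>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical helper: counts every value in one pass and remembers the best
// count seen so far. Returns {value, count}, or {0, 0} for empty input.
std::pair<int, std::size_t> most_frequent_hashed(const std::vector<int>& v)
{
    std::unordered_map<int, std::size_t> counts;
    counts.reserve(v.size());                 // worst case: one counter per element
    std::pair<int, std::size_t> best(0, 0);

    for (int x : v) {
        std::size_t c = ++counts[x];          // operator[] value-initializes a new counter to 0
        if (c > best.second)
            best = std::make_pair(x, c);      // first value to reach a new maximum wins ties
    }
    return best;
}
```

Tracking the maximum while counting avoids a second pass over the map. The cost is memory: up to N counters live at once.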

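【Discussion】:

If you sort, you only need to keep the length of the longest run of a single number. If you don't sort, you have to keep counts for all numbers in an associative container.
If you track a count for each element, the worst case requires N counters, so your memory requirement nearly doubles. That is no big deal on a machine with 4GB of RAM. With 64K of memory shared with the operating system, however, you may prefer to sort.
@Franci Penov: The point is that the question says "best", and the answer depends on what "best" means.
Yes, I agree. That is why I offered two alternative solutions: sorting, or a hash table of counters. :-) I just wanted to point out the memory consumption drawback of the second algorithm. Memory matters too, not only speed.
Isn't the bigger problem with the "optimized for speed" version that you need an array as large as the largest possible number to keep it O(n)? Otherwise you need a tree at O(n*log n) or a hash at O(who knows)?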

【Solution 3】:

If you have the RAM and your values are not too large, use counting sort.

【Discussion】:

【Solution 4】:

A C++ implementation using the STL could be:

#include <iostream>
#include <algorithm>
#include <map>

// functor that remembers the (value, count) pair with the highest count
struct maxoccur
{
    // plain member names: identifiers starting with _M are reserved for the implementation
    int val;
    int rep;

    maxoccur()
    : val(0),
      rep(0)
    {}

    // takes the map's own value_type, so no temporary pair is created
    void operator()(const std::pair<const int,int> &e)
    {
        std::cout << "pair: " << e.first << " " << e.second << std::endl;
        if ( rep < e.second ) {
            val = e.first;
            rep = e.second;
        }
    }
};

int
main(int argc, char *argv[])
{
    int a[] = {2,456,34,3456,2,435,2,456,2};
    std::map<int,int> m;

    // load the map
    for(unsigned int i=0; i< sizeof(a)/sizeof(a[0]); i++)
        m [a[i]]++;

    // find the max occurrence...
    maxoccur ret = std::for_each(m.begin(), m.end(), maxoccur());
    std::cout << "value:" << ret.val << " max repetition:" << ret.rep <<  std::endl;

    return 0;
}

【Discussion】:

【Solution 5】:

A bit of pseudocode:

//split the string into an array first
array = strsplit(numbers) //PHP function name to split a string into its components
i=0
while( i < count(array))
 {
   if(isset(list[array[i]]))
    {
      //number seen before: increment its counter
      list[array[i]]['count'] = list[array[i]]['count'] + 1
    }
   else
    {
      //first occurrence: create the entry, keyed by the number itself
      list[array[i]]['count'] = 1
      list[array[i]]['number'] = array[i]
    }
   i=i+1
 }
usort(list) //usort is a PHP function that sorts an array by its values, not its keys (here: by 'count', descending); I'm assuming that you have something in C++ that does this
print list[0]['number'] //Should contain the most used number

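【Discussion】:

【Solution 6】:

The hashing algorithm (build count[i] = #occurrences(i) in basically linear time) is very practical, but in theory it is not strictly O(n), because hash collisions can occur along the way.

An interesting special case of this problem is the majority algorithm, where you want to find the element that appears in more than n/2 of the array entries, if any such element exists.

Here is a quick explanation and a more detailed explanation of how to do this in linear time, without any hashing tricks.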
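One well-known linear-time method for the majority special case is the Boyer-Moore majority vote. The sketch below assumes that is the kind of algorithm the linked explanations describe, which the surviving text does not confirm. The function name and the bool-plus-out-parameter signature are choices made for this example.

```cpp
#include <cstddef>
#include <vector>

// Boyer-Moore majority vote (hypothetical helper name).
// Returns true and stores the element in `out` if some element occurs in
// more than half of the entries; returns false otherwise.
bool majority_element(const std::vector<int>& v, int& out)
{
    int candidate = 0;
    std::size_t votes = 0;
    for (int x : v) {                    // pass 1: pick the only possible candidate
        if (votes == 0) { candidate = x; votes = 1; }
        else if (x == candidate) ++votes;
        else --votes;
    }

    std::size_t occurrences = 0;         // pass 2: check that it really is a majority
    for (int x : v)
        if (x == candidate) ++occurrences;

    if (occurrences * 2 > v.size()) { out = candidate; return true; }
    return false;
}
```

This uses O(1) extra space and no hashing, but it answers a narrower question than the original one. For the array in the question, 2 occurs 4 times out of 9, which is not a majority, so the function returns false.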

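【Discussion】:

【Solution 7】:

If the range of the elements is large compared with the number of elements, I would, as others have said, just sort and scan. That takes n*log n time and no extra space (perhaps an additional log n).

The problem with counting sort is that if the range of values is large, initializing the count array can take more time than the sort.

【Discussion】:

【Solution 8】:

Here is my complete, tested version, using std::tr1::unordered_map (std::unordered_map since C++11).

I make this approximately O(n). First it iterates over the n input values to insert/update the counts in the unordered_map, then it does a partial_sort_copy, which is O(n). 2*O(n) ~= O(n).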

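#include <unordered_map>
#include <vector>
#include <algorithm>
#include <iostream>
#include <iterator>  // std::iterator_traits
#include <utility>   // std::pair
#include <cassert>   // assert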

namespace {
// Only used in most_frequent but can't be a local class because of the member template
struct second_greater {
    // Need to compare two (slightly) different types of pairs
    template <typename PairA, typename PairB>
    bool operator() (const PairA& a, const PairB& b) const
        { return a.second > b.second; }
};
}

template <typename Iter>
std::pair<typename std::iterator_traits<Iter>::value_type, unsigned int>
most_frequent(Iter begin, Iter end)
{
    typedef typename std::iterator_traits<Iter>::value_type value_type;
    typedef std::pair<value_type, unsigned int> result_type;

    // std::unordered_map is the C++11 name; before C++11 it was std::tr1::unordered_map
    std::unordered_map<value_type, unsigned int> counts;

    for(; begin != end; ++begin)
        // This is safe because new entries in the map are defined to be initialized to 0 for
        // built-in numeric types - no need to initialize them first
        ++ counts[*begin];

    // Only need the top one at this point (could easily expand to top-n)
    std::vector<result_type> top(1);

    std::partial_sort_copy(counts.begin(), counts.end(),
                           top.begin(), top.end(), second_greater());

    return top.front();
}

int main(int argc, char* argv[])
{
    int a[] = { 2, 456, 34, 3456, 2, 435, 2, 456, 2 };

    std::pair<int, unsigned int> m = most_frequent(a, a + (sizeof(a) / sizeof(a[0])));

    std::cout << "most common = " << m.first << " (" << m.second << " instances)" << std::endl;
    assert(m.first == 2);
    assert(m.second == 4);

    return 0;
}

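【Discussion】:

【Solution 9】:

This runs in O(n), but the catch is that a large array may need another array of the same size.

for(i=0;i

mar=count[o]; index=o;

for(i=0;i

The output will then be: the element index occurs the maximum number of times in this array.

Here a[] is the data array in which we need to find the number with the maximum number of occurrences.

count[] holds the count of each element. Note: we already know the range of the data in the array. Say the data in the array ranges from 1 to 100; then we keep a count array of 100 elements to track them, and whenever a value occurs we increment the counter at that index by one.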
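The counting-array idea described here can be sketched as below. It is an illustration under stated assumptions, not a reconstruction of the truncated loops: the function name, the `lo`/`hi` parameters and the decision to skip out-of-range values are all hypothetical.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper. Assumes lo <= hi and that the data range [lo, hi]
// is known in advance (for example 1..100). Values outside the range are skipped.
// Returns {value, count}; for empty input this is {lo, 0}.
std::pair<int, std::size_t> most_frequent_in_range(const std::vector<int>& a,
                                                   int lo, int hi)
{
    // one counter per possible value, all starting at zero
    std::vector<std::size_t> count(static_cast<std::size_t>(hi - lo + 1), 0);

    for (int x : a)
        if (x >= lo && x <= hi)
            ++count[static_cast<std::size_t>(x - lo)];

    std::size_t index = 0;               // position of the largest counter so far
    for (std::size_t i = 1; i < count.size(); ++i)
        if (count[i] > count[index]) index = i;

    return std::make_pair(lo + static_cast<int>(index), count[index]);
}
```

The running time is O(n + range) and the extra memory is proportional to the range, not to n. That is why this only pays off when the range is small relative to the input.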

【Discussion】: