Efficient way to re-order a C++ map-based collection (answers)

【Question title】: Efficient way to re-order a C++ map-based collection
【Posted】: 2012-06-10 10:17:16
【Question description】:

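I've got a large-ish (>100K entries) collection mapping a user identifier (an int) to the count of different products that the user has bought (also an int). I need to re-organise the data as efficiently as possible to find out how many users have each number of products: for example, how many users have one product, how many have two products, and so on.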

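I have achieved this by inverting the original data from a std::map into a std::multimap (where the key and value are simply swapped). I can then pick out the number of users having N products using count(N). I also store the values uniquely in a std::set, so that I know exactly which values I am iterating over and in what order.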

The code looks like this:

#include <iostream>
#include <map>
#include <set>
#include <utility>

// uc is a std::map<int, int> containing the original
// mapping of user identifier to the count of different
// products that they've bought.
std::set<int> uniqueCounts;
std::multimap<int, int> cu; // This maps count to user.

for ( std::map<int, int>::const_iterator it = uc.begin();
        it != uc.end();  ++it )
{
    cu.insert( std::pair<int, int>( it->second, it->first ) );
    uniqueCounts.insert( it->second );
}

// Now write this out
for ( std::set<int>::const_iterator it = uniqueCounts.begin();
        it != uniqueCounts.end();  ++it )
{
    std::cout << "==> There are "
            << cu.count( *it ) << " users that have bought "
            << *it << " products(s)" << std::endl;
}

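I can't help feeling that this is not the most efficient way of doing it. Does anyone know of a clever way to do this?

My constraint is that I cannot use Boost or C++11.

Oh, and in case anyone is wondering, this is neither homework nor an interview question.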

【Question comments】:

Tags: c++ algorithm stl map

【Solution 1】:

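Assuming you know the maximum number of products that a single user could have bought, you might see better performance by just using a vector to store the results of the operation. As it stands, your code needs an allocation for pretty much every entry in the original map, which is unlikely to be the fastest option.

A vector would also cut down on map lookup overhead, gain the benefits of memory locality, and replace the call to count on the multimap (which is not a constant-time operation) with a constant-time lookup in the vector.

So you could do something like this: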

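// MAX_PRODUCTS_PER_USER is assumed to be a known upper bound on any count in uc.
// uniqueCounts[ n ] is the number of users who bought exactly n products,
// so the vector needs MAX_PRODUCTS_PER_USER + 1 slots (indices 0..MAX).
std::vector< int > uniqueCounts( MAX_PRODUCTS_PER_USER + 1 );

for ( std::map<int, int>::const_iterator it = uc.begin();
        it != uc.end();  ++it )
{
    uniqueCounts[ it->second ]++;
}

// Now write this out
for ( std::vector< int >::size_type i = 0; i < uniqueCounts.size(); ++i )
{
    std::cout << "==> There are "
            << uniqueCounts[ i ] << " users that have bought "
            << i << " products(s)" << std::endl;
}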

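Even if you don't know the maximum number of products, it seems like you could just guess at a maximum and adapt this code to grow the vector when required. Either way, it is sure to result in fewer allocations than your original example.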
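The "grow the vector when required" idea could look roughly like the sketch below. This is not the answerer's code: the function name countUsersByProductCount is made up for illustration, and the sketch assumes that every count is non-negative and small enough for a vector indexed by count to be reasonable. It sticks to C++03, since the question rules out C++11 and Boost.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

// Hypothetical helper (not from the original answer): builds a histogram
// where result[n] is the number of users who bought exactly n products.
// Assumes every count stored in uc is non-negative.
std::vector<int> countUsersByProductCount(const std::map<int, int> &uc)
{
    std::vector<int> histogram;
    for (std::map<int, int>::const_iterator it = uc.begin();
         it != uc.end(); ++it)
    {
        assert(it->second >= 0);
        const std::size_t n = static_cast<std::size_t>(it->second);
        if (n >= histogram.size())
        {
            // Grow on demand; the new slots are value-initialized to 0.
            histogram.resize(n + 1);
        }
        ++histogram[n];
    }
    return histogram;
}
```

If the repeated resizing ever turns out to matter, one extra pass over uc to find the largest count would let the vector be sized once up front.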

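All of this assumes, of course, that you don't actually need the user IDs after you have processed the data and, as pointed out in the comments below, that the numbers of products bought per user form a relatively small and contiguous set. Otherwise you might be better off using a map in place of the vector: you would still avoid calling multimap::count, but might lose some of the other benefits.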

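【Comments】:

"Adapt this code to grow the vector when required": the simplest way is a single line inside the loop, if (it->second >= uniqueCounts.size()) uniqueCounts.resize(it->second+1);. If some counts are too large for a vector (a user who has bought hundreds of millions of products?), consider a sparse container such as map instead of vector.
I guess it comes down to whether I need the intermediate data in the multimap (i.e. the mapping from count to user ID). I'm not sure right now whether I do, but if not, this seems like a good way to go.
@Component10 If you're really keen on having that data, you could take some kind of hybrid approach: a multimap with the user IDs and a vector for the counts. That may well be complete overkill, though (and probably not very efficient either).

【Solution 2】: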

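It depends on what you mean by "more efficient". First of all, is this really a bottleneck? Sure, 100k entries is a lot, but if you only have to do this every few minutes, it is fine for the algorithm to take a couple of seconds.

The only area for improvement I can see is memory usage. If that is a concern, you can skip building the multimap and just keep a map of counters, something like this (beware, my C++ is a little rusty):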

std::map<int, int> countFrequency; // count => how many customers with that count

for ( std::map<int, int>::const_iterator it = uc.begin();
        it != uc.end();  ++it )
{
    // If it->second is not yet in countFrequency, 
    // the default constructor initializes it to 0.
    countFrequency[it->second] += 1;
}

// Now write this out
for ( std::map<int, int>::const_iterator it = countFrequency.begin();
        it != countFrequency.end();  ++it )
{
    std::cout << "==> There are "
            << it->second << " users that have bought "
            << it->first << " products(s)" << std::endl;
}

If a user is added who has bought count items, you can update countFrequency with

countFrequency[count] += 1;

If an existing user goes from oldCount to newCount items, you can update countFrequency with

countFrequency[oldCount] -= 1;
countFrequency[newCount] += 1;

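As an aside, I recommend using unsigned int for the counts (unless there is a legitimate reason for negative counts) and typedef'ing a userID type for added readability.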
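Putting the incremental updates and the typedef suggestion together might look something like the following sketch. The names ProductCount, CountFrequency, addUser and moveUser are invented for illustration and do not appear in the answer. The sketch also erases buckets that drop to zero, which the answer does not do; that is an assumption made here so that the output loop never prints a "0 users" line.

```cpp
#include <cassert>
#include <map>

// Illustrative typedefs (hypothetical names, not from the original answer).
typedef unsigned int ProductCount;
typedef std::map<ProductCount, unsigned int> CountFrequency;

// A new user appears who has bought `count` products.
void addUser(CountFrequency &freq, ProductCount count)
{
    ++freq[count]; // operator[] creates the bucket with value 0 if missing
}

// An existing user moves from oldCount to newCount products.
void moveUser(CountFrequency &freq, ProductCount oldCount, ProductCount newCount)
{
    CountFrequency::iterator old = freq.find(oldCount);
    // With unsigned counters, decrementing a missing or empty bucket
    // would wrap around, so check that the bucket really exists.
    assert(old != freq.end() && old->second > 0);
    if (--old->second == 0)
    {
        freq.erase(old); // drop empty buckets so they are not printed
    }
    ++freq[newCount];
}
```

Looking the old bucket up with find rather than operator[] avoids silently creating an entry for a count that no user ever had.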

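【Comments】:

Yes, it depends heavily on whether the client will want a breakdown of product bands by user. It's not really a bottleneck (there is plenty of database work that is much slower), but it just felt inefficient. I agree with your points about typedef'ing and so on. The code is a simplified example that had to obscure the client's proprietary code, which is why I went with plain ints.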

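【Solution 3】:

If you can, I would recommend keeping both pieces of data current at all times. In other words, I would maintain a second map from the number of products bought to the number of customers who have bought that many products. As long as you maintain it, this map contains the exact answer to your question. Each time a customer buys a product, let n be the number of products that customer has now bought. Subtract one from the value at key n-1 and add one to the value at key n. If the range of keys is small enough, this could be an array instead of a map. Do you ever expect a single customer to buy hundreds of products?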
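One possible shape for this is a small class that owns both maps and updates them together. The sketch below is only an illustration: the class name PurchaseTracker and its member names are hypothetical, and it assumes that users with zero products are not tracked, so the "subtract one at key n-1" step is skipped for a customer's first purchase.

```cpp
#include <cassert>
#include <map>

// Hypothetical wrapper (all names are illustrative) that keeps the
// per-user counts and the count histogram in step with each other.
class PurchaseTracker
{
public:
    // Records that userId has bought one more distinct product.
    void recordPurchase(int userId)
    {
        int &n = productsPerUser_[userId]; // 0 for a user seen for the first time
        if (n > 0)
        {
            // The user leaves the bucket for their previous count (key n-1
            // in the answer's wording, since n there is the new count).
            std::map<int, int>::iterator prev = usersPerCount_.find(n);
            assert(prev != usersPerCount_.end());
            if (--prev->second == 0)
            {
                usersPerCount_.erase(prev);
            }
        }
        ++n;
        ++usersPerCount_[n];
    }

    // Number of users who have bought exactly n products.
    int usersWithCount(int n) const
    {
        std::map<int, int>::const_iterator it = usersPerCount_.find(n);
        return it == usersPerCount_.end() ? 0 : it->second;
    }

private:
    std::map<int, int> productsPerUser_; // user id -> products bought
    std::map<int, int> usersPerCount_;   // product count -> number of users
};
```

Each purchase then costs a few logarithmic-time map operations, and the answer to "how many users have n products" is available at any moment without a separate batch pass.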

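【Comments】:

That's a fair point. Encapsulating the two collections in a single object that keeps them in sync would be a useful approach. The process is actually a one-off batch job, and the product-count feature is a new requirement from the client, which is why it wasn't designed in from the start. Hope that gives some context.

【Solution 4】:

Just for fun, here is a hybrid approach that uses a vector if the data is smallish, and a map to cover the case where one user has bought a truly absurd number of products. I doubt you will really need the latter in a shop application, but a more general version of the problem might benefit from it.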

#include <algorithm>
#include <iostream>
#include <map>
#include <vector>

typedef std::map<int, int> Map;
typedef Map::const_iterator It;

// As an alternative to this overloaded contains(), you could write
// an overloaded print_counts -- after all the one below is not an
// efficient way to iterate a sparsely-populated map.
// Or you might prefer a template function that visits
// each entry in the container, calling a specified functor that
// prints the output, and passing it the key and value.
// This is just the smallest point of customization I thought of.
// Both overloads are declared before print_counts so that it can find them.
bool contains(const Map &c, int key) {
    return c.count(key) != 0;
}
bool contains(const std::vector<int> &c, int key) {
    // also check 0 <= key && key < c.size() for a more general-purpose function
    return c[key] != 0;
}

template <typename Container>
void get_counts(const Map &source, Container &dest) {
    for (It it = source.begin(); it != source.end(); ++it) {
        ++dest[it->second];
    }
}

template <typename Container>
void print_counts(Container &people, int max_count) {
    for (int i = 0; i <= max_count; ++i) {
        if (contains(people, i)) {
            std::cout << "==> There are "
                << people[i] << " users that have bought "
                << i << " products(s)" << std::endl;
        }
    }
}

void do_everything(const Map &uc) {
    // first get the max product count
    int max_count = 0;
    for (It it = uc.begin(); it != uc.end(); ++it) {
        max_count = std::max(max_count, it->second);
    }

    // max_count is never negative here, so the cast is safe
    if (static_cast<Map::size_type>(max_count) > uc.size()) { // or some other threshold
        Map counts;
        get_counts(uc, counts);
        print_counts(counts, max_count);
    } else {
        std::vector<int> counts(max_count+1);
        get_counts(uc, counts);
        print_counts(counts, max_count);
    }
}

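From here you could refactor to create a class template CountReOrderer that takes a template parameter telling it whether to use a vector or a map for the counts.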
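The answer only names CountReOrderer and does not show it, so the following is one guess at what such a refactoring might look like, not the answerer's design. The helper overloads prepare and lookup, and the member names usersWith and maxCount, are all invented for this sketch. It assumes counts are non-negative.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

// Hypothetical customization points: a map needs no preparation,
// while a vector must be sized before it is indexed.
inline void prepare(std::map<int, int> &, int) {}
inline void prepare(std::vector<int> &v, int max_count)
{
    v.assign(static_cast<std::size_t>(max_count) + 1, 0);
}

// Read-only lookups that return 0 for counts nobody has.
inline int lookup(const std::map<int, int> &c, int key)
{
    std::map<int, int>::const_iterator it = c.find(key);
    return it == c.end() ? 0 : it->second;
}
inline int lookup(const std::vector<int> &c, int key)
{
    return (key >= 0 && static_cast<std::size_t>(key) < c.size()) ? c[key] : 0;
}

// Container is either std::vector<int> (dense) or std::map<int, int> (sparse).
template <typename Container>
class CountReOrderer
{
public:
    explicit CountReOrderer(const std::map<int, int> &uc) : max_count_(0)
    {
        std::map<int, int>::const_iterator it;
        for (it = uc.begin(); it != uc.end(); ++it)
        {
            assert(it->second >= 0); // counts index the vector directly
            if (it->second > max_count_) max_count_ = it->second;
        }
        prepare(counts_, max_count_);
        for (it = uc.begin(); it != uc.end(); ++it)
        {
            ++counts_[it->second];
        }
    }

    // Number of users who bought exactly n products.
    int usersWith(int n) const { return lookup(counts_, n); }
    int maxCount() const { return max_count_; }

private:
    Container counts_;
    int max_count_;
};
```

A caller would then pick the storage at the point of use, for example CountReOrderer<std::vector<int> > when the counts are small and dense, or CountReOrderer<std::map<int, int> > when a few counts may be very large.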

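【Comments】:

Thanks. I think it's unlikely that they will want counts beyond the contiguous range that can be managed in a vector (although if their users were buying millions of products, that would be quite something!). Thanks also for highlighting a scalability issue that I had not addressed: it is perhaps wise not to assume that the size of my initial (input) map could be unbounded, although I admit I'm not going to code for that just yet!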