Python: downsample a population using frequency data

【Question title】: Python: downsample a population using frequency data
【Posted】: 2013-06-13 13:42:36
【Question】:

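Given a data series representing the frequencies of the elements in a population, what is the simplest way to downsample it?

The following population: pop = ['a', 'b', 'a', 'c', 'c', 'd', 'c', 'a', 'a', 'b', 'a']

can be summarized as: freq = {'a': 5, 'c': 3, 'b': 2, 'd': 1}

which is easily obtained with: from collections import Counter; Counter(pop)

To randomly downsample this population to 5 individuals, I can do: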

>>> from random import sample
>>> from collections import Counter
>>> pop = ['a', 'b', 'a', 'c', 'c', 'd', 'c', 'a', 'a', 'b', 'a']
>>> smaller_pop = sample(pop, 5)
>>> smaller_freq = Counter(smaller_pop)
>>> print(smaller_freq)
Counter({'a': 3, 'c': 1, 'b': 1})

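But I am looking for a way to do this directly from the freq information, without having to build the pop list. You will agree that a procedure like the following should not be necessary: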

>>> from random import sample
>>> from collections import Counter
>>> flatten = lambda x: [item for sublist in x for item in sublist]
>>> freq = {'a': 5, 'c': 3, 'b': 2, 'd': 1}
>>> pop = flatten([[k]*v for k,v in freq.items()])
>>> smaller_pop = sample(pop, 5)
>>> smaller_freq = Counter(smaller_pop)
>>> print(smaller_freq)
Counter({'a': 2, 'c': 2, 'd': 1})

For memory and speed reasons, I want to avoid holding the pop list in memory. Surely this can be done with some kind of weighted random generator.

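【Comments】:

Are you interested in an approximate solution? Since you want to downsample, the values in freq are presumably very large. If so, you could floor-divide (//) all of them by some factor, then enumerate the smaller freq dict and sample from that.
I am looking for an exact algorithm.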
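
For reference, the approximate idea from the first comment could be sketched roughly as follows. This is only an illustration of that suggestion, not an exact method: the function name approx_downsample and the factor parameter are made up here, and any element whose count is smaller than factor disappears from the reduced population entirely.

```python
import random
from collections import Counter

def approx_downsample(freq, n, factor):
    """Approximate downsampling: shrink the counts first, then sample.

    `factor` is a hypothetical scaling parameter; it is not part of the
    original question.
    """
    # Floor-divide every count; elements rarer than `factor` drop out.
    reduced = {k: v // factor for k, v in freq.items() if v // factor > 0}
    # The reduced population is small enough to enumerate explicitly.
    pop = [k for k, v in reduced.items() for _ in range(v)]
    if n > len(pop):
        raise ValueError("n is larger than the reduced population")
    # Sample without replacement from the reduced population.
    return Counter(random.sample(pop, n))

big_freq = {'a': 5000, 'c': 3000, 'b': 2000, 'd': 1000}
print(approx_downsample(big_freq, 5, 1000))
```

The output is random. It only approximates sampling from the full population, because the reduced counts limit how many times each element can be drawn.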

Tags: python random sampling downsampling

【Solution 1】:

Here is a basic algorithm that downsamples the frequencies:

import random
import bisect
import collections

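def downsample(freq, n):
    # Build the cumulative sums of the counts.
    cumsums = []
    total = 0
    choices, weights = zip(*freq.items())
    for weight in weights:
        total += weight
        cumsums.append(total)
    assert 0 <= n <= total
    result = collections.Counter()
    for _ in range(n):
        # Pick one of the remaining individuals. An integer in [0, total)
        # always maps to a valid index in cumsums.
        rnd = random.randrange(total)
        i = bisect.bisect(cumsums, rnd)
        result[choices[i]] += 1
        # Remove that individual: every cumulative sum from i onward drops by 1.
        cumsums = [c if idx < i else c - 1 for idx, c in enumerate(cumsums)]
        total -= 1
    return result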

freq = {'a': 5, 'c': 3, 'b': 2, 'd': 1}
print(downsample(freq, 5))

which prints something like:

Counter({'c': 2, 'a': 1, 'b': 1, 'd': 1})
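
A possible variation, which is not the code from this answer: draw n distinct positions in the virtual flattened population and map each one back to its element with bisect. In Python 3, random.sample(range(total), n) does not build the population in memory, and the cumulative sums do not need updating after each draw. The name downsample_by_index is hypothetical.

```python
import bisect
import itertools
import random
from collections import Counter

def downsample_by_index(freq, n):
    """Sample n individuals without replacement, using only the counts."""
    choices = list(freq)
    # cumsums[i] is the number of individuals belonging to choices[0..i].
    cumsums = list(itertools.accumulate(freq[c] for c in choices))
    total = cumsums[-1] if cumsums else 0
    if not 0 <= n <= total:
        raise ValueError("n must be between 0 and the population size")
    result = Counter()
    # range(total) is lazy, so the full population is never materialized.
    for position in random.sample(range(total), n):
        # The first cumulative sum strictly greater than position marks
        # the element that owns that position.
        i = bisect.bisect_right(cumsums, position)
        result[choices[i]] += 1
    return result

freq = {'a': 5, 'c': 3, 'b': 2, 'd': 1}
print(downsample_by_index(freq, 5))
```

The output is random. Each draw costs one binary search, so the work is about n lookups over the cumulative sums, plus whatever random.sample needs to track the n chosen positions.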

【Comments】:

That's it, using bisection. Now that I see it, it seems logical. Thanks.