Sampling a dataframe based on a given distribution: answers

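【Question title】: Sampling a dataframe based on a given distribution
【Posted】: 2016-01-10 21:06:00
【Question】:

How can I sample a pandas dataframe or a graphlab SFrame according to a given class/label distribution? For example, I want to sample a dataframe that has a label/class column so that every class label is drawn equally often, giving each label a similar frequency, i.e. a uniform distribution over the class labels. Better still would be to draw the sample according to whatever class distribution we ask for.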

+------+------+-------+
| col1 | col2 | class |
+------+------+-------+
| 4    | 45   | A     |
+------+------+-------+
| 5    | 66   | B     |
+------+------+-------+
| 5    | 6    | C     |
+------+------+-------+
| 4    | 6    | C     |
+------+------+-------+
| 321  | 1    | A     |
+------+------+-------+
| 32   | 432  | B     |
+------+------+-------+
| 5    | 3    | B     |
+------+------+-------+

Given a huge dataframe like the one above and a desired frequency distribution like the following:

+-------+--------------+
| class | nostoextract |
+-------+--------------+
| A     | 2            |
+-------+--------------+
| B     | 2            |
+-------+--------------+
| C     | 2            |
+-------+--------------+

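The above should extract rows from the first dataframe according to the frequency distribution given in the second frame, where the counts are given in the nostoextract column, yielding a sampled frame in which each class appears at most 2 times. If there are not enough rows of a class to meet the requested count, it should ignore the shortfall and continue. The resulting dataframe will be used for a decision-tree-based classifier.

As a commenter put it: the sampled dataframe must contain nostoextract distinct instances of the corresponding class, unless a given class does not have enough examples, in which case you simply take all the examples available.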

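【Comments on the question】:

Can you add some examples of what you want to achieve? Have you looked at pandas.DataFrame.sample? (pandas.pydata.org/pandas-docs/stable/generated/…)
@chris-sc Yes, it does not allow sampling based on a class column.
Basically I want to sample a skewed dataframe so that all class labels are represented as fully as possible. The class labels are in the "label" column. The result is fed to a classifier. @chris-sc
I think you want StratifiedKFold; it returns an iterator that keeps an even split of the data for each class label.
Sorry, could you post sample code and the desired output? I don't quite understand what you want.

Tags: python pandas graphlab sframe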

【Solution 1】:

I think this will solve your problem:

import pandas as pd

data = pd.DataFrame({'cols1':[4, 5, 5, 4, 321, 32, 5],
                     'clol2':[45, 66, 6, 6, 1, 432, 3],
                     'class':['A', 'B', 'C', 'C', 'A', 'B', 'B']})

freq = pd.DataFrame({'class':['A', 'B', 'C'],
                     'nostoextract':[2, 2, 2], })

def bootstrap(data, freq):
    freq = freq.set_index('class')

    # This function will be applied on each group of instances of the same
    # class in `data`.
    def sampleClass(classgroup):
        cls = classgroup['class'].iloc[0]
        nDesired = freq.nostoextract[cls]
        nRows = len(classgroup)

        nSamples = min(nRows, nDesired)
        return classgroup.sample(nSamples)

    samples = data.groupby('class').apply(sampleClass)

    # If you want a new index with ascending values
    # samples.index = range(len(samples))

    # If you want an index which is equal to the row in `data` where the sample
    # came from
    samples.index = samples.index.get_level_values(1)

    # If you don't change it then you'll have a multiindex with level 0
    # being the class and level 1 being the row in `data` where
    # the sample came from.

    return samples

print(bootstrap(data,freq))

This prints (the rows chosen can vary between runs, since the sampling is random):

  class  clol2  cols1
0     A     45      4
4     A      1    321
1     B     66      5
5     B    432     32
3     C      6      4
2     C      6      5

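If you don't want the result ordered by class, you can permute it at the end.

【Comments】:

Thanks. Can the same be done with an SFrame (graphlab)?
@stackit, no idea... they seem to have the same interface. Have you tried it?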
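One way to do that final permutation, as a minimal sketch: `DataFrame.sample(frac=1)` returns every row exactly once in random order. The `samples` frame below is only a stand-in for the result of `bootstrap(data, freq)`, and `random_state=0` is an assumption added to make the sketch reproducible.

```python
import pandas as pd

# Stand-in for the class-ordered result returned by bootstrap(data, freq).
samples = pd.DataFrame(
    {'class': ['A', 'A', 'B', 'B', 'C', 'C'],
     'clol2': [45, 1, 66, 432, 6, 6],
     'cols1': [4, 321, 5, 32, 4, 5]},
    index=[0, 4, 1, 5, 3, 2])

# frac=1 draws all rows without replacement, i.e. a random permutation.
# random_state only makes the shuffle repeatable; drop it for a fresh order.
shuffled = samples.sample(frac=1, random_state=0)

print(shuffled)
```

The original row labels are kept, so each sampled row can still be traced back to `data`. Call `reset_index(drop=True)` afterwards if a fresh 0..n-1 index is preferred.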

【Solution 2】:

Could you split your first dataframe into class-specific sub-dataframes and then sample from each of them as you see fit?

i.e.

dfa = df[df['class']=='A']
dfb = df[df['class']=='B']
dfc = df[df['class']=='C']
....

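Then, once dfa, dfb, dfc have been split off/created/filtered, take as many rows as you need from the top (if the dataframe has no particular sort order):

dfasamplefive = dfa[:5]

Or use the sample method mentioned by the earlier commenter to draw a random sample directly: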

dfasamplefive = dfa.sample(n=5)

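If this fits your needs, all that is left is to automate the process, taking the number of rows to sample for each class from the control dataframe you have, i.e. the second dataframe containing the desired sample counts.

【Comments】:

Yes, you're quite right, thanks! [edited accordingly]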
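That automation step could look roughly like the sketch below. It is not the answerer's code: the helper name `sample_by_class` and the extra class 'Z' are made up for illustration, and the column names follow the question's tables. Capping the request with `min` matters because `DataFrame.sample(n=...)` raises an error when asked for more rows than exist (without replacement).

```python
import pandas as pd

df = pd.DataFrame({'col1': [4, 5, 5, 4, 321, 32, 5],
                   'col2': [45, 66, 6, 6, 1, 432, 3],
                   'class': ['A', 'B', 'C', 'C', 'A', 'B', 'B']})

# Control frame: how many rows to draw per class.
# 'Z' does not occur in df and C has fewer rows than requested.
freq = pd.DataFrame({'class': ['A', 'B', 'C', 'Z'],
                     'nostoextract': [2, 2, 5, 1]})

def sample_by_class(df, freq, random_state=None):
    """Hypothetical helper: draw up to the requested number of rows per class."""
    parts = []
    for cls, n_wanted in zip(freq['class'], freq['nostoextract']):
        subset = df[df['class'] == cls]
        # Never ask for more rows than the class actually has.
        n = min(int(n_wanted), len(subset))
        if n == 0:
            continue  # class absent, or nothing requested: skip it
        parts.append(subset.sample(n=n, random_state=random_state))
    return pd.concat(parts)

result = sample_by_class(df, freq, random_state=0)
print(result['class'].value_counts().sort_index())
```

Classes with too few rows contribute everything they have, which matches the "ignore and continue" behaviour asked for in the question. The sketch assumes at least one class yields rows; `pd.concat` fails on an empty list.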

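【Solution 3】:

Here is a solution for SFrame. It is not exactly what you want, because it samples points at random, so the result will not necessarily have the exact number of rows you specified. An exact method would probably shuffle the data randomly and then take the first k rows for a given class, but this will get you pretty close.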

import random
import graphlab as gl

## Construct data.
sf = gl.SFrame({'col1': [4, 5, 5, 4, 321, 32, 5],
                'col2': [45, 66, 6, 6, 1, 432, 3],
                'class': ['A', 'B', 'C', 'C', 'A', 'B', 'B']})

freq = gl.SFrame({'class': ['A', 'B', 'C'],
                  'number': [3, 1, 0]})

## Count how many instances of each class and compute a sampling
#  probability.
grp = sf.groupby('class', gl.aggregate.COUNT)
freq = freq.join(grp, on='class', how='left')
freq['prob'] = freq.apply(lambda x: float(x['number']) / x['Count'])

## Join the sampling probability back to the original data.
sf = sf.join(freq[['class', 'prob']], on='class', how='left')

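## Sample the original data, then subset. random.random() lies in [0, 1),
#  so a strict '<' never keeps rows with prob 0 and always keeps prob >= 1.
sf['sample_mask'] = sf.apply(lambda x: 1 if random.random() < x['prob']
                             else 0)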
sf2 = sf[sf['sample_mask'] == 1]

In my sample run I happened to get exactly the number of samples I specified, but again, this solution does not guarantee that.

>>> sf2
+-------+------+------+
| class | col1 | col2 |
+-------+------+------+
|   A   |  4   |  45  |
|   A   | 321  |  1   |
|   B   |  32  | 432  |
+-------+------+------+
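The exact variant mentioned at the top of this answer (shuffle, then take the first k rows of each class) can be sketched as follows. This is a pandas analogue rather than SFrame code, under the assumption that the data fits in a pandas DataFrame; the names `wanted`, `position`, and `quota` are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({'col1': [4, 5, 5, 4, 321, 32, 5],
                   'col2': [45, 66, 6, 6, 1, 432, 3],
                   'class': ['A', 'B', 'C', 'C', 'A', 'B', 'B']})

# Same requested counts as the 'number' column in the SFrame example.
wanted = {'A': 3, 'B': 1, 'C': 0}

# 1. Shuffle all rows (random_state only makes the sketch repeatable).
shuffled = df.sample(frac=1, random_state=0)

# 2. Number the rows inside each class in shuffled order: 0, 1, 2, ...
position = shuffled.groupby('class').cumcount()

# 3. Keep a row if its position is below the count requested for its class.
#    Classes missing from `wanted` get a quota of 0.
quota = shuffled['class'].map(wanted).fillna(0)
exact = shuffled[position < quota]

print(exact.sort_values('class'))
```

Unlike the probabilistic mask, this always returns min(requested, available) rows per class: here both A rows (only two exist), one B row, and no C rows.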

【Comments】: