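【Posted】: 2022-01-24 22:46:53
【Problem description】:
Given a distribution over classes and a dataframe of example rows for those classes, is there a simple/fast way to sample from the dataframe so that the sample matches the given distribution, where a class without enough examples reduces the number of examples drawn from the other classes?
For example: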
+------+------+-------+
| col1 | col2 | class |
+------+------+-------+
| 4    | 45   | A     |
+------+------+-------+
| 5    | 66   | B     |
+------+------+-------+
| 5    | 6    | C     |
+------+------+-------+
| 4    | 6    | A     |
+------+------+-------+
| 321  | 1    | A     |
+------+------+-------+
| 32   | 432  | A     |
+------+------+-------+
| 5    | 3    | B     |
+------+------+-------+
Given a dataframe like the one above and a distribution like the one below:
+-------+------------+
| class | proportion |
+-------+------------+
| A     | 0.50       |
+-------+------------+
| B     | 0.25       |
+-------+------------+
| C     | 0.25       |
+-------+------------+
I would like to return something like:
+------+------+-------+
| col1 | col2 | class |
+------+------+-------+
| 5    | 66   | B     |
+------+------+-------+
| 5    | 6    | C     |
+------+------+-------+
| 4    | 6    | A     |
+------+------+-------+
| 32   | 432  | A     |
+------+------+-------+
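To make the requirement concrete, here is a rough sketch of one way it might be done with pandas. It is only an illustration of the behaviour described above, not a known-best solution: the function name `sample_to_distribution` and its parameters are hypothetical, and it assumes the proportions are given as a plain dict that sums to 1. The idea is to find the class that runs out of rows first, use it to fix the largest possible sample size, and then draw each class's share at random.

```python
import pandas as pd


def sample_to_distribution(df, proportions, class_col="class", random_state=None):
    """Sample rows so class shares follow `proportions` (hypothetical helper).

    The scarcest class (relative to its target share) caps the total size.
    """
    counts = df[class_col].value_counts()
    # Largest total n such that n * p fits within the rows available for every class.
    total = min(counts.get(cls, 0) / p for cls, p in proportions.items() if p > 0)
    parts = []
    for cls, p in proportions.items():
        # Round down; the small epsilon guards against float error such as 1.9999999.
        n = min(int(total * p + 1e-9), int(counts.get(cls, 0)))
        parts.append(df[df[class_col] == cls].sample(n=n, random_state=random_state))
    return pd.concat(parts)


# The example data from the question.
df = pd.DataFrame(
    {
        "col1": [4, 5, 5, 4, 321, 32, 5],
        "col2": [45, 66, 6, 6, 1, 432, 3],
        "class": ["A", "B", "C", "A", "A", "A", "B"],
    }
)
proportions = {"A": 0.50, "B": 0.25, "C": 0.25}

sample = sample_to_distribution(df, proportions, random_state=0)
print(sample)
```

With the data above, class C has only one row, so the sample is capped at four rows: two from A, one from B and one from C. Which A and B rows are picked depends on `random_state`. Because per-class counts are rounded down, the result can be slightly smaller than the theoretical maximum when the proportions do not divide evenly.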
【Discussion】: