[Posted]: 2020-09-29 05:14:09
[Question]:
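For a machine learning project, I want to split my data into training and test sets so that the proportions of a particular group stay the same in both sets. I created a dummy data.frame with 40 rows to illustrate. Here, for the "Region" group, 20% of the data is "North America", 50% is "Europe", 20% is "Asia", and 10% is "Oceania". I would like to get a random subset, say 25% of the whole data, in which the percentage composition of the "Region" group stays the same.
In other words, I would like to start from this: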
   City          Country     Region
 1 Shanghai      China       Asia
 2 Tokyo         Japan       Asia
 3 Osaka         Japan       Asia
 4 Hanoi         Vietnam     Asia
 5 Beijing       China       Asia
 6 Sapporo       Japan       Asia
 7 Tottori       Japan       Asia
 8 Saigon        Vietnam     Asia
 9 Rome          Italy       Europe
10 Paris         France      Europe
11 Lisbon        Portugal    Europe
12 Berlin        Germany     Europe
13 Madrid        Spain       Europe
14 Vienna        Austria     Europe
15 Naples        Italy       Europe
16 Nice          France      Europe
17 Porto         Portugal    Europe
18 Frankfurt     Germany     Europe
19 Sevilla       Spain       Europe
20 Salzburg      Austria     Europe
21 Barcelona     Spain       Europe
22 Amsterdam     Netherlands Europe
23 Bern          Switzerland Europe
24 Milan         Italy       Europe
25 San Sebastian Spain       Europe
26 Rotterdam     Netherlands Europe
27 Zurich        Switzerland Europe
28 Turin         Italy       Europe
29 New York City US          North America
30 Toronto       Canada      North America
31 Mexico City   Mexico      North America
32 Atlanta       US          North America
33 Chicago       US          North America
34 Atlanta       US          North America
35 Vancouver     Canada      North America
36 Guadalajara   Mexico      North America
37 Sydney        Australia   Oceania
38 Wellington    New Zealand Oceania
39 Melbourne     Australia   Oceania
40 Auckland      New Zealand Oceania
and end up with this (it is important to me that the rows are selected at random):
   City        Country     Region
 1 New York    US          North America
 2 Mexico City Mexico      North America
 3 Amsterdam   Netherlands Europe
 4 Madrid      Spain       Europe
 5 Lisbon      Portugal    Europe
 6 Rome        Italy       Europe
 7 Paris       France      Europe
 8 Tokyo       Japan       Asia
 9 Osaka       Japan       Asia
10 Wellington  New Zealand Oceania
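One possible way to get such a subset is stratified sampling: draw the same fraction of rows at random within each Region. The following is only a minimal base R sketch under some assumptions, not a definitive answer. The names df, id, frac, idx_by_region, test_set and train_set are made up for illustration, and the data frame is a reduced stand-in for the 40-row example that keeps only the regional counts (8, 20, 8, 4).

```r
# Reduced stand-in for the 40-row example: only Region matters for stratification.
# All object names here are hypothetical, chosen for illustration.
set.seed(42)  # fixed seed so the random draw can be reproduced

df <- data.frame(
  id = 1:40,
  Region = rep(c("Asia", "Europe", "North America", "Oceania"),
               times = c(8, 20, 8, 4))
)

frac <- 0.25  # share of rows to draw from every region

# Group the row indices by Region, then sample the same fraction in each group.
# sample.int() is used on the group size so that a group with a single row
# is not misread by sample() as the range 1:n.
idx_by_region <- split(seq_len(nrow(df)), df$Region)
test_idx <- unlist(lapply(idx_by_region, function(idx) {
  idx[sample.int(length(idx), size = round(length(idx) * frac))]
}), use.names = FALSE)

test_set  <- df[test_idx, ]   # the 25% subset
train_set <- df[-test_idx, ]  # the remaining 75%

# Compare the regional shares in the full data and in the subset.
prop.table(table(df$Region))
prop.table(table(test_set$Region))
```

With these counts the subset holds 2, 5, 2 and 1 rows per region, so the shares match the full data exactly. When a group size times frac is not a whole number, rounding makes the proportions only approximately equal. Packages offer comparable functionality, for example dplyr's group_by() followed by slice_sample(prop = ...) or caret's createDataPartition(), if adding a dependency is acceptable.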
[Comments]:
Tags: r machine-learning train-test-split