How to split a dataset in a stratified way (Matlab)? Answers

【Question title】: How to split the dataset in a stratified way (Matlab)?
【Posted】: 2020-11-19 17:15:17
【Question】:

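I have a 30x500 data matrix and a 3x500 target matrix. This is a classification problem. I need to split the data into training, validation and test sets (80%, 10%, 10%), but I want to preserve the proportion of each class in every split. How can I do this in Matlab?

Edit: the target matrix contains the one-hot labels of the correct class (there are three classes):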

|0 0 1 ... 1|
|1 0 0 ... 0|
|0 1 0 ... 0|3x500

The data matrix contains 500 samples and 30 predictors (30x500).

|2 0 1 4 8 1 ... 2|
|4 1 5 8 7 3 ... 0|
|1 3 6 4 2 1 ... 6|
|. . . . . . . . .|
|3 5 8 4 0 0 ... 1| 30x500

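【Comments】:

Can you describe what you call a class? What exactly does the target matrix contain?
I edited the question and added that information. The target matrix is a one-hot matrix of 0s and 1s. There are three possible classes.
But do you want the proportions to hold on average over many random splits? Or do you want to enforce the proportions in every single split?
I downvoted because the images are text.
Thanks for fixing your question. I upvoted to compensate.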

Tags: matlab neural-network dataset

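【Solution 1】:

You can fix the number of samples in each class by computing the percentiles associated with the cumulative proportions and then assigning each resulting interval to its class. Let's use this strategy first to create a random dataset with exactly 40% of class 1, 20% of class 2 and 40% of class 3 (note that you do not have to do this step, since you already have the class matrix classmat and the data matrix datamat):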

clc; clear all; close all;

% Parameters
c = 3; % number of classes
d = 500; % number of samples (columns)
o = 30; % number of predictors per sample (rows)
propd = [0.4, 0.2, 0.4]; % proportion of each class in the original data (size 1xc)

% Generation of fake data
datamat = randi([0,15],o,d); % test data matrix
propd_os = cumsum([0, propd]); propd_os(end) = 1; % cumulative class proportions
randmat = rand(d,1);
% Percentiles of the random draws at the cumulative proportions (class boundaries)
prctld = prctile(randmat, 100*propd_os); prctld(1) = 0; prctld(end) = 1;
classmat = zeros(c,d); % test class matrix (one-hot)
for i=1:c
    classmat(i,randmat>=prctld(i) & randmat<prctld(i+1)) = 1;
end

% Proportions of the original data
disp(['original data proportions:' sprintf(' %.3f',sum(classmat,2)/d)])

When executed, this code creates the classmat matrix and displays the proportion of each class in it:

>>>
original data proportions: 0.400 0.200 0.400

Here is a script that splits this dataset into parts with the same class proportions as the original dataset:

%% Splitting parameters
s = 3; % number of parts
props = [0.8, 0.1, 0.1]; % proportion of the data in each part (size 1xs)
props_os = cumsum([0, props]); props_os(end) = 1;
randmat = rand(1,d);
splitmat = zeros(s,d); % split matrix (one hot)
for i=1:c
    indc = classmat(i,:)==1;
    prctls = prctile(randmat(indc), 100*props_os); prctls(1) = 0; prctls(end) = 1;
    for j=1:s
        inds = randmat>=prctls(j) & randmat<prctls(j+1);
        splitmat(j,indc&inds) = 1;
    end
end

% Class proportions in each part
disp(['original split proportions:' sprintf(' %.3f',sum(splitmat,2)/d)])
for j=1:s
    inds = splitmat(j,:)==1;
    disp([sprintf('part %d proportions:', j) sprintf(' %.3f',sum(classmat(:,inds),2)/sum(inds))])
end

You get 3 parts containing 80%, 10% and 10% of the data. Each of them has the same class proportions as the original dataset:

>>>
original split proportions: 0.800 0.100 0.100
part 1 proportions: 0.400 0.200 0.400
part 2 proportions: 0.400 0.200 0.400
part 3 proportions: 0.400 0.200 0.400
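To go from the one-hot split matrix to actual training, validation and test matrices, each row of splitmat can be used as a logical column mask. The snippet below is only a sketch: the tiny datamat, classmat and splitmat are hypothetical stand-ins (6 samples) so that it runs on its own, and the names Xtrain, Ttrain and so on are not from the answer.

```matlab
% Toy stand-ins for datamat, classmat and splitmat (hypothetical values,
% only 6 samples so the indexing is easy to follow).
datamat  = magic(6);             % 6 predictors x 6 samples
classmat = [1 0 0 1 0 0;
            0 1 0 0 1 0;
            0 0 1 0 0 1];        % one-hot classes, 3 x 6
splitmat = [1 1 1 1 0 0;
            0 0 0 0 1 0;
            0 0 0 0 0 1];        % one-hot part membership, 3 x 6

% Logical column masks, one per part
isTrain = splitmat(1,:) == 1;
isVal   = splitmat(2,:) == 1;
isTest  = splitmat(3,:) == 1;

% Samples are stored as columns, so the mask goes in the second dimension
Xtrain = datamat(:, isTrain);  Ttrain = classmat(:, isTrain);
Xval   = datamat(:, isVal);    Tval   = classmat(:, isVal);
Xtest  = datamat(:, isTest);   Ttest  = classmat(:, isTest);

fprintf('train/val/test sizes: %d %d %d\n', ...
        size(Xtrain,2), size(Xval,2), size(Xtest,2));
```

With the real 30x500 and 3x500 matrices the same indexing applies unchanged, since only the number of columns differs.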

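Note that you cannot always get the exact proportions, since this depends on the size of the dataset and on whether the class counts are divisible by the inverse of the proportions... but I think it should do what you want. If you have any questions about the code, feel free to ask. In the end you get a one-hot splitmat matrix: splitmat(s,d) equals 1 if data point d belongs to part s.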
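For comparison, here is a sketch of a different way to build the same kind of one-hot split matrix. It is not the method used above: instead of percentiles it shuffles the samples of each class with randperm and cuts them at rounded positions, which makes the rounding explicit (each per-class count can differ from the ideal value by about one sample). The random labels, the fixed seed and the variable names idx and edges are assumptions made for illustration.

```matlab
rng(0);                                     % fixed seed, only for reproducibility
c = 3; d = 500;
labels = randi(c, 1, d);                    % hypothetical class index per sample
classmat = zeros(c, d);
classmat(sub2ind([c d], labels, 1:d)) = 1;  % one-hot class matrix, c x d

props = [0.8, 0.1, 0.1];                    % share of each part
s = numel(props);
splitmat = zeros(s, d);                     % one-hot part membership, s x d

for i = 1:c
    idx = find(classmat(i,:) == 1);         % columns belonging to class i
    idx = idx(randperm(numel(idx)));        % shuffle within the class
    n = numel(idx);
    edges = round(cumsum([0, props]) * n);  % cut points in whole samples
    edges(end) = n;                         % guard against rounding drift
    for j = 1:s
        splitmat(j, idx(edges(j)+1:edges(j+1))) = 1;
    end
end

disp(sum(splitmat, 2)');                    % number of samples in each part
```

When a class has very few samples, the smaller parts may receive none of them, so it is worth checking the per-class counts before training.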

【Comments】: