MATLAB: 10-fold cross-validation without using existing functions (answers)

【Question title】: MATLAB: 10 fold cross Validation without using existing functions
【Posted】: 2012-09-19 18:21:42
【Question description】:

I have a matrix (I guess in MATLAB you call it a struct) or data structure:

  data: [150x4 double]
labels: [150x1 double]

This is what my matrix.data looks like, assuming I loaded my file under the name matrix:

5.1000    3.5000    1.4000    0.2000
4.9000    3.0000    1.4000    0.2000
4.7000    3.2000    1.3000    0.2000
4.6000    3.1000    1.5000    0.2000
5.0000    3.6000    1.4000    0.2000
5.4000    3.9000    1.7000    0.4000
4.6000    3.4000    1.4000    0.3000
5.0000    3.4000    1.5000    0.2000
4.4000    2.9000    1.4000    0.2000
4.9000    3.1000    1.5000    0.1000
5.4000    3.7000    1.5000    0.2000
4.8000    3.4000    1.6000    0.2000
4.8000    3.0000    1.4000    0.1000
4.3000    3.0000    1.1000    0.1000
5.8000    4.0000    1.2000    0.2000
5.7000    4.4000    1.5000    0.4000
5.4000    3.9000    1.3000    0.4000
5.1000    3.5000    1.4000    0.3000
5.7000    3.8000    1.7000    0.3000
5.1000    3.8000    1.5000    0.3000

And this is what my matrix.labels looks like:

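I am trying to implement 10-fold cross-validation without using any of the existing functions in MATLAB, and because my knowledge of MATLAB is very limited, I am having trouble getting further than what I have. Any help would be great.

This is what I have so far. I am sure it is probably not the MATLAB way of doing it, but I am very new to MATLAB.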

function[output] = fisher(dataFile, number_of_folds)
    data = load(dataFile);
    %create random permutation indx
    idx = randperm(150);
    output = data.data(idx(1:15),:);
end

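【Question comments】:

Try this: mathworks.com/help/bioinfo/ref/crossvalind.html
Then use the generated indices on your matrix.
Sorry, I want to do it without using the crossvalind function.
Oh, sorry! I did not read the question properly. Hmm, one second.
:) No problem. If you are writing up a solution, any comments would be great, because I am trying to learn MATLAB.

Tags: matlab machine-learning cross-validation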

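【Solution 1】:

Here is my take on this cross-validation. I create dummy data using magic(10) and generate the labels randomly. The idea is as follows: we take the data and labels and combine them with a random column. Consider the following dummy code.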

>> data = magic(4)

data =

    16     2     3    13
     5    11    10     8
     9     7     6    12
     4    14    15     1

>> dataRowNumber = size(data,1)

dataRowNumber =

     4

>> randomColumn = rand(dataRowNumber,1)

randomColumn =

    0.8147
    0.9058
    0.1270
    0.9134


>> X = [ randomColumn data]

X =

    0.8147   16.0000    2.0000    3.0000   13.0000
    0.9058    5.0000   11.0000   10.0000    8.0000
    0.1270    9.0000    7.0000    6.0000   12.0000
    0.9134    4.0000   14.0000   15.0000    1.0000

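If we sort the rows of X by column 1, the data ends up in random order. This gives us the randomness needed for cross-validation. The next step is to divide X according to the cross-validation percentage. Doing this for a single split is easy. Say 75% of the rows are for training and 25% are for testing. Our size here is 4, so 3/4 = 75% and 1/4 = 25%.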

sortedX = sortrows(X, 1);        % order whole rows by the random column
testDataset = sortedX(1,:)
trainDataset = sortedX(2:4,:)
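
The same shuffle can also be obtained without attaching an extra column, by keeping the permutation that sort returns. The following is only a minimal sketch of that idea; the names order, shuffled and nTest are illustrative and not part of the answer above.

```matlab
% Sketch: shuffle rows via the sort permutation instead of an extra column.
data = magic(4);                      % same dummy data as above
n = size(data, 1);
[~, order] = sort(rand(n, 1));        % order is a random permutation of 1:n
shuffled = data(order, :);            % whole rows move together

nTest = round(0.25 * n);              % 25% of the rows for testing
testDataset  = shuffled(1:nTest, :);
trainDataset = shuffled(nTest+1:end, :);
```

With 4 rows this gives 1 test row and 3 training rows, matching the 25%/75% split described above.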

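For N-fold cross-validation this is a bit harder, because we need to do it N times, so a for loop is needed. For 5 folds over 10 rows, I get the following:

1st fold: 1 2 for test, 3:10 for train
2nd fold: 3 4 for test, 1 2 5:10 for train
3rd fold: 5 6 for test, 1:4 7:10 for train
4th fold: 7 8 for test, 1:6 9:10 for train
5th fold: 9 10 for test, 1:8 for train

The following code is an example of this procedure: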

data = magic(10);
dataRowNumber = size(data,1);
labels = rand(dataRowNumber,1) > 0.5;      % random dummy labels
randomColumn = rand(dataRowNumber,1);

X = [randomColumn data labels];

% sortrows keeps each row intact; sort(X,1) would sort every column
% independently and break the pairing between data and labels
SortedData = sortrows(X,1);

crossValidationFolds = 5;
numberOfRowsPerFold = dataRowNumber / crossValidationFolds;

crossValidationTrainData = [];
crossValidationTestData = [];
for startOfRow = 1:numberOfRowsPerFold:dataRowNumber
    testRows = startOfRow:startOfRow+numberOfRowsPerFold-1;
    if (startOfRow == 1)
        trainRows = max(testRows)+1:dataRowNumber;
    else
        trainRows = [1:startOfRow-1 max(testRows)+1:dataRowNumber];
    end
    % the folds are stacked on top of each other: 8 train rows and
    % 2 test rows are appended per fold
    crossValidationTrainData = [crossValidationTrainData ; SortedData(trainRows ,:)];
    crossValidationTestData = [crossValidationTestData ; SortedData(testRows ,:)];
end
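
Because the loop above stacks all folds into two large matrices, picking out a single fold afterwards takes some index arithmetic. One possible alternative, shown here only as a sketch, is to keep each fold in its own cell. The names trainFolds, testFolds and rowsPerFold are illustrative assumptions, and the row count is assumed to be divisible by the number of folds.

```matlab
% Sketch: keep each fold separate in cell arrays (names are illustrative).
data = magic(10);
n = size(data, 1);
labels = rand(n, 1) > 0.5;
shuffled = sortrows([rand(n, 1) data labels], 1);   % shuffle whole rows
shuffled = shuffled(:, 2:end);                      % drop the random column

numFolds = 5;
rowsPerFold = n / numFolds;           % assumes n is divisible by numFolds
trainFolds = cell(numFolds, 1);
testFolds  = cell(numFolds, 1);
for f = 1:numFolds
    isTest = false(n, 1);             % logical mask marking this fold's test rows
    isTest((f-1)*rowsPerFold + 1 : f*rowsPerFold) = true;
    testFolds{f}  = shuffled(isTest, :);
    trainFolds{f} = shuffled(~isTest, :);
end
```

The test rows of the third fold are then testFolds{3}, with the labels in the last column.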

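【Comments】:

So if I have a data file, I can pass it in as X?
How can I use a random permutation to select the folds?
This works well, and I managed to put it in a function, [training_data, test_data] = diagFisher(dataFile, x), which at the end assigns training_data = crossValidationTrainData; test_data = crossValidationTestData; but how do I access these?
Could you explain the code inside the for loop? I got it working, but I do not fully understand it, mainly because I am new to MATLAB.
@Null-Hypothesis I have added more explanation. Please have a look and add a comment if anything is unclear.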

【Solution 2】:

Hahaha, sorry, no way. I don't have MATLAB available right now, so I can't check the code for errors. But the general idea is as follows:

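Generate k (in your case 10) subsamples:
1. Start two counters at 1 and preallocate the new matrix: index = 1; subsample = 1; newmat = zeros(150, 6)
2. While you still have data: while ( length(labels) > 0 )
3. Generate a random number within the amount of data left: randNum = randi(length(labels)). This is a random integer from 1 to the length of your labels array (randi(n) draws from 1:n, so no offset is needed).
4. Add that row, together with its label, to the new data set: newmat(index,:) = [data(randNum,:) labels(randNum) subsample]
5. Remove the row from data and labels: data(randNum,:) = []; labels(randNum) = []; (this is why the loop is a while on length(labels) > 0 rather than a for loop with simple indexing)
6. Increment the counters: index = index + 1; subsample = subsample + 1;
7. If subsample == 11, set it back to 1.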

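At the end, you should have one big data matrix that looks almost exactly like your original data, but with randomly assigned "fold labels".

Loop all of this, together with your execution code, k (10) times.

Edit: putting the code in a more accessible form. Note that it is still pseudocode and not complete! Also note that this is not at all the most efficient way, but if you can't use MATLAB functions it should not be too bad.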

for k = 1:10

    % work on copies so every repetition starts from the full data set
    remData = data;
    remLabels = labels;
    index = 1; subsample = 1;
    newmat = zeros(150, 6);   % 4 feature columns + label + fold number
    while ( length(remLabels) > 0 )
        randNum = randi(length(remLabels));
        newmat(index,:) = [remData(randNum,:) remLabels(randNum) subsample];
        remData(randNum,:) = [];      % remove the chosen row ...
        remLabels(randNum) = [];      % ... and its label
        index = index + 1; subsample = subsample + 1;
        if ( subsample == 11 )
            subsample = 1;
        end
    end

    % newmat is complete, now run code here using the sampled data
    % (i.e. pick a random number from 1:10 and use that as your validation fold, the rest for training)

end

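Edit, answer #2:

Another way is to create a vector that is as long as your data set:

foldLabels = zeros(150,1);

Then loop until all 150 entries are filled, assigning labels to random indices!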

foldL = 1;
numAssigned = 0;
while ( numAssigned < 150 )
    idx = randi(150);
    % no need to reassign a given label, so check if it is still 0
    if ( foldLabels(idx) == 0 )
        foldLabels(idx) = foldL;
        numAssigned = numAssigned + 1;
        foldL = foldL + 1;
        if ( foldL > 10 )
            foldL = 1;
        end
    end
end

Edit, answer #2.5:

foldLabels = zeros(150,1);
notChosenLabels = 1:150;      % indices that have not been given a fold yet
foldL = 1;
numAssigned = 0;
while ( length(notChosenLabels) > 0 )
    labIdx = randi(length(notChosenLabels));
    idx = notChosenLabels(labIdx);
    foldLabels(idx) = foldL;
    numAssigned = numAssigned + 1;
    foldL = foldL + 1;
    if ( foldL > 10 )
        foldL = 1;
    end
    notChosenLabels(labIdx) = [];   % this index is taken, drop it from the pool
end

Edit for randperm:

Use randperm to generate the indices:

idxs = randperm(150);

Now just assign:

foldLabels = zeros(150,1);
sampleLabel = 1;              % fold number to hand out next
for i = 1:150
    foldLabels(idxs(i)) = sampleLabel;
    sampleLabel = sampleLabel + 1;
    if ( sampleLabel > 10 )
       sampleLabel = 1;
    end
end
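
Once every row has a fold label, each fold can be split off with a logical mask. The following is a rough sketch under stated assumptions rather than part of the original answer: the data and labels are random stand-ins for the 150x4 and 150x1 arrays from the question, the names isTest and testCounts are hypothetical, and the training and evaluation step is left as a comment.

```matlab
% Sketch: turn fold labels into train/test splits (dummy data, hypothetical names).
n = 150; k = 10;
data = rand(n, 4);                    % stand-in for the 150x4 feature matrix
labels = rand(n, 1) > 0.5;            % stand-in for the 150x1 labels

idxs = randperm(n);
foldLabels = zeros(n, 1);
foldLabels(idxs) = mod(0:n-1, k) + 1; % same round-robin assignment, vectorized

testCounts = zeros(k, 1);
for fold = 1:k
    isTest = (foldLabels == fold);    % logical mask for this fold
    testData  = data(isTest, :);   testLabels  = labels(isTest);
    trainData = data(~isTest, :);  trainLabels = labels(~isTest);
    % train on trainData/trainLabels and evaluate on testData/testLabels here
    testCounts(fold) = size(testData, 1);
end
```

With 150 rows and 10 folds, each pass holds out 15 rows for testing and keeps 135 for training.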

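【Comments】:

Still working on this, but posting midway so people can point out mistakes and you can start asking questions.
Hmm, OK, I should have made this clearer. Basically, for convenience we are merging your two arrays together. One is 4 elements wide and the other is just 1, right? That makes 5 columns. We also reserve 1 more column for the label that later says which subsample the row belongs to. Clear?
Well, I just posted a simpler solution, but that line in answer #1 simply removes that row from the vector. The vector is physically shortened by 1 once that element is gone (think of [] as MATLAB's NULL).
@Null-Hypothesis Note that #2 may perform worse than #1 because of the random index lookups. A better approach is therefore to build another vector holding the indices you have not chosen yet and remove the ones you have picked from it.
If you can use randperm, the last bit at the end of my answer generates the vector of fold # labels for your data.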