【Title】: Splitting a dataset into training and test datasets given a ratio
【Posted】: 2022-12-24 09:29:33
【Question】:

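For a school project, I need to split a dataset into a training set and a test set given a ratio. The ratio is the fraction of the data used for training; the rest is used for testing. I wrote a basic implementation following my professor's requirements, but I can't get it to pass the tests he created. Below is my implementation, along with what the parameters and return values represent.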

def splitData(X, y, split_ratio = 0.8):
    '''
    X: numpy.ndarray. Shape = [n+1, m]
    y: numpy.ndarray. Shape = [m, ]
    split_ratio: the ratio of examples go into the Training, Validation, and Test sets.
    Split the whole dataset into Training, Validation, and Test sets.
    :return: return (training_X, training_y), (test_X, test_y).
            training_X is a (n+1, m_tr) matrix with m_tr training examples;
            training_y is a (m_tr, ) column vector;
            test_X is a (n+1, m_test) matrix with m_test test examples;
            test_y is a (m_test, ) column vector.
    '''
    ## Need to possible shuffle X array and Y array

    ## amount used for training
    m_tr = len(X) * train_ratio

    ##m_test = len(X) - m_tr Amount that is used for testing

    training_X = X[1:m_tr]
    training_y = y[1:m_tr]
    test_X = [m_tr:len(X)]
    test_y = [m_tr:len(y)]
    return training_X, training_y, test_X, test_y

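I included the commented-out m_test declaration because of the instructions, but I'm fairly sure that slicing the arrays from the first element up to m_tr gives the training portion, and that the remainder is the test data. The test data is obtained by taking each array from m_tr to len(X) or len(y). Am I misunderstanding how splitting works?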

PS - The professor said we can skip the validation split.

【Comments】:

    Tags: python numpy machine-learning


    【Answer 1】:

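    There are 3 main problems:

    1. The docstring specifies that examples are stored along columns, so you need to slice columns, not rows.
    2. You should return 2 pairs, not a tuple of length 4.
    3. For some reason you drop the 0th sample by slicing with "1:" instead of "0:".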
      def splitData(X, y, split_ratio = 0.8):
          '''
          X: numpy.ndarray. Shape = [n+1, m]
          y: numpy.ndarray. Shape = [m, ]
          split_ratio: the ratio of examples go into the Training, Validation, and Test sets.
          Split the whole dataset into Training, Validation, and Test sets.
          :return: return (training_X, training_y), (test_X, test_y).
                  training_X is a (n+1, m_tr) matrix with m_tr training examples;
                  training_y is a (m_tr, ) column vector;
                  test_X is a (n+1, m_test) matrix with m_test test examples;
                  test_y is a (m_test, ) column vector.
          '''
          # Examples are stored along columns, so count columns;
          # len(X) would return the number of rows (n+1).
          m_tr = int(X.shape[1] * split_ratio)
          training_X = X[:, :m_tr]
          training_y = y[:m_tr]
          test_X = X[:, m_tr:]
          test_y = y[m_tr:]
          return (training_X, training_y), (test_X, test_y)
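
      As a quick sanity check of the shapes, here is a minimal sketch that assumes X holds one example per column, as the docstring states. The helper name split_columns and the toy data are made up for illustration. Note that len(X) returns the number of rows (n+1), so the sketch scales the column count X.shape[1] by the ratio instead.

```python
import numpy as np

def split_columns(X, y, split_ratio=0.8):
    """Hypothetical helper: column-wise split for X of shape (n+1, m), y of shape (m,)."""
    m = X.shape[1]                 # number of examples (columns)
    m_tr = int(m * split_ratio)    # slice indices must be integers
    return (X[:, :m_tr], y[:m_tr]), (X[:, m_tr:], y[m_tr:])

# Toy data (assumed for illustration): 3 feature rows, 10 example columns.
X = np.arange(30).reshape(3, 10)
y = np.arange(10)

(train_X, train_y), (test_X, test_y) = split_columns(X, y, 0.8)

print(len(X), X.shape[1])             # rows vs. columns
print(train_X.shape, train_y.shape)   # training shapes
print(test_X.shape, test_y.shape)     # test shapes
```

      With 10 examples and a ratio of 0.8, the training part should have 8 columns and the test part 2, and no example is dropped because the slices start at index 0.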
      

    【Comments】:

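      【Answer 2】:
      1. The function parameter is named split_ratio, but the function body uses train_ratio.
      2. m_tr is the length of the data multiplied by split_ratio, and the result of that operation can be a float. The slices you use to split the data only accept integers.
      3. For test_X and test_y, you did not put the array name in front of the slice.
      4. For training_X and training_y, the slice starts at the second element because you wrote 1 instead of 0, so you lose the first data element.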
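
      Point 2 can be checked in isolation. The snippet below is a minimal sketch on a made-up 1-D array (the names data and float_slice_ok are illustrative): it shows that a float slice bound is rejected with a TypeError and that wrapping the product in int() makes the slice valid.

```python
import numpy as np

data = np.arange(10)          # toy array, assumed for illustration
m_tr = len(data) * 0.8        # the product is a float, not an int

try:
    data[:m_tr]               # a float slice bound is rejected
    float_slice_ok = True
except TypeError:
    float_slice_ok = False

train = data[:int(m_tr)]      # converting to int makes the slice valid
print(float_slice_ok, train)
```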

        Here is the corrected version:

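        def splitData(X, y, split_ratio = 0.8):
            # X has shape (n+1, m): examples are columns, so scale the
            # column count; len(X) would return the number of rows.
            m_tr = int(X.shape[1] * split_ratio)
            training_X = X[:, :m_tr]
            training_y = y[:m_tr]
            test_X = X[:, m_tr:]
            test_y = y[m_tr:]
            return (training_X, training_y), (test_X, test_y)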
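
        The question's code comment mentions possibly shuffling X and y before splitting. Below is a minimal sketch of one way that could be done, assuming examples are stored along columns. The function name shuffled_split and the seed parameter are hypothetical and not part of the assignment; whether the professor's tests expect shuffling is not stated in the question.

```python
import numpy as np

def shuffled_split(X, y, split_ratio=0.8, seed=0):
    """Hypothetical variant: shuffle the examples, then split column-wise."""
    m = X.shape[1]                      # number of examples (columns)
    rng = np.random.default_rng(seed)   # seeded for reproducibility
    perm = rng.permutation(m)           # one permutation shared by X and y
    X_shuf, y_shuf = X[:, perm], y[perm]
    m_tr = int(m * split_ratio)
    return (X_shuf[:, :m_tr], y_shuf[:m_tr]), (X_shuf[:, m_tr:], y_shuf[m_tr:])

# Toy data (assumed): row 1 of X is the example index, and y is twice that
# index, so the pairing between columns and labels can be verified afterwards.
X = np.vstack([np.ones(10), np.arange(10)])
y = np.arange(10) * 2

(train_X, train_y), (test_X, test_y) = shuffled_split(X, y)
print(train_X.shape, test_X.shape)
```

        Applying the same permutation to the columns of X and to y keeps each example paired with its label, which is the property that matters when shuffling.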
        

      【Comments】:
