【Question title】: How to transform test data to select the same features as the training dataset
【Posted】: 2020-05-10 03:23:03
【Question】:

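In short, I am trying to apply the same feature selection to the test data that I applied to the training set, but the test set does not have exactly the same shape.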

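from sklearn.feature_selection import SelectPercentile, chi2


def get_important_features(X_train, Y_train, X_test):
    '''
    Keep the top 75% of features, ranked by chi2 score on the training set.

    :param X_train: features of training set of type scipy.sparse.csr_matrix
    :param Y_train: labels of training set (1-D array, as returned by load_svmlight_file)
    :param X_test: features of test set of type scipy.sparse.csr_matrix
    :return: (X_new_train, X_new_test) reduced to the selected columns
    '''
    select_percentile = SelectPercentile(chi2, percentile=75)

    print(X_train.shape)
    print(X_test.shape)
    # Scores are computed on the training set only.
    X_new_train = select_percentile.fit_transform(X_train, Y_train)
    #print(select_percentile.get_support(indices=True))
    # Raises ValueError if X_test does not have as many columns as X_train.
    X_new_test = select_percentile.transform(X_test)
    return X_new_train, X_new_test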

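The training set has shape (836, 3188) and the test set has shape (633, 3187). As you can see, the test set does not have the same shape as the training set, but I only care about selecting the features that exist in the training set after applying chi2. Because of the shape mismatch, X_new_test = select_percentile.transform(X_test) raises ValueError: X has a different shape than during fitting. Is there a way to extract these features from X_test without using transform(X_test)?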

Note: the input is a csr matrix, not a DataFrame. I load the values from files in libsvm format:

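from sklearn.datasets import load_svmlight_file

# load_svmlight_file returns a tuple: (csr matrix of features, array of labels)
train = load_svmlight_file(train_file_name)
X_train = train[0]
Y_train = train[1]
test = load_svmlight_file(test_file_name)
X_test = test[0]
Y_test = test[1]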
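
One possible way to avoid the mismatch at load time: `load_svmlight_file` accepts an `n_features` argument, so the test file can be loaded with the training width. The sketch below uses two tiny made-up files (`train.svm` and `test.svm` are illustrative names, not the asker's data) and assumes the test file only lacks trailing columns. If the test file contained a feature index larger than the training width, `n_features` would raise an error instead.

```python
import os
import tempfile

from sklearn.datasets import load_svmlight_file

# Two tiny libsvm files, made up for illustration. Indices are 1-based here.
train_text = "1 1:2 3:1 4:5\n0 2:3 4:1\n"   # highest index 4 -> 4 columns
test_text = "1 1:1 2:2\n0 3:4\n"            # highest index 3 -> 3 columns

with tempfile.TemporaryDirectory() as tmp:
    train_path = os.path.join(tmp, "train.svm")
    test_path = os.path.join(tmp, "test.svm")
    with open(train_path, "w") as f:
        f.write(train_text)
    with open(test_path, "w") as f:
        f.write(test_text)

    X_train, Y_train = load_svmlight_file(train_path)
    # Without n_features, the width is inferred from the file itself.
    X_test_inferred, _ = load_svmlight_file(test_path)
    # Forcing the training width fills the missing trailing columns with zeros.
    X_test, Y_test = load_svmlight_file(test_path, n_features=X_train.shape[1])

print(X_train.shape, X_test_inferred.shape, X_test.shape)
```

With the widths aligned this way, `select_percentile.transform(X_test)` no longer sees a different number of columns.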

【Comments on the question】:

    Tags: python-3.x scikit-learn scipy


    【Answer 1】:

    I tried your function and it works. Make sure you are passing the data in correctly. Here is a minimal example for reference:

    import numpy as np
    import pandas as pd
    from sklearn.feature_selection import SelectPercentile
    from sklearn.feature_selection import chi2
    
    # dummy data
    train = pd.DataFrame(np.random.randint(1000, size=(50, 10)), columns=['A'+str(x) for x in range(10)])
    test = pd.DataFrame(np.random.randint(1000, size=(30, 9)), columns=['A'+str(x) for x in range(9)])
    
    # assuming the last column is the target variable;
    # dropping it leaves 9 feature columns in train, the same as in test
    X_new_train,  X_new_test = get_important_features(train.iloc[:,:-1], train.iloc[:,-1], test)
    
    print(X_new_train.shape,  X_new_test.shape)
    (50, 6) (30, 6)
    

    【Comments】:

    • The input is a sparse matrix, not a DataFrame, which is why I am having trouble manipulating it.
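
For sparse input that is already loaded, a minimal sketch of one workaround is to pad the test matrix with all-zero columns up to the training width before calling `transform`. The helper `match_width` is a hypothetical name introduced here, not part of scikit-learn or SciPy, and the random matrices only stand in for the real data. It assumes that column i means the same feature in both matrices and that the test set only lacks trailing columns.

```python
import numpy as np
from scipy import sparse
from sklearn.feature_selection import SelectPercentile, chi2


def match_width(X, n_features):
    """Hypothetical helper: pad X with zero columns, or drop extra trailing ones."""
    X = sparse.csr_matrix(X)
    n_rows, n_cols = X.shape
    if n_cols < n_features:
        # An empty sparse block stores nothing, so padding is cheap.
        padding = sparse.csr_matrix((n_rows, n_features - n_cols), dtype=X.dtype)
        X = sparse.hstack([X, padding], format="csr")
    elif n_cols > n_features:
        # Columns never seen during fitting cannot be selected anyway.
        X = X[:, :n_features]
    return X


# Stand-in data: the training set has 8 columns, the test set only 7.
rng = np.random.default_rng(0)
X_train = sparse.csr_matrix(rng.integers(0, 5, size=(20, 8)).astype(float))
Y_train = rng.integers(0, 2, size=20)
Y_train[:2] = [0, 1]  # make sure both classes are present
X_test = sparse.csr_matrix(rng.integers(0, 5, size=(10, 7)).astype(float))

selector = SelectPercentile(chi2, percentile=75)
X_new_train = selector.fit_transform(X_train, Y_train)
X_new_test = selector.transform(match_width(X_test, X_train.shape[1]))

print(X_new_train.shape, X_new_test.shape)
```

Padding is used instead of slicing `X_test` with `get_support(indices=True)` because a selected index may point at a column the narrower test matrix does not have.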