Adding covariates to a classification task in scikit-learn

【Question title】: Add covariates to classification task in scikit-learn
【Posted】: 2020-06-21 10:31:30
【Question】:

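For my project, I want to build a classifier that predicts the class of my subjects (patient vs. healthy control) from a feature set of voxel values in structural MRI data. I am using sklearn.linear_model.LogisticRegression as the classifier. Because age and gender affect voxel intensities in sMRI data, I want to include them as covariates in my classification task. How can I do this in scikit-learn? Do I simply add them to my feature set? If so, how do I handle the different scales of the covariates (age is continuous, gender is categorical)?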

Here is a simple dummy example:

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(42)

# dummy feature set (columns represent voxels)
X = np.array([[1,2,3],[4,5,6],[7,8,9],[10,11,12]])

# dummy labels (1 = patients, 0= healthy controls)
y = np.array([1,0,1,0])

# dummy covariates (age and gender) - These should be included in my classification task
age = np.array([18,25,31,55])
gender = np.array([1,1,0,0])

# z-standardize features
scaler = StandardScaler()
X = scaler.fit_transform(X)

# classification task
lr = LogisticRegression(random_state=rng)
lr.fit(X, y)
predictions = lr.predict(X)

This post may be related to an earlier one.

【Comments】:

Tags: scikit-learn

【Answer 1】:

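For my neuroimaging prediction models, I usually build two models: one containing only the data of interest, and another that also includes age and so on. If performance does not change significantly between the two, then age etc. contribute no additional predictive power.

Of course, you should use a cross-validation scheme for this kind of problem.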

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(42)

# dummy feature set (columns represent voxels)
X = np.array([[1,2,3],[4,5,6],[7,8,9],[10,11,12]])

# dummy labels (1 = patients, 0= healthy controls)
y = np.array([1,0,1,0])

# dummy covariates (age and gender) - These should be included in my classification task
age = np.array([18,25,31,55])
gender = np.array([1,1,0,0])

Xfull = np.concatenate([X,age.reshape(-1,1),gender.reshape(-1,1)], axis = 1)

# z-standardize features
scaler = StandardScaler()
X = scaler.fit_transform(X)

# z-standardize features with covariates 
scaler2 = StandardScaler()
Xfull = scaler2.fit_transform(Xfull)


# classification task - model 1
lr1 = LogisticRegression(random_state=rng)
lr1.fit(X, y)
print("Score using only voxel data: {}".format(lr1.score(X,y)))

# classification task - model 2
lr2 = LogisticRegression(random_state=rng)
lr2.fit(Xfull, y)
print("Score using voxel data & covariates: {}".format(lr2.score(Xfull,y)))
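
The snippet above scores each model on the same four samples it was trained on. The cross-validation scheme recommended in this answer could be sketched roughly as follows. This is only an illustration under stated assumptions: the data are synthetic random numbers (40 hypothetical subjects, not from the question), the helper name cv_accuracy is made up for this example, and the scaler is placed inside a pipeline so that it is refit on the training folds only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(42)
n_subjects = 40  # assumption: synthetic sample size, large enough for 5 folds

# synthetic stand-ins for the voxel features, covariates and labels
X = rng.normal(size=(n_subjects, 3))
age = rng.randint(18, 80, size=n_subjects)
gender = rng.randint(0, 2, size=n_subjects)
y = np.repeat([0, 1], n_subjects // 2)

# voxel features plus covariates as extra columns
Xfull = np.column_stack([X, age, gender])

# the same folds are used for both models so the scores are comparable
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)


def cv_accuracy(features, labels):
    """Return per-fold accuracy; scaling is fit on each training fold only."""
    model = make_pipeline(StandardScaler(), LogisticRegression())
    return cross_val_score(model, features, labels, cv=cv)


scores_voxels = cv_accuracy(X, y)
scores_full = cv_accuracy(Xfull, y)

print("Mean CV accuracy, voxels only: {:.2f}".format(scores_voxels.mean()))
print("Mean CV accuracy, voxels & covariates: {:.2f}".format(scores_full.mean()))
```

Because the labels here are unrelated to the random features, both scores should hover around chance; with real data, the gap between the two means is the quantity of interest.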

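【Comments】:

Thanks for your answer. So that means I simply add them to my feature set. I don't have to treat age and gender differently from my voxel values, even though gender will be the only categorical feature in my data set? I thought I would have to treat gender differently when standardizing my data.
In any case, all features should go through the same standardization. Before that, you may need to convert the categorical variables first.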
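
One possible reading of that last comment, as a minimal sketch: encode the categorical covariate first, then apply a single StandardScaler to all columns. The string labels "m" and "f" are an assumption made for illustration (the question already stores gender as 0/1, in which case the encoding step is unnecessary), and OneHotEncoder with drop="if_binary" is just one way to do the conversion.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# dummy voxel features and age, as in the question
voxels = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]], dtype=float)
age = np.array([18, 25, 31, 55], dtype=float)

# assumption: gender arrives as string labels rather than 0/1
gender = np.array(["m", "m", "f", "f"])

# step 1: convert the categorical variable to a single 0/1 column
# (drop="if_binary" keeps one column for a two-category feature)
encoder = OneHotEncoder(drop="if_binary")
gender_encoded = encoder.fit_transform(gender.reshape(-1, 1)).toarray()

# step 2: stack everything into one feature matrix
X_full = np.hstack([voxels, age.reshape(-1, 1), gender_encoded])

# step 3: apply the same z-standardization to every column
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_full)

print(X_scaled.shape)
```

With more than two categories, the encoder would produce several indicator columns, and each of them would be standardized in the same way.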