Array-like input for Sklearn LogisticRegressionCV

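【Question title】: Array-like input for Sklearn LogisticRegressionCV
【Posted】: 2017-06-28 08:40:05
【Question】:

Originally I read the data from a .csv file, but here I build the dataframe from lists so that the problem can be reproduced. The goal is to train a logistic regression model with cross-validation using LogisticRegressionCV.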

indeps = ['M', 'F', 'M', 'F', 'M', 'M', 'F', 'M', 'M', 'F', 'F', 'F', 'F', 'F', 'M', 'F', 'F', 'F', 'F', 'F', 'M', 'F', 'F', 'M', 'M', 'F', 'F', 'F', 'M', 'F', 'F', 'F', 'M', 'F', 'M', 'F', 'F', 'F', 'M', 'M', 'M', 'F', 'M', 'M', 'M', 'F', 'M', 'M', 'F', 'F']
dep = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

import pandas as pd  # needed for everything from here on

data = [indeps, dep]
cols = ['state', 'cat_bins']

# Map each column name to its list of values
data_dict = dict(zip(cols, data))

df = pd.DataFrame.from_dict(data_dict)
df.tail()

    cat_bins    state
45  0.0           F
46  0.0           M
47  0.0           M
48  0.0           F
49  0.0           F


'''Use pandas to one-hot encode the independent variables. Notice that
 we are returning a sparse dataframe'''

def heat_it2(dataframe, lst_of_columns):
    dataframe_hot = pd.get_dummies(dataframe,
                                   prefix = lst_of_columns,
                                   columns = lst_of_columns, sparse=True,)
    return dataframe_hot

train_set_hot = heat_it2(df, ['state'])
train_set_hot.head(2)

    cat_bins    state_F     state_M
0     1.0         0            1
1     1.0         1            0

'''Use the dataframe to set up the prospective inputs to the model as numpy arrays'''

indeps_hot = ['state_F', 'state_M']

X = train_set_hot[indeps_hot].values
y = train_set_hot['cat_bins'].values

print 'X-type:', X.shape, type(X)
print 'y-type:', y.shape, type(y)
print 'X has shape, is an array and has length:\n', hasattr(X, 'shape'), hasattr(X, '__array__'), hasattr(X, '__len__')
print 'yhas shape, is an array and has length:\n', hasattr(y, 'shape'), hasattr(y, '__array__'), hasattr(y, '__len__')
print 'X does have attribute fit:\n',hasattr(X, 'fit')
print 'y does have attribute fit:\n',hasattr(y, 'fit')

X-type: (50, 2) <type 'numpy.ndarray'>
y-type: (50,) <type 'numpy.ndarray'>
X has shape, is an array and has length:
True True True
yhas shape, is an array and has length:
True True True
X does have attribute fit:
False
y does have attribute fit:
False

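So the inputs appear to have the properties that the estimator's .fit method requires: they are numpy arrays of the right shape. X is an array of shape [n_samples, n_features] and y is a vector of shape [n_samples,]. Here is the documentation: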

fit(X, y, sample_weight=None)

Fit the model according to the given training data.

Parameters:

X : {array-like, sparse matrix}, shape (n_samples, n_features)
    Training vector, where n_samples is the number of samples and n_features is the number of features.

y : array-like, shape (n_samples,)
    Target vector relative to X.

....

Now we try to fit the estimator:

from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import accuracy_score

logmodel = LogisticRegressionCV(Cs=1, dual=False, scoring=accuracy_score, penalty='l2')
logmodel.fit(X, y)

...

    TypeError: Expected sequence or array-like, got estimator LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
    intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
    penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
    verbose=0, warm_start=False)

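The error message appears to originate in scikit-learn's validation.py module.

The only part of the code that raises this error message is the following function snippet: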

def _num_samples(x):
    """Return number of samples in array-like x."""
    if hasattr(x, 'fit'):
        # Don't get num_samples from an ensembles length!
        raise TypeError('Expected sequence or array-like, got '
                        'estimator %s' % x)
    etc.

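Question: since the arguments we use to fit the model (X and y) do not have a 'fit' attribute, why does this error message appear?

Using Python 2.7 on Canopy 1.7.4.3348 (64-bit) with scikit-learn 18.01-3 and pandas 0.19.2-2.

Thanks for your help :)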

【Question comments】:

Tags: python pandas scikit-learn logistic-regression

【Solution 1】:

The problem appears to be the scoring parameter. You passed accuracy_score, whose signature is accuracy_score(y_true, y_pred[, ...]). But in the module logistic.py:

if isinstance(scoring, six.string_types):
    scoring = SCORERS[scoring]
for w in coefs:
    # Other code
    if scoring is None:
        scores.append(log_reg.score(X_test, y_test))
    else:
        scores.append(scoring(log_reg, X_test, y_test))

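Since you passed accuracy_score itself, which is a function and not a string, the first line above does not apply to it. The call scores.append(scoring(log_reg, X_test, y_test)) is then used to score the estimator. But as noted above, these arguments do not match what accuracy_score expects: the estimator log_reg lands in the y_true position, which is presumably how it reaches the hasattr(x, 'fit') check in _num_samples. Hence the error.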
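To make the mismatch concrete, here is a minimal pure-Python sketch of the two calling conventions. Everything in it is a hypothetical stand-in: num_samples is a simplified imitation of the check quoted in the question, and toy_accuracy, MajorityClassifier and as_scorer are illustrative names, not scikit-learn APIs.

```python
def num_samples(x):
    """Simplified stand-in for the length check quoted in the question."""
    if hasattr(x, 'fit'):
        raise TypeError('Expected sequence or array-like, got '
                        'estimator %s' % x)
    return len(x)


def toy_accuracy(y_true, y_pred, normalize=True):
    """Metric convention: called as metric(y_true, y_pred)."""
    if num_samples(y_true) != num_samples(y_pred):
        raise ValueError('inconsistent lengths')
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return hits / float(len(y_true)) if normalize else hits


class MajorityClassifier(object):
    """Toy estimator that always predicts the most common training label."""

    def fit(self, X, y):
        self.label_ = max(set(y), key=list(y).count)
        return self

    def predict(self, X):
        return [self.label_ for _ in X]


def as_scorer(metric):
    """Adapt a (y_true, y_pred) metric to the scorer(estimator, X, y) convention."""
    def scorer(estimator, X, y):
        return metric(y, estimator.predict(X))
    return scorer


X = [[0], [1], [0], [1]]
y = [1, 1, 1, 0]
clf = MajorityClassifier().fit(X, y)

# Calling the bare metric the way a scorer is called puts the estimator
# in the y_true slot, so the 'fit' check fires.
try:
    toy_accuracy(clf, X, y)
    error_message = None
except TypeError as exc:
    error_message = str(exc)

# Wrapping the metric first gives it the calling convention the loop expects.
score = as_scorer(toy_accuracy)(clf, X, y)
print(error_message)
print(score)
```

The wrapper plays the role that make_scorer plays in the fix below: it calls predict on the estimator and only then hands plain label sequences to the metric.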

Solution: in LogisticRegressionCV, use make_scorer(accuracy_score) for scoring, or pass the string 'accuracy' directly.

from sklearn.metrics import accuracy_score, make_scorer

logmodel = LogisticRegressionCV(Cs=1, dual=False,
                                scoring=make_scorer(accuracy_score),
                                penalty='l2')

                         OR

logmodel = LogisticRegressionCV(Cs=1, dual=False,
                                scoring='accuracy',
                                penalty='l2')
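As a rough end-to-end check, the following sketch fits LogisticRegressionCV with both scoring variants. The data is synthetic and is an assumption made for illustration, not the question's data; Cs=3 and cv=3 are arbitrary choices, and fit_with is a hypothetical helper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import accuracy_score, make_scorer

# Synthetic stand-in data (assumption): two numeric features, binary target.
rng = np.random.RandomState(0)
X = rng.randn(60, 2)
y = (X[:, 0] + 0.5 * rng.randn(60) > 0).astype(int)


def fit_with(scoring):
    """Fit a cross-validated logistic regression with the given scoring."""
    model = LogisticRegressionCV(Cs=3, cv=3, scoring=scoring)
    model.fit(X, y)
    return model


by_name = fit_with('accuracy')                     # scorer looked up by name
by_scorer = fit_with(make_scorer(accuracy_score))  # metric wrapped explicitly

# Both variants should score candidates the same way and so pick the same model.
same_predictions = np.array_equal(by_name.predict(X), by_scorer.predict(X))
print(same_predictions)
```

Either form gives LogisticRegressionCV a callable with the (estimator, X, y) convention, so the bare metric never receives the estimator.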

Note:

This may be a bug in the logistic.py module or in the LogisticRegressionCV documentation; they should have clarified the signature of the scoring function.

You can submit an issue on GitHub and see how it goes.

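【Comments】:

Thanks, both of your suggestions avoid the error. Could you tell me which part of the source code the error message comes from?
The source of the error is the same one you pointed out in the question. It occurs because the scoring function is given incorrect arguments. Where those incorrect arguments are supplied is shown in the first code snippet of my answer.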