Why does my own implementation of logistic regression differ from sklearn? Answers

【Question title】: Why does my own implementation of logistic regression differ from sklearn?
【Posted】: 2021-06-15 17:45:09
【Question】:

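I am trying to implement logistic regression from scratch in Python for a binary classification problem. My results do not match those produced by the sklearn implementation, as you can see in the example. Note that the decision boundaries look "similar", but they are clearly not the same.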

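I took care of the points mentioned in the answer: both sklearn and I (i) fit an intercept term, and (ii) apply no regularization (penalty='none'). Also, while sklearn trains for 100 iterations (by default), I apply 10000 iterations with a fairly small learning rate of 0.01. I tried different combinations of values, but the problem does not seem to be related to this.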

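At the same time, I did notice that, even before comparing against sklearn, the results I get from my implementation seem wrong: in some cases the decision regions are clearly off. You can see an example in the image.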

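This last point seems to indicate that the problem is entirely my own fault. Here is my code (it generates new datasets on every run and plots the results):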

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

def create_training_set():
    X0, y = make_blobs(n_samples=[100, 100],
                   centers=None,
                   n_features=2,
                   cluster_std=1)
    y = y.reshape(-1, 1) # make y a column vector
    return np.hstack([np.ones((X0.shape[0], 1)), X0]), X0, y

def create_test_set(X0):
    xx, yy = np.meshgrid(np.arange(X0[:, 0].min() - 1, X0[:, 0].max() + 1, 0.1),
                         np.arange(X0[:, 1].min() - 1, X0[:, 1].max() + 1, 0.1))
    X_test = np.c_[xx.ravel(), yy.ravel()]
    X_test = np.hstack([np.ones((X_test.shape[0], 1)), X_test])
    return xx, yy, X_test

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

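def apply_gradient_descent(theta, X, y, max_iter=1000, alpha=0.1):
    # Batch gradient descent on the unregularized log loss; theta is updated in place.
    m = X.shape[0]
    cost_iter = []
    for _ in range(max_iter):
        p_hat = sigmoid(np.dot(X, theta))  # predicted probabilities, shape (m, 1)
        cost_J = -1/float(m) * (np.dot(y.T, np.log(p_hat)) + np.dot((1 - y).T, np.log(1 - p_hat)))
        grad_J = 1/float(m) * np.dot(X.T, p_hat - y)  # gradient of the log loss
        theta -= alpha * grad_J
        cost_iter.append(cost_J.item())  # cost_J is a 1x1 array; store it as a scalar
    return theta, cost_iter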

fig, ax = plt.subplots(10, 2, figsize = (10, 30))
cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF'])
cmap_bold = ListedColormap(['#FF0000', '#00FF00', '#0000FF'])
max_iter = 10000
alpha = 0.1
all_cost_history = []
for n_fil in range(10):
    X_train, X0, y = create_training_set()
    xx, yy, X_test = create_test_set(X0)
    
    theta, cost_evolution = apply_gradient_descent(np.zeros((X_train.shape[1], 1)), X_train, y, max_iter, alpha)   
    all_cost_history.append(cost_evolution)
    
    y_pred = np.where(sigmoid(np.dot(X_test, theta)) > 0.5, 1, 0)
    y_pred = y_pred.reshape(xx.shape)
    ax[n_fil, 0].pcolormesh(xx, yy, y_pred, cmap = cmap_light)
    ax[n_fil, 0].scatter(X0[:, 0], X0[:, 1], c=y.ravel(), cmap=cmap_bold, alpha = 1, edgecolor="black")
    
    y = y.reshape(X_train.shape[0], )
    clf = LogisticRegression().fit(X0, y)
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    ax[n_fil, 1].pcolormesh(xx, yy, Z, cmap = cmap_light)
    ax[n_fil, 1].scatter(X0[:, 0], X0[:, 1], c=y, cmap=cmap_bold, alpha = 1, edgecolor="black")
plt.show()

【Question comments】:

Tags: python machine-learning scikit-learn logistic-regression

【Solution 1】:

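There is indeed a difference between your implementation and sklearn's: you are not using the same optimization algorithm (called a solver in sklearn), and I think the differences you observe come from there. You are using gradient descent, whereas sklearn's implementation uses a different solver by default: "liblinear" in versions before 0.22, "lbfgs" from 0.22 onward.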

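Indeed, different optimization algorithms can produce different results, for example because of:

Convergence speed: since we limit the number of iterations, a slower-converging algorithm will stop at a different point, which yields different decision regions.
Whether the algorithm is deterministic: a non-deterministic algorithm (e.g. stochastic gradient descent) can converge to different local minima given the same dataset. With a non-deterministic algorithm, you can observe different results using exactly the same dataset and algorithm.
Hyperparameters: changing hyperparameters (e.g. the learning rate of the gradient descent algorithm) also changes the behavior of the optimization algorithm, leading to different results.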
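
To make the convergence-speed and hyperparameter points concrete, here is a minimal sketch, not the asker's code: it runs plain batch gradient descent on a small, overlapping toy dataset (a hypothetical one made up for illustration) with a few combinations of learning rate and iteration count. The function names and the data are assumptions of this sketch.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent(X, y, alpha, max_iter):
    """Plain batch gradient descent on the unregularized log loss."""
    theta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        grad = X.T @ (sigmoid(X @ theta) - y) / X.shape[0]
        theta -= alpha * grad
    return theta

def grad_norm(X, y, theta):
    """Norm of the log-loss gradient; it is close to zero only near the minimum."""
    return np.linalg.norm(X.T @ (sigmoid(X @ theta) - y) / X.shape[0])

# Hypothetical overlapping (non-separable) 1-D toy data, not the question's blobs.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-1.0, 1.5, 100), rng.normal(1.0, 1.5, 100)])
y = np.concatenate([np.zeros(100), np.ones(100)])
X = np.column_stack([np.ones_like(x), x])  # intercept column + one feature

results = {}
for alpha, max_iter in [(0.01, 100), (0.01, 10000), (0.5, 10000)]:
    theta = gradient_descent(X, y, alpha, max_iter)
    results[(alpha, max_iter)] = (theta, grad_norm(X, y, theta))
    print(alpha, max_iter, theta, results[(alpha, max_iter)][1])
```

A run that stops early still has a noticeably non-zero gradient, so its coefficients (and hence its decision boundary) differ from those of a run that has effectively converged. Printing the gradient norm is a cheap way to tell the two situations apart.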

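In your case, there are good reasons not to always get the same result: the gradient descent algorithm you use may get stuck in a local minimum (because of too few iterations, a non-optimal learning rate...) that may differ from the one reached by sklearn's solver.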

If you compare sklearn's implementation with two different solvers (reusing your code), you will see the same kind of differences:


fig, ax = plt.subplots(10, 2, figsize=(10, 30))
cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF'])
cmap_bold = ListedColormap(['#FF0000', '#00FF00', '#0000FF'])
max_iter = 10000
alpha = 0.1
solver_algo_1 = 'liblinear'
solver_algo_2 = 'sag'

for n_fil in range(10):
    X_train, X0, y = create_training_set()
    xx, yy, X_test = create_test_set(X0)

    y = y.reshape(X_train.shape[0], )

    clf = LogisticRegression(solver=solver_algo_1, max_iter=max_iter).fit(X0, y)
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    ax[n_fil, 0].pcolormesh(xx, yy, Z, cmap=cmap_light)
    ax[n_fil, 0].scatter(X0[:, 0], X0[:, 1], c=y, cmap=cmap_bold, alpha=1, edgecolor="black")

    clf = LogisticRegression(solver=solver_algo_2, max_iter=max_iter).fit(X0, y)
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    ax[n_fil, 1].pcolormesh(xx, yy, Z, cmap=cmap_light)
    ax[n_fil, 1].scatter(X0[:, 0], X0[:, 1], c=y, cmap=cmap_bold, alpha=1, edgecolor="black")
plt.show()

For example, with "liblinear" (left) and "newton-cg" (right), you can get:

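Although the logistic regression model is the same, the different optimization algorithms lead to different results. So in short, the difference between your implementation and scikit-learn's is the optimization algorithm.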

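Now, if the quality of the decision boundary you get is not satisfactory, you can try tuning the hyperparameters of your gradient descent algorithm, or try changing the optimization algorithm!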

【Comments】:

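Hi, thanks for your answer, much appreciated. I feel the solver can play a role here. However, with regularization turned off, the loss function is a plain log loss, which is convex. So although some solvers may be more efficient than others, my feeling is that they should all converge to the same unique, global minimum (at least the deterministic ones, so I believe all solvers except "sag" and "saga"). Anyway, I think your answer is a key point to consider here. By the way, the default solver is "lbfgs", at least in version 0.24.1. Thanks!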
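
The convexity argument in this comment can be checked numerically. The sketch below is an illustration under stated assumptions, not sklearn's code: it fits the same unregularized log loss with two different deterministic optimizers, batch gradient descent and a textbook Newton's method, on a hypothetical overlapping 2-D dataset. The helper names and the data are made up for this example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss_grad_hess(X, y, theta):
    """Gradient and Hessian of the mean unregularized log loss."""
    p = sigmoid(X @ theta)
    m = X.shape[0]
    grad = X.T @ (p - y) / m
    hess = (X.T * (p * (1 - p))) @ X / m  # X^T diag(p(1-p)) X / m
    return grad, hess

def fit_gradient_descent(X, y, alpha=0.5, max_iter=20000):
    theta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        grad, _ = log_loss_grad_hess(X, y, theta)
        theta -= alpha * grad
    return theta

def fit_newton(X, y, max_iter=25):
    theta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        grad, hess = log_loss_grad_hess(X, y, theta)
        theta -= np.linalg.solve(hess, grad)  # full Newton step
    return theta

# Hypothetical overlapping (non-separable) 2-D toy data.
rng = np.random.default_rng(1)
X0 = np.vstack([rng.normal([-1.0, -1.0], 1.5, size=(100, 2)),
                rng.normal([1.0, 1.0], 1.5, size=(100, 2))])
y = np.concatenate([np.zeros(100), np.ones(100)])
X = np.column_stack([np.ones(len(X0)), X0])  # prepend the intercept column

theta_gd = fit_gradient_descent(X, y)
theta_newton = fit_newton(X, y)
print(theta_gd)
print(theta_newton)
```

When the classes overlap and both optimizers are run to convergence, the two coefficient vectors agree closely, which is what convexity predicts. Differences between solvers on such data therefore point to early stopping, regularization settings, or a bug rather than to distinct minima.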
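I agree, there should be no local minima in a logistic regression problem. Now, the data here is (most of the time) separable, so there is actually no minimum: the coefficients can grow towards infinity and yield ever lower loss. And (usually) this dataset admits many perfectly separating lines, so which one is found depends on the solver. But one of the images linked in the question clearly shows a suboptimal solution for a non-separable instance of the data. So I feel there must still be something wrong with this implementation.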
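
The separable case described in this comment can be sketched in a few lines. The example below is an illustration on a tiny hand-made dataset (hypothetical, chosen so that the classes are perfectly separable), using plain batch gradient descent. It records the coefficient norm and the log loss at a few checkpoints.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(X, y, theta):
    z = X @ theta
    # log(1 + e^z) - y*z is an algebraically equivalent, numerically stable log loss
    return np.mean(np.logaddexp(0.0, z) - y * z)

# Hypothetical perfectly separable 1-D data: negatives left of 0, positives right.
x = np.array([-3.0, -2.0, -1.0, 1.0, 2.0, 3.0])
y = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])
X = np.column_stack([np.ones_like(x), x])  # intercept column + one feature

theta = np.zeros(2)
history = []  # (iterations, ||theta||, loss) at each checkpoint
done = 0
for checkpoint in (100, 1000, 10000):
    for _ in range(checkpoint - done):
        theta -= 0.1 * X.T @ (sigmoid(X @ theta) - y) / len(y)
    done = checkpoint
    history.append((checkpoint, np.linalg.norm(theta), log_loss(X, y, theta)))

for iterations, norm, loss in history:
    print(iterations, norm, loss)
```

On this data the loss keeps shrinking and the coefficient norm keeps growing as more iterations are run, so the point where the optimizer stops is set by the iteration budget, not by a minimum of the loss.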