Different Python minimization functions give different values. Why? - Answers

【Question title】: Different Python minimization functions give different values, Why?
【Posted】: 2014-01-09 18:58:04
【Question】:

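I am trying to learn Python by redoing the assignments from Andrew Ng's machine learning course, which were originally written in Octave (I took the course and earned the certificate). I am having trouble with the optimization functions. The course uses fmincg, an Octave function, to minimize the cost function of linear regression (a convex function), given its derivative. The course also teaches gradient descent and the normal equation; used correctly, all of them should in theory give the same result (to within a few decimal places). They all work well for linear regression, and I got the same results in Python. To be clear, I am trying to minimize the cost function to find the best-fit parameters (theta) for a dataset.

So far I have used "nelder-mead", which does not need a derivative, and it gave me the solution closest to theirs. I have also tried "TNC", "CG", and "BFGS", all of which need a derivative to minimize the function. They all work fine with a first-order polynomial (linear), but when I raise the order of the polynomial, in my case to x^1 through x^8, I cannot get my function to fit the dataset. The exercise is very simple: I have 12 data points, so an 8th-order polynomial should capture every point (if you are curious, this is an example of high variance, i.e. overfitting the data). The solution they show is a curve that passes through all the data points as expected. The best result I got was with "nelder-mead", which captured only two points of the dataset; the other minimization functions gave me nothing close to what I am looking for. I am not sure what is wrong, because my cost function and gradient give the correct values for the linear case (the exact answers from Octave), so I assume they work correctly.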

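I will list the functions in Octave and in Python, hoping that someone can explain why I get different answers, or point out an obvious mistake that I am not seeing.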

function [J, grad] = linearRegCostFunction(X, y, theta, lambda)
%LINEARREGCOSTFUNCTION Compute cost and gradient for regularized linear 
%regression with multiple variables
%   [J, grad] = LINEARREGCOSTFUNCTION(X, y, theta, lambda) computes the 
%   cost of using theta as the parameter for linear regression to fit the 
%   data points in X and y. Returns the cost in J and the gradient in grad


m = length(y); % number of training examples 
J = 0;
grad = zeros(size(theta));

htheta = X * theta;
n = size(theta);
J = 1 / (2 * m) * sum((htheta - y) .^ 2) + lambda / (2 * m) * sum(theta(2:n) .^ 2);

grad = 1 / m * X' * (htheta - y);
grad(2:n) = grad(2:n) + lambda / m * theta(2:n); # we leave the bias nice 
grad = grad(:);

end

Here are snippets of my code; I can provide the full code if anyone would like it:

import numpy as np
import matplotlib.pyplot as plt
from scipy import optimize

def costFunction(theta, Xcost, y, lmda):
    m = len(y)
    theta = theta.reshape((len(theta),1))
    htheta = np.dot(Xcost,theta) - y 
    J = 1 / (2 * m) * np.dot(htheta.T,htheta) + lmda / (2 * m) * np.sum(theta[1:,:]**2)
    return J

def gradCostFunc(gradtheta, X, y, lmda):
    m = len(y)
    gradtheta = gradtheta.reshape((len(gradtheta),1))
    hgradtheta = np.dot(X,gradtheta) - y 
    #gradtheta[0,0] = 0. 

    grad = (1 / m) * np.dot(X.T, hgradtheta)

    #for i in range(1,len(grad)):
    grad[1:,0] = grad[1:,0] + (lmda/m) * gradtheta[1:,0]
    return grad.reshape((len(grad)))

def normalEqn(X, y, lmda):
    e = np.eye(X.shape[1])
    e[0,0] = 0
    theta = np.dot(np.linalg.pinv(np.dot(X.T,X) + lmda * e),np.dot(X.T,y))
    return theta 

def gradientDescent(X, y, theta, alpha, lmda, num_iters):
    # calculate gradient descent in an iterative manner
    m = len(y)
    # J_history tracks the evolution of the cost function 
    J_history = np.zeros((num_iters,1))

    # Calculating the gradients 
    for i in range(0, num_iters):
        grad = np.zeros((len(theta),1))
        grad = gradCostFunc(theta, X, y, lmda)
        #updating the thetas 
        theta = theta - alpha * grad 
        J_history[i] = costFunction(theta, X, y, lmda)

    plt.plot(J_history)
    plt.show()

    return theta 

def trainLR(initheta, X, y, lmda):
    #print theta.shape, X.shape, y.shape, gradtest.shape gradCostFunc
    options = {'maxiter': 1000}
    res = optimize.minimize(costFunction, initheta, jac=gradCostFunc, method='CG',
                            args=(X, y, lmda), options=options)
    #res = optimize.minimize(costFunction, theta, method='nelder-mead',
    #                        args=(X,y,lmda), options={'disp': False})
    #res = optimize.fmin_bfgs(costFunction, theta, fprime=gradCostFunc, args=(X, y, lmda))
    return res.x

def polyFeatures(X, degree):
    # map the higher polynomials 
    out = X 
    if degree >= 2:
        for i in range(2,degree+1):
            out = np.column_stack((out,X**i))
        return out 
    else:
        return out

def featureNormalize(X):
    # Since the values will vary by orders of magnitudes 
    # It’s important to normalize the various features 
    mu = np.mean(X, axis=0)
    S1 = np.std(X, axis=0)
    return mu, S1, (X - mu)/S1

Here are the main calls to these functions:

X, y, Xval, yval, Xtest, ytest = loadData('ex5data1.mat')
X_poly = X # to be used in the later on in the program 
p = 8 
X_poly = polyFeatures(X_poly, p)
mu, sigma, X_poly = featureNormalize(X_poly)
X_poly = padding(X_poly)
theta = np.zeros((X_poly.shape[1],1))
theta = trainLR(theta, X_poly, y, 0.)
#theta = normalEqn(X_poly, y, 0.)
#theta = gradientDescent(X_poly, y, theta, 0.1, 0, 1500)
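
The three routes called above (a SciPy optimizer, the normal equation, and gradient descent) can be cross-checked against each other on a small, well-conditioned problem. The following is a minimal sketch of such a check, not the asker's code: the data are made up (not `ex5data1.mat`), and `cost` and `grad` are hypothetical 1-D variants of `costFunction` and `gradCostFunc` that return a plain scalar and a flat vector.

```python
import numpy as np
from scipy import optimize

def cost(theta, X, y, lmda):
    # Regularized squared-error cost; the bias theta[0] is not penalized.
    m = len(y)
    r = X @ theta - y
    return (r @ r) / (2 * m) + lmda / (2 * m) * np.sum(theta[1:] ** 2)

def grad(theta, X, y, lmda):
    # Gradient of cost() with respect to theta, as a flat vector.
    m = len(y)
    g = X.T @ (X @ theta - y) / m
    g[1:] += lmda / m * theta[1:]
    return g

# Made-up data (an assumption): one standardized feature plus a bias column.
rng = np.random.default_rng(0)
x = rng.uniform(-50, 40, size=12)
y = 3.0 + 0.4 * x + rng.normal(scale=2.0, size=12)
z = (x - x.mean()) / x.std()
X = np.column_stack([np.ones(12), z])
lmda = 1.0

# Route 1: regularized normal equation.
E = np.eye(X.shape[1])
E[0, 0] = 0
theta_ne = np.linalg.solve(X.T @ X + lmda * E, X.T @ y)

# Route 2: plain gradient descent with a fixed step size.
theta_gd = np.zeros(2)
for _ in range(1500):
    theta_gd -= 0.1 * grad(theta_gd, X, y, lmda)

# Route 3: a gradient-based SciPy optimizer.
theta_bfgs = optimize.minimize(cost, np.zeros(2), args=(X, y, lmda),
                               jac=grad, method="BFGS").x

print(theta_ne, theta_gd, theta_bfgs)
```

On a problem this small and well scaled, the three parameter vectors should agree to several decimal places, which matches the behaviour described for the linear case.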

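【Comments】:

Why not compare the result of each step against the correct result from Octave? You can print the intermediate results of the cost function and of the gradient function.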
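
One way to follow this suggestion is to evaluate the cost and gradient at a fixed theta, so the numbers can be compared with Octave, and to compare the analytic gradient with a finite-difference estimate using `scipy.optimize.check_grad`. This is only a sketch under assumptions: the data are random placeholders, and `cost` and `grad` are hypothetical 1-D versions of the functions in the question.

```python
import numpy as np
from scipy import optimize

def cost(theta, X, y, lmda):
    # Regularized squared-error cost, returned as a scalar.
    m = len(y)
    r = X @ theta - y
    return (r @ r) / (2 * m) + lmda / (2 * m) * np.sum(theta[1:] ** 2)

def grad(theta, X, y, lmda):
    # Analytic gradient of cost(), returned as a flat vector.
    m = len(y)
    g = X.T @ (X @ theta - y) / m
    g[1:] += lmda / m * theta[1:]
    return g

# Placeholder data (an assumption): 12 rows, a bias column plus 8 features.
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(12), rng.normal(size=(12, 8))])
y = rng.normal(size=12)
theta = np.ones(X.shape[1])

# Intermediate values at a fixed theta, to compare with Octave's output.
print("J    =", cost(theta, X, y, 1.0))
print("grad =", grad(theta, X, y, 1.0))

# Norm of the difference between analytic and finite-difference gradients.
grad_error = optimize.check_grad(cost, grad, theta, X, y, 1.0)
print("check_grad error:", grad_error)
```

A small `check_grad` error suggests the gradient is consistent with the cost; a large one points to a mismatch between the two functions.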

Tags: python machine-learning octave linear-regression

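【Answer 1】:

My answer may not be appropriate, since your question is about getting help debugging your current implementation.

That said, if you are interested in using an off-the-shelf optimizer in Python, take a look at OpenOpt. The library contains reasonably performant implementations of optimizers for a variety of optimization problems.

I should also mention that the scikit-learn library provides a nice machine learning toolset for Python.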

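【Comments】:

If you can help find the bug, great, but all I am asking here is: why do I get different answers from different functions?
@Henry80s: I think I misread your original question. Are you (a) getting different answers between Octave and Python, or (b) getting unexpected results for the high-order polynomial fit?
What I am asking here is why I get different answers from different functions. Also, do my costFunction and gradCostFunction look correct?
I only get this with the high-order polynomial, not in the linear case.
Since you are not interested in regularization, you could try temporarily removing the regularization term. That ensures overfitting is not penalized, so there is a better chance of fitting all the points. I am also curious about the dimensions of your X matrix and theta vector. Do both have the polynomial order as one of their dimensions, and does each entry (row) of X have the form [1 x x^2 ... x^8]? Thanks.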
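
The checks suggested in the last comment can be sketched as follows: drop the regularization term, confirm the shapes of X and theta, and compare the final cost reached by each optimizer with a direct least-squares solve. Everything here is an assumption for illustration: the data are made up, and `design_matrix`, `cost`, and `grad` are hypothetical helpers, not the asker's functions.

```python
import numpy as np
from scipy import optimize

def design_matrix(x, degree):
    # Rows of the form [1, x, x^2, ..., x^degree], with the powers standardized.
    powers = np.column_stack([x ** k for k in range(1, degree + 1)])
    powers = (powers - powers.mean(axis=0)) / powers.std(axis=0)
    return np.column_stack([np.ones(len(x)), powers])

def cost(theta, X, y):
    # Unregularized squared-error cost, returned as a scalar.
    r = X @ theta - y
    return (r @ r) / (2 * len(y))

def grad(theta, X, y):
    # Gradient of cost(), returned as a flat vector.
    return X.T @ (X @ theta - y) / len(y)

# Made-up inputs (an assumption, not ex5data1.mat).
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(-50, 40, size=12))
y = 0.01 * x ** 2 + rng.normal(scale=2.0, size=12)

X = design_matrix(x, 8)
theta0 = np.zeros(X.shape[1])
print(X.shape, theta0.shape)  # (12, 9) (9,)

# Reference: direct least-squares solution of the unregularized problem.
theta_ref, *_ = np.linalg.lstsq(X, y, rcond=None)
costs = {"lstsq": cost(theta_ref, X, y)}

# Final cost reached by several gradient-based methods from the same start.
for method in ("CG", "BFGS", "TNC"):
    res = optimize.minimize(cost, theta0, args=(X, y), jac=grad, method=method)
    costs[method] = res.fun

for name, value in costs.items():
    print(f"{name:6s} cost = {value:.6f}")
```

If an optimizer's final cost is noticeably above the least-squares cost, it stopped before reaching the minimum (for example because of its tolerance or iteration limit), which is one possible reason for different methods returning different theta values on a high-order polynomial.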