Gradient descent implementation in Octave: answers

【Question title】: Gradient Descent implementation in Octave
【Posted】: 2012-05-22 09:50:57
【Question】:

I have actually been struggling with this for about 2 months. What makes these two versions different?

hypotheses= X * theta
temp=(hypotheses-y)'
temp=X(:,1) * temp
temp=temp * (1 / m)
temp=temp * alpha
theta(1)=theta(1)-temp

hypotheses= X * theta
temp=(hypotheses-y)'
temp=temp * (1 / m)
temp=temp * alpha
theta(2)=theta(2)-temp



theta(1) = theta(1) - alpha * (1/m) * ((X * theta) - y)' * X(:, 1);
theta(2) = theta(2) - alpha * (1/m) * ((X * theta) - y)' * X(:, 2);

The latter works. I am just not sure why. I struggle to understand the need for the matrix inverse.

【Question comments】:

I don't think this is a proper implementation of gradient descent. To be accurate, you need to update both of your thetas at the same time. tmpTheta1= theta(1) - alpha * (1/m) * ((X * theta) - y)' * X(:, 1); tmpTheta2= theta(2) - alpha * (1/m) * ((X * theta) - y)' * X(:, 2);theta(1)=tmpTheta1;theta(2)=tmpTheta2;

Tags: octave

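【Solution 1】:

In the first one, if X is a 3x2 matrix and theta is a 2x1 matrix, then "hypotheses" will be a 3x1 matrix.

Assuming y is a 3x1 matrix, (hypotheses - y) is also a 3x1 matrix, and its transpose, which is assigned to temp, is a 1x3 matrix.

In the second block, that 1x3 matrix (scaled by 1/m and alpha) is then subtracted from theta(2) and assigned back to it, but theta(2) has to be a scalar, not a matrix.

The last two lines of your code work because, using my m x n example above,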

(X * theta)

will be a 3x1 matrix.

Subtracting y (a 3x1 matrix) from that 3x1 matrix gives another 3x1 matrix.

(X * theta) - y

So the transpose of that 3x1 matrix is a 1x3 matrix.

((X * theta) - y)'

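Finally, a 1x3 matrix times a 3x1 matrix gives a 1x1 matrix, that is, a scalar, which is what you are looking for. I am sure you already knew this, but to be thorough: X(:,2) is the second column of the 3x2 matrix, which makes it a 3x1 matrix.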
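The shapes described in this answer can be checked directly with size(). The following is a minimal sketch, assuming made-up values for X, y, and theta; only the 3x2, 2x1, and 3x1 shapes come from the answer, and the names err_t, g2, and outer are hypothetical.

```octave
% Hypothetical data: 3 training examples, an intercept column plus 1 feature.
X = [1 1; 1 2; 1 3];        % 3x2
theta = [0; 0];             % 2x1
y = [1; 2; 3];              % 3x1

hypotheses = X * theta;     % 3x2 * 2x1 -> 3x1
err_t = (hypotheses - y)';  % transpose of a 3x1 -> 1x3
g2 = err_t * X(:, 2);       % 1x3 * 3x1 -> 1x1, fits into theta(2)
outer = X(:, 1) * err_t;    % 3x1 * 1x3 -> 3x3, does not fit into theta(1)

disp(size(hypotheses));
disp(size(err_t));
disp(size(g2));
disp(size(outer));
```

The last product is the order used in the first block of the question (X(:,1) * temp), which yields a matrix instead of the scalar that the working version produces.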

【Discussion】:

【Solution 2】:

When you update, you need to do it like this:

Start Loop {

temp0 = theta0 - (equation_here);

temp1 = theta1 - (equation_here);


theta0 =  temp0;

theta1 =  temp1;

} End loop
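The pattern above can be sketched in Octave as follows. This is only an illustration under assumed data: X, y, alpha, and the iteration count are made up, and theta0 and theta1 from the pseudocode are mapped to theta(1) and theta(2).

```octave
% Hypothetical data and settings, chosen only for illustration.
X = [1 1; 1 2; 1 3];        % intercept column plus one feature
y = [1; 2; 3];
theta = [0; 0];
alpha = 0.1;
m = length(y);

for iter = 1:2000
    err = X * theta - y;    % computed once, from the old theta
    temp0 = theta(1) - alpha * (1/m) * (err' * X(:, 1));
    temp1 = theta(2) - alpha * (1/m) * (err' * X(:, 2));
    theta(1) = temp0;       % assign only after both temps are computed
    theta(2) = temp1;
end

disp(theta);
```

Because err is computed before either component is changed, both updates use the same old theta, which is the point of the temporary variables.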

【Discussion】:

【Solution 3】:

In the second block of your first example you have missed a step, haven't you? I am assuming you concatenated X with a vector of ones.

   temp=X(:,2) * temp

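The last example will work, but it can be vectorized further to be simpler and more efficient.

I have assumed you only have 1 feature. It works the same way with multiple features, since all that happens is that you add an extra column to your X matrix for each feature. Basically, you add a vector of ones to x to vectorize the intercept.

You can update a 2x1 matrix of thetas in one line of code. Concatenate a vector of ones with x to make it an nx2 matrix; you can then calculate h(x) by multiplying it by the theta vector (2x1). This is the (X * theta) bit.

The second part of the vectorization is to transpose ((X * theta) - y), which gives you a 1*n matrix. Multiplied by X (an n*2 matrix), it aggregates both (h(x)-y)x0 and (h(x)-y)x1, so both thetas are handled at the same time. The result is a 1*2 matrix of update terms for my thetas, which I transpose again so that it has the same dimensions as the theta vector. I can then do a simple scalar multiplication by alpha and a vector subtraction from theta.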

X = data(:, 1); y = data(:, 2);
m = length(y);
X = [ones(m, 1), data(:,1)]; 
theta = zeros(2, 1);        

iterations = 2000;
alpha = 0.001;

for iter = 1:iterations
     theta = theta -((1/m) * ((X * theta) - y)' * X)' * alpha;
end
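The outer transpose in the loop above only turns a 1x2 row into a 2x1 column. Since (A * B)' equals B' * A', the same update term can be written as X' * (X * theta - y). The sketch below compares the two forms, plus a column-by-column sum, on made-up data; X, y, theta and the names g_row, g_col, and g_sum are assumptions for illustration.

```octave
% Hypothetical data, used only to compare the expressions.
X = [1 1; 1 2; 1 3; 1 4];     % n x 2 design matrix (n = 4)
y = [2; 4; 5; 8];
theta = [0.5; 1.5];
m = length(y);

err = X * theta - y;          % n x 1

g_row = ((1/m) * err' * X)';  % form used in the loop: (1 x n * n x 2)' -> 2 x 1
g_col = (1/m) * X' * err;     % same values without the outer transpose

% Component-wise version: for each theta(j), sum err .* (column j of X).
g_sum = zeros(2, 1);
for j = 1:2
    g_sum(j) = (1/m) * sum(err .* X(:, j));
end

disp(max(abs(g_row - g_col)));
disp(max(abs(g_row - g_sum)));
```

All three vectors have the same size as theta, so any of them can be scaled by alpha and subtracted from it.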

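【Discussion】:

Why do you need to transpose (1/m) * ((X * theta) - y)' * X in your for loop?
Same question as Grahm: why is that whole subexpression transposed?
The result of ((1/m) * ((X * theta) - y)' * X) is 1x2. theta is 2x1. So the bit between the brackets needs to be transposed to have the same dimensions before it is subtracted from theta.
Same question as above. It should be theta = theta - (alpha/m) * X' * (X * theta - y)
I think it has to do with the rules of matrix arithmetic... A*B multiplies the rows of A by the columns of B... I mean, conceptually you can use the general gradient descent formula, but once you work with matrices you have to follow their rules (row-by-column multiplication, non-commutativity, etc.)... That is just my guess, maybe I am wrong...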

【Solution 4】:

Spoiler alert

function [theta, J_history] = gradientDescent(X, y, theta, alpha, num_iters)
m = length(y); % number of training examples
J_history = zeros(num_iters, 1);

for iter = 1:num_iters

% ====================== YOUR CODE HERE ======================
% Instructions: Perform a single gradient step on the parameter vector
%               theta. 
%
% Hint: While debugging, it can be useful to print out the values
%       of the cost function (computeCost) and gradient here.
% ========================== BEGIN ===========================

J = computeCost(X, y, theta);                    % cost before the step
t = theta - ((alpha*((theta'*X') - y'))*X/m)';   % simultaneous update of all thetas
theta = t;
J1 = computeCost(X, y, theta);                   % cost after the step

if (J1 > J)
    fprintf('Wrong alpha\n');   % print before break, otherwise it never runs
    break;
elseif (J1 == J)
    break;                      % cost no longer changes
end

% ========================== END ==============================

% Save the cost J in every iteration    
J_history(iter) = sum(computeCost(X, y, theta));
end
end

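【Discussion】:

The idea is to help users move forward, not to post complete homework exercise solutions.
Please add advice and ideas that answer the OP's question. Code alone does not help anyone.

【Solution 5】: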

function [theta, J_history] = gradientDescent(X, y, theta, alpha, num_iters)
% Performs gradient descent to learn theta. Updates theta by taking num_iters 
% gradient steps with learning rate alpha.

% Number of training examples
m = length(y); 
% Save the cost J in every iteration in order to plot J vs. num_iters and check for convergence 
J_history = zeros(num_iters, 1);

for iter = 1:num_iters
    h = X * theta;
    stderr = h - y;
    theta = theta - (alpha/m) * (stderr' * X)';
    J_history(iter) = computeCost(X, y, theta);
end

end

【Discussion】:

How can we print 'J_history' on the command line?
OK, I have solved it: [theta, J_history] = gradientDescent(X, y, theta, alpha, num_iters);

【Solution 6】:

This can be vectorized more simply with:

h = X * theta   % m x 1 vector (the prediction our hypothesis gives per training example)
std_err = h - y  % m x 1 vector of errors (one per training example)
theta = theta - (alpha/m) * X' * std_err

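Remember that X is the design matrix, so each row of X represents a training example and each column of X represents a given component (for example, the zeroth or first component) across all training examples. Each column of X is therefore exactly what we want to multiply element-wise with std_err and then sum to get the corresponding component of the update to the theta vector.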
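The X' in the update comes out of the derivative itself and is not there only to make the dimensions fit. The derivation below is a sketch that assumes the usual squared-error cost for linear regression; the thread never states the cost function (computeCost is referenced but not shown), so J and the row notation x^{(i)} are assumptions.

```latex
% Assumed cost: mean squared error with the conventional factor 1/2.
J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( x^{(i)\top}\theta - y^{(i)} \right)^2

% Partial derivative with respect to one component \theta_j,
% where X_{ij} is component j of training example i:
\frac{\partial J}{\partial \theta_j}
  = \frac{1}{m} \sum_{i=1}^{m} \left( x^{(i)\top}\theta - y^{(i)} \right) X_{ij}
  = \frac{1}{m} \left[ X^{\top} (X\theta - y) \right]_j

% Stacking all components gives the vectorized update:
\theta \leftarrow \theta - \frac{\alpha}{m} X^{\top} (X\theta - y)
```

The sum over i of the error times X_{ij} is the "multiply column j element-wise with std_err, then sum" step described above, and writing it for every j at once is what X' * std_err computes.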

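【Discussion】:

This all seems fine. But why can we transpose X? Doesn't that change the values? Many people here just explain it as something we have to do to make the matrix dimensions work. But why? Is X' the derivative?
Is X * theta a dot product or an element-wise operation?

【Solution 7】: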

function [theta, J_history] = gradientDescent(X, y, theta, alpha, num_iters)
m = length(y); % number of training examples
J_history = zeros(num_iters, 1);
for iter = 1 : num_iters
    hypothesis = X * theta;
    Error = (hypothesis - y);
    temp = theta - ((alpha / m) * (Error' * X)');
    theta = temp;
    J_history(iter) = computeCost(X, y, theta);
end
end

【Discussion】:

Please explain how this answers the question.