
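【Question title】: Linear Regression algorithm works with one data-set but not on another, similar data-set. Why?
【Posted】: 2017-11-23 18:00:54
【Question description】:

I followed a tutorial to build a linear regression algorithm and applied it to the data set that came with the tutorial, where it works fine. However, the same algorithm does not work on another, similar data set. Can anyone tell me why this happens?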

import numpy as np

# X, y and theta are np.matrix objects built earlier (not shown in the question),
# so `*` below is matrix multiplication.

def computeCost(X, y, theta):
    # Half the mean squared error of the predictions X * theta.T
    inner = np.power(((X * theta.T) - y), 2)
    return np.sum(inner) / (2 * len(X))

def gradientDescent(X, y, theta, alpha, iters):
    temp = np.matrix(np.zeros(theta.shape))
    params = int(theta.ravel().shape[1])
    cost = np.zeros(iters)

    for i in range(iters):
        err = (X * theta.T) - y

        # Update each parameter from the same error vector
        for j in range(params):
            term = np.multiply(err, X[:,j])
            temp[0, j] = theta[0, j] - ((alpha / len(X)) * np.sum(term))

        theta = temp
        cost[i] = computeCost(X, y, theta)

    return theta, cost

alpha = 0.01
iters = 1000

g, cost = gradientDescent(X, y, theta, alpha, iters)
print(g)

When I run the algorithm on this data set, the output is matrix([[ nan, nan]]) and I get the following warnings:

C:\Anaconda3\lib\site-packages\ipykernel\__main__.py:2: RuntimeWarning: overflow encountered in power
  from ipykernel import kernelapp as app
C:\Anaconda3\lib\site-packages\ipykernel\__main__.py:11: RuntimeWarning: invalid value encountered in double_scalars

However, this other data set works fine and outputs matrix([[-3.24140214, 1.1272942 ]])

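The two data sets are similar. I have gone over this many times, but I can't figure out why it works on one data set and not on the other. Any help is welcome.

Edit: thanks to Mark_M for the editing tips :-)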

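【Question discussion】:

You were downvoted because you didn't take the time to isolate the problem and include the code in the question. Nobody wants to dig through someone's repo to work out what is wrong with their code. That's not what Stack Overflow is for. You need to identify the problem and ask a specific question. See here for help with asking: stackoverflow.com/help/mcve
@Mark_M Oh, I understand now. I'll edit it. What should I do about the data sets? Is a link enough?

Tags: python-3.x machine-learning linear-regression data-science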

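【Solution 1】:

[Much better question, by the way]

It's hard to know exactly what is going on here, but basically your cost is heading in the wrong direction and spiraling out of control, which causes an overflow when you try to square the value.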

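I think in your case it comes down to your step size (alpha) being too large, which can make gradient descent overshoot the minimum and move further away on every step. You need to watch the cost during gradient descent and make sure it always goes down; if it doesn't, either something is broken or alpha is too large.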
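As a rough illustration of watching the cost, the sketch below stops as soon as the cost rises or stops being finite. The function names `compute_cost` and `gradient_descent_checked` and the synthetic, noise-free data are assumptions made for this example; they are not taken from the question or from either linked data set.

```python
import numpy as np

def compute_cost(X, y, theta):
    """Half the mean squared error, the same formula as computeCost."""
    residual = X @ theta.T - y
    return float(np.sum(residual ** 2) / (2 * len(X)))

def gradient_descent_checked(X, y, theta, alpha, iters):
    """Gradient descent that bails out when the cost goes up or overflows.

    Returns (theta, cost, converging), where converging is False if the
    run was stopped early because the cost increased or became non-finite.
    """
    cost = compute_cost(X, y, theta)
    for _ in range(iters):
        grad = np.sum((X @ theta.T - y) * X, axis=0) / len(X)
        new_theta = theta - alpha * grad
        new_cost = compute_cost(X, y, new_theta)
        if not np.isfinite(new_cost) or new_cost > cost:
            return theta, cost, False  # keep the last good parameters
        theta, cost = new_theta, new_cost
    return theta, cost, True

# Hypothetical data: x spans 0..100 and y is an exact line (no noise).
x = np.linspace(0, 100, 50).reshape(-1, 1)
y = 1.5 * x + 2.0
X = np.hstack([np.ones_like(x), x])
theta0 = np.array([[1.0, 1.0]])

_, _, ok_large = gradient_descent_checked(X, y, theta0, alpha=0.01, iters=1000)
_, _, ok_small = gradient_descent_checked(X, y, theta0, alpha=0.0001, iters=1000)
print(ok_large, ok_small)
```

On this synthetic data the run with alpha = 0.01 should be flagged on its first step, while the run with alpha = 0.0001 should complete with the cost falling throughout.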

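Personally, I would rework the code and try to get rid of the loops. It's a matter of preference, but I find it easier to work with X and Y as column vectors. Here is a minimal example: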

import numpy as np
from numpy import genfromtxt
# this is your 'bad' data set from github
my_data = genfromtxt('testdata.csv', delimiter=',')

def computeCost(X, y, theta):
    inner = np.power(((X @ theta.T) - y), 2)
    return np.sum(inner) / (2 * len(X))

def gradientDescent(X, y, theta, alpha, iters):
    for i in range(iters):
        # you don't need the extra loop - this can be vectorized,
        # making it much faster and simpler
        theta = theta - (alpha/len(X)) * np.sum((X @ theta.T - y) * X, axis=0)
        cost = computeCost(X, y, theta)
        if i % 10 == 0: # just look at cost every ten loops for debugging
            print(cost)
    return (theta, cost)

# notice small alpha value
alpha = 0.0001
iters = 100

# here the features are columns of X, with a leading column of ones
X = my_data[:, 0].reshape(-1,1)
ones = np.ones([X.shape[0], 1])
X = np.hstack([ones, X])

# theta is a row vector
theta = np.array([[1.0, 1.0]])

# y is a column vector
y = my_data[:, 1].reshape(-1,1)

g, cost = gradientDescent(X, y, theta, alpha, iters)
print(g, cost)

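Another useful technique is to normalize your data before running the regression. This is especially useful when you are fitting more than one feature.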
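One common way to normalize is to standardize each feature to zero mean and unit standard deviation, fit on the scaled values, and then convert the parameters back to the original units. The sketch below assumes that approach; the `standardize` helper and the synthetic data are hypothetical and only stand in for the real data set.

```python
import numpy as np

def standardize(column):
    """Rescale a feature column to zero mean and unit standard deviation."""
    mean, std = column.mean(), column.std()
    return (column - mean) / std, mean, std

def gradient_descent(X, y, theta, alpha, iters):
    """Vectorized batch gradient descent, as in the example above."""
    for _ in range(iters):
        theta = theta - (alpha / len(X)) * np.sum((X @ theta.T - y) * X, axis=0)
    return theta

# Hypothetical data: a wide x range that would need a tiny alpha if left unscaled.
x = np.linspace(0, 100, 50).reshape(-1, 1)
y = 1.5 * x + 2.0

x_scaled, x_mean, x_std = standardize(x)
X = np.hstack([np.ones_like(x_scaled), x_scaled])

# With a standardized feature, a much larger step size is stable.
theta = gradient_descent(X, y, np.array([[0.0, 0.0]]), alpha=0.1, iters=1000)

# Convert the fitted parameters back to the original units of x.
slope = theta[0, 1] / x_std
intercept = theta[0, 0] - slope * x_mean
print(intercept, slope)
```

Because the synthetic line is y = 1.5x + 2 with no noise, the recovered intercept and slope should come out very close to 2 and 1.5.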

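As a side note: if your step size is right, you shouldn't get an overflow no matter how many iterations you run, because the cost decreases on every iteration and the rate of decrease slows down.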
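For this particular squared-error cost, one way to judge whether a step size is safe is the standard bound for gradient descent on a quadratic: the updates stay stable when alpha is below 2 divided by the largest eigenvalue of X^T X / m. The sketch below computes that bound; `max_stable_alpha` is a hypothetical helper name and both feature ranges are made up, chosen only to show how similar-looking data can tolerate very different step sizes.

```python
import numpy as np

def max_stable_alpha(X):
    """Upper bound on alpha for the squared-error cost.

    The Hessian of the cost is X^T X / m, and gradient descent on a
    quadratic is stable when alpha < 2 / (largest Hessian eigenvalue).
    """
    hessian = X.T @ X / len(X)
    return 2.0 / np.linalg.eigvalsh(hessian).max()

def design_matrix(x):
    """Prepend a column of ones to a single feature column."""
    return np.hstack([np.ones_like(x), x])

# Two hypothetical feature ranges that look alike apart from their scale.
x_small = np.linspace(5, 15, 50).reshape(-1, 1)
x_large = np.linspace(0, 100, 50).reshape(-1, 1)

bound_small = max_stable_alpha(design_matrix(x_small))
bound_large = max_stable_alpha(design_matrix(x_large))
print(bound_small, bound_large)
```

In this sketch alpha = 0.01 falls below the bound for the narrow range but well above it for the wide one, which matches the pattern of one data set converging and another producing nan.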

After 1000 iterations I ended up with this theta and cost:

[[ 1.03533399  1.45914293]] 56.041973778

After 100:

[[ 1.01166889  1.45960806]] 56.0481988054

You can use this to look at the fit in an IPython notebook:

%matplotlib inline
import matplotlib.pyplot as plt

plt.scatter(my_data[:, 0].reshape(-1,1), y)
axes = plt.gca()
x_vals = np.array(axes.get_xlim())
y_vals = g[0][0] + g[0][1]* x_vals
plt.plot(x_vals, y_vals, '--')

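【Discussion】:

Wow, you're right. I lowered the learning rate and it converged, giving matrix([[ 0.05905856, 1.47833133]]). I've just started learning machine learning and I think it's great. Thanks for the helpful advice on optimizing the code; I'll internalize it as soon as I can.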