Pyplot cannot plot regression

【Question title】: Pyplot cannot plot Regression
【Posted】: 2017-03-30 20:49:33
【Question】:

I am trying to imitate this very simple example:

import numpy as np
import matplotlib.pyplot as plt

N = 50
x = np.random.rand(N)
y = np.random.rand(N)
colors = np.random.rand(N)
area = np.pi * (15 * np.random.rand(N))**2  # 0 to 15 point radii

print(type(x),type(y))
print('training samples ',len(x),len(y))
plt.scatter(x, y, c=colors, alpha=0.5)
plt.show()

This prints

<class 'numpy.ndarray'> <class 'numpy.ndarray'>
training samples  50 50

as expected, and the plot shows up as well. Now I am trying to plot the results of a GradientBoostingRegressor like this:

from sklearn.ensemble import GradientBoostingRegressor

base_regressor = GradientBoostingRegressor()
base_regressor.fit(X_train, y_train)
y_pred_base = base_regressor.predict(X_test)

print(type(X_train),type(y_train))
print('training samples ',len(X_train),len(y_train))
print(type(X_test),type(y_pred_base))
print('base samples ',len(X_test),len(y_pred_base))

plt.figure()

plt.scatter(X_train, y_train, c="k", label="training samples")
plt.plot(X_test, y_pred_base, c="g", label="n_estimators=1", linewidth=2)
plt.xlabel("data")
plt.ylabel("target")
plt.title("Base Regression")
plt.legend()
plt.show()

Note that X_train, y_train and X_test are all numpy arrays. For the code above I get

<class 'numpy.ndarray'> <class 'numpy.ndarray'>
training samples  74067 74067
<class 'numpy.ndarray'> <class 'numpy.ndarray'>
base samples  166693 166693

But the plot does not show up, and I get the error

ValueError: x and y must be the same size

at

plt.scatter(X_train, y_train, c="k", label="training samples")

But as the output shows, x and y have the same length and type. What am I doing wrong?

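【Comments】:

Instead of printing len(X_test), can you print X_test.shape?
Thanks for the suggestion. Now I get training samples (74067, 163) (74067,) and base samples (166693, 163) (166693,)
That makes sense for my training dimensions; I have 163 columns.
I guess what I mean is: how do I plot y_train against X_train, and y_pred_base against X_test?
So how is your training data organized? X_train and X_test are 2-D arrays, so how do you want to plot them? plt.scatter plots a 1-D y array against a 1-D x array on a 2-D plot.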

Tags: python numpy matplotlib scikit-learn

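【Solution 1】:

Your X_train array is two-dimensional, with 163 columns per sample. You cannot plot the one-dimensional y_train array against the whole X_train array. The same goes for plotting y_pred_base against X_test.

You have to pick a single column of the X array to plot against, editing your code like this: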

plt.scatter(X_train[:, 17], y_train, c="k", label="training samples")
plt.plot(X_test[:, 17], y_pred_base, c="g", label="n_estimators=1", linewidth=2)
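
To see why len() looked fine while plt.scatter still complained, here is a minimal sketch using synthetic stand-ins for the arrays in the question. The sample count of 100 and the random data are assumptions made for illustration only; the column index 17 is simply the one used above.

```python
import numpy as np

# Synthetic stand-ins for the question's arrays (sizes are made up).
rng = np.random.default_rng(0)
X_train = rng.random((100, 163))   # 100 samples, 163 feature columns
y_train = rng.random(100)          # one target value per sample

# len() only reports the first dimension, so the arrays look compatible.
print(len(X_train), len(y_train))      # 100 100

# .shape and .size expose the mismatch behind "x and y must be the same size".
print(X_train.shape, y_train.shape)    # (100, 163) (100,)
print(X_train.size, y_train.size)      # 16300 100

# A single column is 1-D and lines up with y_train element for element.
column = X_train[:, 17]
print(column.shape)                    # (100,)
```

Printing .shape rather than len() is the quicker diagnostic whenever this ValueError appears.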

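Your independent variables (X) live in a 163-dimensional space. Each y value depends on the corresponding x value in every one of those dimensions. A simple 2-D scatter or line plot cannot show all of that information at once.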

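One thing you can do is find out which x variables your y values depend on most. You can access this through the base_regressor.feature_importances_ attribute. There is an example in the documentation. You can then make a plot against the most important ones. You can do this in several dimensions at once with a 3D scatter plot, or in even higher dimensions with something like corner.py.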
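
As a rough sketch of that idea, the snippet below fits a GradientBoostingRegressor on synthetic data and ranks the columns by feature_importances_. The data, the 10-column shape, the random_state value and the names importances, ranked and top are all assumptions for illustration; with the real 163-column X_train the same calls apply.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic data (an assumption): 500 samples, 10 feature columns.
rng = np.random.default_rng(0)
X_train = rng.random((500, 10))
# The target depends strongly on column 3 and only weakly on column 7.
y_train = 5.0 * X_train[:, 3] + 0.5 * X_train[:, 7] + 0.1 * rng.standard_normal(500)

base_regressor = GradientBoostingRegressor(random_state=0)
base_regressor.fit(X_train, y_train)

# One importance value per column; larger means the model relied on it more.
importances = base_regressor.feature_importances_

# Column indices sorted from most to least important.
ranked = np.argsort(importances)[::-1]
top = ranked[0]
print("most important column:", top)
print("top three columns:", ranked[:3])
```

The index in top can then stand in for the hard-coded 17, for example X_train[:, top] in the plt.scatter call and X_test[:, top] in the plt.plot call.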

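【Comments】:

Makes sense. I still don't see how to get what I want to see, though. Basically, I want to plot y against x for the training data, and then compare that with another plot of y against x for the test data (after prediction). Does that make sense? I just want to compare the shapes/distributions.
I edited my answer to include an explanation, since it was too long to fit in a comment. Hope it helps :)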