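【Posted】: 2016-05-28 20:48:16
【Question】:
I am reading a csv and trying to build a linear regression model of df['LSTAT'] (x/variable) vs. df['MEDV'] (y/target). However, at the model fitting stage the error message "ValueError: Found arrays with inconsistent numbers of samples: [1 343]" keeps popping up.
I have shaped/reshaped the data (not sure I did it correctly) and converted the pd.DataFrame to a numpy array and to a list. None of it worked. I still don't quite understand the problem after reading this post: sklearn: Found arrays with inconsistent numbers of samples when calling LinearRegression.fit(). The script and error messages are below.
Could anyone provide a solution with a detailed explanation? Thanks!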
import pandas as pd  # required by pd.read_csv below
import scipy.stats as stats
import pylab
import numpy as np
import matplotlib.pyplot as plt
import pylab as pl
import sklearn
from sklearn.cross_validation import train_test_split
from sklearn import datasets, linear_model
from sklearn.linear_model import LinearRegression
df=pd.read_csv("input.csv")
X_train1, X_test1, y_train1, y_test1 = train_test_split(df['LSTAT'],df['MEDV'],test_size=0.3,random_state=1)
lin=LinearRegression()
################## This line: " lin_train=lin.fit(X_train1,y_train1)" causes the trouble.
lin_train=lin.fit(X_train1,y_train1)
################## The following lines just print and plot the results after fitting the linear regression
# The coefficients
print('Coefficients: \n', lin.coef_)
# The mean squared error
print("Mean squared error: %.2f"
      % np.mean((lin.predict(X_test1) - y_test1) ** 2))
# Explained variance score: 1 is perfect prediction
print('Variance score: %.2f' % lin.score(X_test1, y_test1))
# Plot outputs
plt.scatter(X_test1, y_test1, color='black')
plt.plot(X_test1, lin.predict(X_test1), color='blue',linewidth=3)
plt.xticks(())
plt.yticks(())
plt.show()
Here are the warning and error messages:
Warning (from warnings module):
File "C:\Python27\Lib\site-packages\sklearn\utils\validation.py", line 386
DeprecationWarning)
DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and willraise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
Traceback (most recent call last):
File "C:/Users/Pin-Chih/Google Drive/Real_estate_projects/test.py", line 36, in <module>
lin_train=lin.fit(X_train1,y_train1)
File "C:\Python27\Lib\site-packages\sklearn\linear_model\base.py", line 427, in fit
y_numeric=True, multi_output=True)
File "C:\Python27\Lib\site-packages\sklearn\utils\validation.py", line 520, in check_X_y
check_consistent_length(X, y)
File "C:\Python27\Lib\site-packages\sklearn\utils\validation.py", line 176, in check_consistent_length
"%s" % str(uniques))
ValueError: Found arrays with inconsistent numbers of samples: [ 1 343]
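For reference, the "[ 1 343]" in the message can be reproduced with plain NumPy. This is only a sketch under an assumption: that this old scikit-learn version promotes a 1-D X to a single row (as the DeprecationWarning suggests), which is mimicked here with np.atleast_2d. The arrays x and y are synthetic stand-ins, not the real data.

```python
import numpy as np

# Synthetic stand-ins for X_train1 and y_train1 (343 values each).
x = np.arange(343, dtype=float)
y = 2.0 * x + 1.0

# Assumption: a 1-D X is treated like this, i.e. 1 sample with 343 features.
x_as_row = np.atleast_2d(x)    # shape (1, 343)

# What the warning recommends for a single feature: 343 samples, 1 feature.
x_as_col = x.reshape(-1, 1)    # shape (343, 1)

print(x.shape, x_as_row.shape, x_as_col.shape)
print(x_as_row.shape[0] == y.shape[0])  # row form: sample counts differ
print(x_as_col.shape[0] == y.shape[0])  # column form: sample counts match
```

Under that assumption, X is counted as 1 sample while y has 343, which would match the two numbers in the error.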
If I print out "X_train1":
X_train1:
61 26.82
294 12.86
39 29.29
458 4.85
412 8.05
Name: LSTAT, dtype: float64
If I print out "y_train1":
y_train1:
61 13.4
294 22.5
39 11.8
458 35.1
412 29.0
Name: MEDV, dtype: float64
【Discussion】:
Tags: python-2.7 pandas scikit-learn