nickchen121

Full table of contents for the "Artificial Intelligence: From Getting Started to Giving Up" tutorial: https://www.cnblogs.com/nickchen121/p/11686958.html

Multiple Linear Regression (Boston Housing Price Prediction)

1. Import modules

import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.font_manager import FontProperties
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
%matplotlib inline
font = FontProperties(fname='/Library/Fonts/Heiti.ttc')

2. Load the data

df = pd.read_csv('housing-data.txt', sep=r'\s+', header=0)
X = df.iloc[:, :-1].values
y = df['MEDV'].values
# Split the data into a training set (70%) and a test set (30%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
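
Because `train_test_split` shuffles the rows before splitting, the call above produces a different split on every run unless a seed is fixed. The following is a minimal sketch of a reproducible split, assuming synthetic stand-in data (`X_demo`, `y_demo`) and a hypothetical helper `split`; none of these names come from the original post.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the housing data (an assumption for illustration):
# 100 samples with 3 features and a noisy linear target.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)


def split(seed):
    """Hypothetical helper: a 70/30 split whose shuffle is fixed by `seed`."""
    return train_test_split(X_demo, y_demo, test_size=0.3, random_state=seed)


# Two calls with the same seed return identical partitions.
X_tr_a, X_te_a, y_tr_a, y_te_a = split(42)
X_tr_b, X_te_b, y_tr_b, y_te_b = split(42)
print(X_tr_a.shape, X_te_a.shape)
print(np.array_equal(X_tr_a, X_tr_b))
```

Fixing `random_state` is optional, but it makes any metrics computed from the split repeatable from one run to the next.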

3. Train the model

lr = LinearRegression()
# Fit the model on the training set
lr.fit(X_train, y_train)
# Predict on the training set
y_train_predict = lr.predict(X_train)
# Predict on the test set
y_test_predict = lr.predict(X_test)
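
`LinearRegression` fits an ordinary least squares model, that is, the weights and intercept that minimize the sum of squared residuals. As a rough illustration of what the fit computes, the sketch below solves the same kind of problem with `numpy.linalg.lstsq` on synthetic data; `X_demo`, `y_demo`, `true_w`, and `true_b` are made up for the example and are not the housing data.

```python
import numpy as np

# Synthetic data (an assumption): y = X @ true_w + true_b + small noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
true_w = np.array([1.5, -2.0, 0.5])
true_b = 4.0
y_demo = X_demo @ true_w + true_b + rng.normal(scale=0.01, size=200)

# Append a column of ones so the intercept is learned as the last weight.
X_aug = np.hstack([X_demo, np.ones((X_demo.shape[0], 1))])

# Least squares solution of X_aug @ theta ≈ y_demo.
theta, *_ = np.linalg.lstsq(X_aug, y_demo, rcond=None)
w_hat, b_hat = theta[:-1], theta[-1]

print("estimated weights:", w_hat)
print("estimated intercept:", b_hat)
```

On a fitted scikit-learn model the corresponding values are available as `lr.coef_` and `lr.intercept_`.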

4. Visualization

# Training residuals: y_train_predict - y_train
plt.scatter(y_train_predict, y_train_predict - y_train, c='r',
            marker='s', edgecolor='white', label='Training data')
# Test residuals: y_test_predict - y_test
plt.scatter(y_test_predict, y_test_predict - y_test, c='g',
            marker='o', edgecolor='white', label='Test data')
plt.xlabel('Predicted value', fontproperties=font)
plt.ylabel('Residual', fontproperties=font)
# Draw the horizontal line y=0, where the residual is zero
plt.hlines(y=0, xmin=-10, xmax=50, color='k')
plt.xlim(-10, 50)
plt.legend(prop=font)
plt.show()

5. Mean squared error evaluation

from sklearn.metrics import mean_squared_error

# Mean squared error on the training set
train_mse = mean_squared_error(y_train, y_train_predict)
# Mean squared error on the test set
test_mse = mean_squared_error(y_test, y_test_predict)
print('Training set MSE: {}'.format(train_mse))
print('Test set MSE: {}'.format(test_mse))
Training set MSE: 23.049177061822277
Test set MSE: 19.901828312902534

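The train/test split is random (no random_state is set), so these numbers change from run to run. In the printed output above, the training MSE is about 23.0 and the test MSE is about 19.9. In the run this discussion is based on, the training MSE was 19.4 and the test MSE was 28.4: the test error was larger, which suggests that the model overfit the training set.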
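
For reference, the mean squared error is the average of the squared differences between the true and predicted values. The sketch below computes it by hand on a tiny made-up example; the function name `mse` and the sample values are illustrative assumptions, not taken from the housing data.

```python
def mse(y_true, y_pred):
    """Mean squared error: the average of (true - predicted)^2."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return sum(squared_errors) / len(squared_errors)


# Made-up values: the errors are 1, 0 and -2, so the squared errors are 1, 0 and 4.
y_true_demo = [3.0, 5.0, 7.0]
y_pred_demo = [2.0, 5.0, 9.0]
demo_mse = mse(y_true_demo, y_pred_demo)
print(demo_mse)  # (1 + 0 + 4) / 3
```

Because the errors are squared, a few large residuals can dominate the metric, which is one reason the training and test values can differ noticeably between random splits.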
