Algorithm background: linear regression, least mean squares, and gradient descent. Reference: http://blog.kamidox.com/gradient-descent.html
The post gives a detailed introduction to the basic rules of calculus, the linear regression algorithm, and gradient descent and its refinements.
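The core idea of gradient descent can be shown in a few lines. The sketch below (illustrative names and data, not taken from the post) fits a one-variable linear model by repeatedly stepping against the gradient of the mean squared error:

```python
import numpy as np

# Fit y = w*x + b by gradient descent on the mean squared error (MSE).
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 100)
y = 3.0 * x + 2.0 + 0.1 * rng.standard_normal(100)  # true w=3, b=2, small noise

w, b, lr = 0.0, 0.0, 0.1          # initial parameters and learning rate
for _ in range(500):
    err = w * x + b - y           # residuals
    w -= lr * 2 * np.mean(err * x)  # dMSE/dw
    b -= lr * 2 * np.mean(err)      # dMSE/db

print(w, b)  # close to the true values 3.0 and 2.0
```

Each update moves the parameters a small step in the direction that decreases the MSE; the learning rate `lr` trades off speed against stability.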
a. Fitting a sine function with linear regression
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
# Generate 200 points on a sine curve over [-2*pi, 2*pi], with uniform noise in [-0.1, 0.1]
n_dots = 200
X = np.linspace(-2 * np.pi, 2 * np.pi, n_dots)
Y = np.sin(X) + 0.2 * np.random.rand(n_dots) - 0.1
# reshape to column vectors, since scikit-learn expects 2-D feature arrays
X = X.reshape(-1, 1)
Y = Y.reshape(-1, 1)
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline
def polynomial_model(degree=1):
    # degree is the order of the polynomial
    polynomial_features = PolynomialFeatures(degree=degree, include_bias=False)
    # normalize=True standardizes the features before fitting (it does not map them to [0, 1]);
    # the parameter was removed in scikit-learn 1.2, where a StandardScaler step should be used instead
    linear_regression = LinearRegression(normalize=True)
    pipeline = Pipeline([("polynomial_features", polynomial_features),
                         ("linear_regression", linear_regression)])
    return pipeline
# Fit the data set with polynomials of degree 2, 3, 5, and 10
# mean_squared_error computes the mean squared error; the smaller it is, the better the fit
from sklearn.metrics import mean_squared_error
degrees = [2, 3, 5, 10]
results = []
for d in degrees:
    model = polynomial_model(degree=d)
    model.fit(X, Y)
    train_score = model.score(X, Y)
    mse = mean_squared_error(Y, model.predict(X))
    results.append({"model": model, "degree": d, "score": train_score, "mse": mse})
for r in results:
    print("degree: {}; train_score: {}; mean squared error: {};".format(r["degree"], r["score"], r["mse"]))
degree: 2; train_score: 0.150098385123013; mean squared error: 0.4252061468860883;
degree: 3; train_score: 0.27885313996963546; mean squared error: 0.3607900871407268;
degree: 5; train_score: 0.8966304597537259; mean squared error: 0.05171582586046318;
degree: 10; train_score: 0.9931397128987751; mean squared error: 0.0034322046149616835;
The higher the polynomial degree, the higher the training score and the smaller the mean squared error, i.e. the better the fit on the training set.
Plot the different fits on one figure:
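The mean squared error reported above is simply the average of the squared residuals; a quick sanity check with made-up arrays (the values here are illustrative):

```python
import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.1, 1.9, 3.2])

# MSE by hand: mean of squared residuals
mse_by_hand = np.mean((y_true - y_pred) ** 2)
print(mse_by_hand == mean_squared_error(y_true, y_pred))
```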
from matplotlib.figure import SubplotParams
plt.figure(figsize=(6, 3), dpi=200, subplotpars=SubplotParams(hspace=0.5))
for i, r in enumerate(results):
    plt.subplot(2, 2, i + 1)
    plt.xlim(-8, 8)
    plt.title("LinearRegression degree={}".format(r["degree"]), fontsize=6)
    plt.xticks(np.linspace(-8, 8, 9), fontsize=5)
    plt.yticks(fontsize=5)
    plt.scatter(X, Y, s=1.5, c='b', alpha=0.5)
    plt.plot(X, r["model"].predict(X), 'r-', linewidth=1)
Plot the degree-10 polynomial model over the wider range [-20, +20]:
plt.figure(figsize=(6, 3), dpi=200)  # start a new figure rather than drawing into the last subplot
plt.title("LinearRegression degree={}".format(results[3]["degree"]), fontsize=10)
plt.xticks(np.linspace(-20, 20, 9), fontsize=10)
plt.yticks(fontsize=10)
plt.scatter(X, Y, s=1.5, c='b', alpha=0.5)
X1 = np.linspace(-20, 20, 400)
X1 = X1.reshape(-1, 1)
plt.plot(X1, results[3]["model"].predict(X1), 'r-', linewidth=1)
This shows that the fitted model is only reliable within the range of the training data; outside that range its predictions are unusable.
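The failure of extrapolation can also be checked numerically. This sketch uses numpy.polyfit as a stand-in for the pipeline above, fits a degree-10 polynomial to the sine curve on the training range, and compares in-range and out-of-range errors:

```python
import numpy as np

x = np.linspace(-2 * np.pi, 2 * np.pi, 200)
y = np.sin(x)
coeffs = np.polyfit(x, y, 10)      # degree-10 fit on the training range

x_out = np.linspace(10, 20, 100)   # entirely outside the training range
mse_in = np.mean((np.polyval(coeffs, x) - np.sin(x)) ** 2)
mse_out = np.mean((np.polyval(coeffs, x_out) - np.sin(x_out)) ** 2)
print(mse_in, mse_out)  # the out-of-range error is many orders of magnitude larger
```

Inside the training interval the polynomial tracks the sine closely, but beyond it the high-degree terms dominate and the prediction diverges rapidly.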
b. Predicting house prices
Predict Boston house prices with the data set bundled in sklearn.datasets:
from sklearn.datasets import load_boston  # removed in scikit-learn 1.2
from sklearn.model_selection import train_test_split
boston = load_boston()
X = boston.data
y = boston.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=3)
import time
from sklearn.linear_model import LinearRegression
model = LinearRegression()
# time the training: the difference between two perf_counter() calls is the elapsed time
# (the original used time.clock(), which was removed in Python 3.8)
start = time.perf_counter()
model.fit(X_train, y_train)
train_score = model.score(X_train, y_train)
test_score = model.score(X_test, y_test)
print('elapse: {0:.6f}; train score: {1:.6f}; test score: {2:.6f}'.format(time.perf_counter() - start, train_score, test_score))
elapse: 0.009990; train score: 0.723941; test score: 0.794958
The fit is mediocre, so the model needs optimizing.
Model optimization:
First, look at the data:
X[0]
array([6.320e-03, 1.800e+01, 2.310e+00, 0.000e+00, 5.380e-01, 6.575e+00,
6.520e+01, 4.090e+00, 1.000e+00, 2.960e+02, 1.530e+01, 3.969e+02,
4.980e+00])
The feature values span very different ranges: the smallest entries are on the order of 1e-3 while the largest are on the order of 1e2, so the features should be normalized:
model = LinearRegression(normalize=True)
Normalization, however, only speeds up convergence and makes training more efficient; it does not improve the model's accuracy.
Since the model underfits, it can be improved by mining more features or by adding polynomial features. Here we add polynomial features.
A degree-2 polynomial model:
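That scaling leaves the accuracy of ordinary least squares unchanged can be checked directly. This sketch (pure numpy, illustrative data) solves least squares with and without rescaling the columns and shows the predictions are identical:

```python
import numpy as np

rng = np.random.default_rng(1)
# three features on wildly different scales, like the Boston data
X = rng.uniform(0, 1, (50, 3)) * np.array([1e-3, 1.0, 1e2])
y = X @ np.array([2.0, -1.0, 0.5]) + rng.standard_normal(50) * 0.01

def fit_predict(X, y):
    A = np.hstack([X, np.ones((len(X), 1))])   # add an intercept column
    w, *_ = np.linalg.lstsq(A, y, rcond=None)  # closed-form least squares
    return A @ w

scale = X.std(axis=0)
pred_raw = fit_predict(X, y)
pred_scaled = fit_predict(X / scale, y)        # same fit on rescaled features
print(np.allclose(pred_raw, pred_scaled))      # True: identical predictions
```

Rescaling a column is an invertible change of basis, so the closed-form solution produces the same fitted values; scaling matters only for iterative solvers such as gradient descent.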
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline
def polynomial_model(degree=1):
    polynomial_features = PolynomialFeatures(degree=degree, include_bias=False)
    linear_regression = LinearRegression(normalize=True)
    pipeline = Pipeline([('polynomial_features', polynomial_features),
                         ('linear_regression', linear_regression)])
    return pipeline
# degree-2 polynomial
model = polynomial_model(degree=2)
start = time.perf_counter()
model.fit(X_train, y_train)
train_score = model.score(X_train, y_train)
test_score = model.score(X_test, y_test)
print('elapse: {0:.6f}; train score: {1:.6f}; test score: {2:.6f}'.format(time.perf_counter() - start, train_score, test_score))
elapse: 0.036802; train score: 0.930547; test score: 0.860465
Both the training score and the test score improve, so this optimization works well.
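The improvement comes from the enlarged feature set: with degree=2 and include_bias=False, PolynomialFeatures turns the 13 Boston features into 13 linear terms plus 91 squares and pairwise products, 104 features in total. A quick check:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

row = np.zeros((1, 13))  # one sample with 13 features, like the Boston data
expanded = PolynomialFeatures(degree=2, include_bias=False).fit_transform(row)
print(expanded.shape)  # (1, 104): 13 linear + 13 squares + C(13,2)=78 products
```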
A degree-3 polynomial model:
# degree-3 polynomial
model = polynomial_model(degree=3)
start = time.perf_counter()
model.fit(X_train, y_train)
train_score = model.score(X_train, y_train)
test_score = model.score(X_test, y_test)
print('elapse: {0:.6f}; train score: {1:.6f}; test score: {2:.6f}'.format(time.perf_counter() - start, train_score, test_score))
elapse: 0.090905; train score: 1.000000; test score: -105.517016
The training score is 100% while the test score is negative, so the model overfits.
Learning curves:
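The plot_learning_curve helper used below is not defined in this section. A minimal sketch, following the classic scikit-learn learning-curve example (the exact helper in the book may differ in details):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve

def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None, n_jobs=1,
                        train_sizes=np.linspace(0.1, 1.0, 5)):
    """Plot mean train/cross-validation scores against training-set size."""
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel("Training examples")
    plt.ylabel("Score")
    sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
    train_mean = np.mean(train_scores, axis=1)
    train_std = np.std(train_scores, axis=1)
    test_mean = np.mean(test_scores, axis=1)
    test_std = np.std(test_scores, axis=1)
    plt.grid()
    # shade one standard deviation around each mean curve
    plt.fill_between(sizes, train_mean - train_std, train_mean + train_std,
                     alpha=0.1, color="r")
    plt.fill_between(sizes, test_mean - test_std, test_mean + test_std,
                     alpha=0.1, color="g")
    plt.plot(sizes, train_mean, "o-", color="r", label="Training score")
    plt.plot(sizes, test_mean, "o-", color="g", label="Cross-validation score")
    plt.legend(loc="best")
    return plt
```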
from sklearn.model_selection import ShuffleSplit
cv = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
title = "Learning curve (degree={0})"
degrees = [1, 2, 3]
start = time.perf_counter()
plt.figure(figsize=(18, 4), dpi=200)
for i in range(len(degrees)):
    plt.subplot(1, 3, i + 1)
    plot_learning_curve(polynomial_model(degrees[i]), title.format(degrees[i]), X, y, ylim=(0.01, 1.01), cv=cv)
print('elapse: {0:.6f}'.format(time.perf_counter() - start))
The degree-1 polynomial underfits: its training score is low. The degree-3 polynomial overfits: its training score is 1, while its cross-validation score is too low to appear in the plotted range.
The degree-2 polynomial fits best, but the gap between its training and cross-validation scores is still large, which suggests the training set is too small.
References:
黄永昌, 《scikit-learn机器学习》 (Machine Learning with scikit-learn)