Implementing the AdaBoost algorithm, following the example in *Machine Learning in Action*.
1. AdaBoost Algorithm Principles
Boosting builds a strong learner out of many weak learners: the final prediction is a weighted vote of the weak learners, and the weights are not equal, with each classifier's weight reflecting how well it performed in its training round. Every training sample also carries a weight; based on the previous classifier's results, the weights of correctly classified samples are decreased and the weights of misclassified samples are increased.
The error rate is the proportion of misclassified samples (with sample weights, the sum of the weights of the misclassified samples):

ε = (number of misclassified samples) / (total number of samples)

The weight update formula for each classifier:

α = (1/2) · ln((1 − ε) / ε)

The update of the overall sample distribution, where Z is a normalization factor so that D sums to 1:

D_i^(t+1) = D_i^(t) · e^(−α) / Z   if sample i was classified correctly
D_i^(t+1) = D_i^(t) · e^(α) / Z    if sample i was misclassified
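As a quick numeric check of these two updates, consider a hypothetical round with five equally weighted samples of which exactly one is misclassified, so ε = 0.2:

```python
import math

# Hypothetical round: 5 samples with equal weight 0.2, the first one misclassified
eps = 0.2                                  # weighted error rate
alpha = 0.5 * math.log((1 - eps) / eps)    # classifier weight, ln(2) ~ 0.693

# distribution update: misclassified weights grow by e^alpha,
# correctly classified weights shrink by e^-alpha, then normalize
D = [0.2] * 5
correct = [False, True, True, True, True]
D = [w * math.exp(-alpha if c else alpha) for w, c in zip(D, correct)]
Z = sum(D)                                 # normalization factor
D = [w / Z for w in D]
print(alpha)  # ~0.6931
print(D)      # ~[0.5, 0.125, 0.125, 0.125, 0.125]
```

The misclassified sample's weight doubles relative to the rest, which is exactly the behavior described above.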
2. Building the Base Classifier: the Decision Stump
Given samples (x_i, y_i), where x_i = {x_i1, x_i2, ..., x_id} has d attributes and each attribute ranges over some interval of values, the goal is to find a single attribute of x_i, a threshold on its value, and an inequality direction whose resulting classification of the whole sample set has the smallest weighted error.
```python
from numpy import *

def buildStump(dataArr, classLabels, D):
    # data matrix and label column vector
    dataMatrix = mat(dataArr); labelMat = mat(classLabels).T
    m, n = shape(dataMatrix)
    # number of threshold steps to try per feature
    numSteps = 10.0; bestStump = {}; bestClasEst = mat(zeros((m, 1)))
    minError = inf  # init best error to +infinity
    for i in range(n):  # loop over all features
        # range of values this feature takes
        rangeMin = dataMatrix[:, i].min(); rangeMax = dataMatrix[:, i].max()
        # length of one threshold step
        stepSize = (rangeMax - rangeMin) / numSteps
        for j in range(-1, int(numSteps) + 1):  # loop over all candidate thresholds
            for inequal in ['lt', 'gt']:  # try both inequality directions
                # candidate threshold
                threshVal = rangeMin + float(j) * stepSize
                # predicted classes under this stump
                predictedVals = stumpClassify(dataMatrix, i, threshVal, inequal)
                # per-sample error indicator: 1 if misclassified, 0 otherwise
                errArr = mat(ones((m, 1)))
                errArr[predictedVals == labelMat] = 0
                # weighted error rate
                weightedError = D.T * errArr
                # remember the stump with the smallest weighted error so far
                if weightedError < minError:
                    minError = weightedError
                    bestClasEst = predictedVals.copy()
                    bestStump['dim'] = i
                    bestStump['thresh'] = threshVal
                    bestStump['ineq'] = inequal
    return bestStump, minError, bestClasEst

def stumpClassify(dataMatrix, dimen, threshVal, threshIneq):
    # classify all samples on one feature against a threshold and inequality
    retArray = ones((shape(dataMatrix)[0], 1))
    if threshIneq == 'lt':
        retArray[dataMatrix[:, dimen] <= threshVal] = -1.0
    else:
        retArray[dataMatrix[:, dimen] > threshVal] = -1.0
    return retArray
```
Load the data set and plot the distribution of the points:
```python
def loadSimpData():
    datMat = matrix([[1. , 2.1],
                     [2. , 1.1],
                     [1.3, 1. ],
                     [1. , 1. ],
                     [2. , 1. ]])
    classLabels = [1.0, 1.0, -1.0, -1.0, 1.0]
    return datMat, classLabels

def plotfig(d):
    from matplotlib import pyplot as plt
    xMat = mat(d)
    fig = plt.figure()
    ax = fig.add_subplot(111)
    # flatten().A[0] turns a matrix column into a 1-D array for scatter
    ax.scatter(xMat[:, 0].flatten().A[0], xMat[:, 1].flatten().A[0])
    plt.show()
```
To find the decision stump with the smallest weighted error on the plot above, we sweep thresholds along both the x and y axes in fixed steps, try both the "greater than" and "less than" conditions, compute the weighted classification error over all samples, and keep the best combination of feature, threshold, and inequality.
The best split is the first feature, at threshold 1.3, with the 'lt' condition.
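A quick NumPy-only sanity check that this stump (feature 0, threshold 1.3, 'lt') really attains weighted error 0.2 on the five toy points:

```python
import numpy as np

X = np.array([[1., 2.1], [2., 1.1], [1.3, 1.], [1., 1.], [2., 1.]])
y = np.array([1., 1., -1., -1., 1.])
D = np.full(5, 0.2)                     # uniform initial weights

# stump: feature 0, threshold 1.3, 'lt' -> predict -1 where x[0] <= 1.3
pred = np.where(X[:, 0] <= 1.3, -1.0, 1.0)
weighted_error = D[pred != y].sum()
print(weighted_error)  # 0.2 (only the first point is misclassified)
```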
3. AdaBoost Implementation
```python
def adaBoostTrainDS(dataArr, classLabels, numIt=40):
    # list of weak classifiers
    weakClassArr = []
    m = shape(dataArr)[0]
    # initial sample weights: all equal
    D = mat(ones((m, 1)) / m)
    # running (aggregated) class estimate for each sample
    aggClassEst = mat(zeros((m, 1)))
    for i in range(numIt):
        # best decision stump under the current weights
        bestStump, error, classEst = buildStump(dataArr, classLabels, D)
        # classifier weight; max(error, 1e-16) guards against division by zero
        alpha = float(0.5 * log((1.0 - error) / max(error, 1e-16)))
        bestStump['alpha'] = alpha
        # store this stump's parameters
        weakClassArr.append(bestStump)
        # update the sample distribution for the next iteration
        expon = multiply(-1 * alpha * mat(classLabels).T, classEst)
        D = multiply(D, exp(expon))
        D = D / D.sum()
        # training error of the aggregated classifier; stop early if it hits 0
        aggClassEst += alpha * classEst
        aggErrors = multiply(sign(aggClassEst) != mat(classLabels).T, ones((m, 1)))
        errorRate = aggErrors.sum() / m
        print("total error: ", errorRate)
        if errorRate == 0.0: break
    return weakClassArr, aggClassEst
```
The initial sample weights are all set equal, to 1/m.
For each iteration:
first build the decision stump with the smallest weighted error, which returns the stump parameters, the minimum error, and the class predictions;
compute the classifier's weight alpha;
update the sample weight vector D;
accumulate the weighted class estimate for every sample;
compute the overall error rate, and stop early if every sample is classified correctly.
Watch how the sample weight vector D and each sample's aggregated estimate aggClassEst evolve from round to round. Each weak classifier is stored in a dict with keys dim, thresh, ineq, and alpha.
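The steps above can be condensed into a self-contained sketch that uses plain NumPy arrays instead of matrix; the function and variable names here are my own, not the book's. On the toy data it stops after three rounds with zero training error:

```python
import numpy as np

def stump_predict(X, dim, thresh, ineq):
    # classify on one feature: -1 on one side of the threshold, +1 on the other
    pred = np.ones(len(X))
    if ineq == 'lt':
        pred[X[:, dim] <= thresh] = -1.0
    else:
        pred[X[:, dim] > thresh] = -1.0
    return pred

def best_stump(X, y, D):
    # exhaustive search over features, thresholds, and inequality directions
    m, n = X.shape
    best, best_err, best_pred = None, np.inf, None
    for dim in range(n):
        lo, hi = X[:, dim].min(), X[:, dim].max()
        step = (hi - lo) / 10.0
        for j in range(-1, 11):
            for ineq in ('lt', 'gt'):
                thresh = lo + j * step
                pred = stump_predict(X, dim, thresh, ineq)
                err = D[pred != y].sum()          # weighted error
                if err < best_err:
                    best, best_err, best_pred = (dim, thresh, ineq), err, pred
    return best, best_err, best_pred

def adaboost(X, y, num_it=40):
    m = len(X)
    D = np.full(m, 1.0 / m)                       # equal initial weights
    agg = np.zeros(m)                             # aggregated class estimate
    classifiers = []
    for _ in range(num_it):
        stump, err, pred = best_stump(X, y, D)
        alpha = 0.5 * np.log((1.0 - err) / max(err, 1e-16))
        classifiers.append((stump, alpha))
        D *= np.exp(-alpha * y * pred)            # reweight samples
        D /= D.sum()
        agg += alpha * pred
        if np.mean(np.sign(agg) != y) == 0.0:     # stop once training error is 0
            break
    return classifiers, agg

X = np.array([[1., 2.1], [2., 1.1], [1.3, 1.], [1., 1.], [2., 1.]])
y = np.array([1., 1., -1., -1., 1.])
clfs, agg = adaboost(X, y)
print(len(clfs))  # 3 weak learners suffice on this toy set
```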
4. Testing AdaBoost Classification
To classify, collect each weak classifier's prediction, weight it by that classifier's alpha, sum the results, and pass the sum through the sign function to get the class label (-1 or +1).
```python
def adaClassify(datToClass, classifierArr):
    # mirrors the aggClassEst accumulation in adaBoostTrainDS
    dataMatrix = mat(datToClass)
    m = shape(dataMatrix)[0]
    aggClassEst = mat(zeros((m, 1)))
    for i in range(len(classifierArr)):
        classEst = stumpClassify(dataMatrix, classifierArr[i]['dim'],
                                 classifierArr[i]['thresh'],
                                 classifierArr[i]['ineq'])
        aggClassEst += classifierArr[i]['alpha'] * classEst
        print(aggClassEst)
    return sign(aggClassEst)
```
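The aggregation step is easy to check in isolation. With hypothetical outputs from three weak classifiers (the prediction matrix and alpha values below are made up for illustration), the final label is the sign of the alpha-weighted sum:

```python
import numpy as np

# Hypothetical predictions of three weak classifiers (rows) for four samples (columns)
preds = np.array([[ 1, -1,  1, -1],
                  [-1,  1,  1, -1],
                  [-1,  1, -1,  1]], dtype=float)
alphas = np.array([0.69, 0.97, 0.90])   # made-up classifier weights

agg = (alphas[:, None] * preds).sum(axis=0)  # weighted sum per sample
labels = np.sign(agg)
print(labels)  # [-1.  1.  1. -1.]
```

Note that the second classifier, with the largest alpha, wins every disagreement here: a higher-accuracy weak learner gets a proportionally larger say.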
Example input data:
Load the data set: the first n-1 columns become the data matrix, and the last column becomes the class label vector.
```python
def loadDataSet(fileName):
    # general function to parse tab-delimited floats
    numFeat = len(open(fileName).readline().split('\t'))  # number of fields
    dataMat = []; labelMat = []
    fr = open(fileName)
    for line in fr.readlines():
        lineArr = []
        curLine = line.strip().split('\t')
        for i in range(numFeat - 1):
            lineArr.append(float(curLine[i]))
        dataMat.append(lineArr)
        labelMat.append(float(curLine[-1]))
    return dataMat, labelMat
```
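loadDataSet can be exercised without the horse-colic files by first writing a tiny tab-delimited file; the two data rows below are made up, and numpy.loadtxt stands in for the hand-written parsing loop:

```python
import os
import tempfile
import numpy as np

# Write a minimal tab-delimited file: two features plus a class label per row
rows = "1.0\t2.1\t1.0\n2.0\t1.1\t-1.0\n"
fd, path = tempfile.mkstemp(suffix='.txt')
with os.fdopen(fd, 'w') as f:
    f.write(rows)

# Same convention as loadDataSet: all columns but the last are features
data = np.loadtxt(path, delimiter='\t')
dataMat, labelMat = data[:, :-1].tolist(), data[:, -1].tolist()
os.remove(path)
print(dataMat)   # [[1.0, 2.1], [2.0, 1.1]]
print(labelMat)  # [1.0, -1.0]
```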
Using the data above, test how the number of weak classifiers affects the training and test error rates.
Modified code:
```python
def adaBoostTrainDS(dataArr, classLabels, numIt=40):
    # list of weak classifiers
    weakClassArr = []
    m = shape(dataArr)[0]
    # initial sample weights: all equal
    D = mat(ones((m, 1)) / m)
    # running (aggregated) class estimate for each sample
    aggClassEst = mat(zeros((m, 1)))
    for i in range(numIt):
        # best decision stump under the current weights
        bestStump, error, classEst = buildStump(dataArr, classLabels, D)
        # classifier weight; max(error, 1e-16) guards against division by zero
        alpha = float(0.5 * log((1.0 - error) / max(error, 1e-16)))
        bestStump['alpha'] = alpha
        weakClassArr.append(bestStump)
        # update the sample distribution for the next iteration
        expon = multiply(-1 * alpha * mat(classLabels).T, classEst)
        D = multiply(D, exp(expon))
        D = D / D.sum()
        # aggregated training error; stop early if it reaches 0
        aggClassEst += alpha * classEst
        aggErrors = multiply(sign(aggClassEst) != mat(classLabels).T, ones((m, 1)))
        errorRate = aggErrors.sum() / m
        if errorRate == 0.0: break
    # only the final iteration's training error is printed
    print("train error: ", errorRate)
    return weakClassArr, aggClassEst
```
```python
def adaClassify(datToClass, classifierArr):
    dataMatrix = mat(datToClass)
    m = shape(dataMatrix)[0]
    aggClassEst = mat(zeros((m, 1)))
    # classifierArr is the whole tuple returned by adaBoostTrainDS,
    # so the list of weak classifiers is classifierArr[0]
    for i in range(len(classifierArr[0])):
        classEst = stumpClassify(dataMatrix, classifierArr[0][i]['dim'],
                                 classifierArr[0][i]['thresh'],
                                 classifierArr[0][i]['ineq'])
        aggClassEst += classifierArr[0][i]['alpha'] * classEst
    return sign(aggClassEst)

def test(numIt):
    dt, lt = loadDataSet('horseColicTraining2.txt')
    cft = adaBoostTrainDS(dt, lt, numIt)
    testd, testl = loadDataSet('horseColicTest2.txt')
    pn = adaClassify(testd, cft)
    # the test set has 67 samples
    errArr = mat(ones((67, 1)))
    print("test error", errArr[pn != mat(testl).T].sum() / 67)
```
During training, only the error rate of the final iteration is now printed.
The classification code also needs a change: the weak learners arrive wrapped in the tuple returned by adaBoostTrainDS, so the number of learners is len(classifierArr[0]) and each stump's attributes are read as classifierArr[0][i]['thresh'], classifierArr[0][i]['ineq'], and classifierArr[0][i]['alpha'].
Running the example code unmodified raises an indexing error at this point; changing the indexing as shown fixes it.
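The root cause is that adaBoostTrainDS returns a tuple and test passes that whole tuple into adaClassify. Indexing with classifierArr[0] works, but unpacking the tuple at the call site avoids touching adaClassify at all. A minimal sketch of the two options, with a stub standing in for the real trainer:

```python
def adaBoostTrainDS_stub():
    # stands in for the real trainer: returns (weakClassArr, aggClassEst)
    weakClassArr = [{'dim': 0, 'thresh': 1.3, 'ineq': 'lt', 'alpha': 0.69}]
    aggClassEst = [[0.69]]
    return weakClassArr, aggClassEst

cft = adaBoostTrainDS_stub()          # cft is a tuple of two values
classifiers = cft[0]                  # option 1: index into the tuple, as done above
weak, agg = adaBoostTrainDS_stub()    # option 2: unpack, then pass weak along
assert classifiers == weak
print(classifiers[0]['alpha'])        # 0.69
```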
Now vary the number of iterations: 1, 5, 10, 30, 50, 80, 100, 200.
As the number of weak learners grows, the training error decreases monotonically, while the test error first falls and then rises again, reaching its best value around numIt = 50.
The reason: as the number of weak learners keeps increasing, the model begins to overfit.