KNN算法与Kd树 - 爱码网

最近邻法和k-近邻法

　　下面图片中只有三种豆，有三个豆是未知的种类，如何判定他们的种类？

KNN算法与Kd树

　　提供一种思路，即：未知的豆离哪种豆最近就认为未知豆和该豆是同一种类。由此，我们引出最近邻算法的定义：为了判定未知样本的类别，以全部训练样本作为代表点，计算未知样本与所有训练样本的距离，并以最近邻者的类别作为决策未知样本类别的唯一依据。但是，最近邻算法明显是存在缺陷的，比如下面的例子：有一个未知形状(图中绿色的圆点)，如何判断它是什么形状？

KNN算法与Kd树

　　显然，最近邻算法的缺陷——对噪声数据过于敏感，为了解决这个问题，我们可以可以把未知样本周边的多个最近样本计算在内，扩大参与决策的样本量，以避免个别数据直接决定决策结果。由此，我们引进K-最近邻算法。K-最近邻算法是最近邻算法的一个延伸。基本思路是：选择未知样本一定范围内确定个数的K个样本，该K个样本大多数属于某一类型，则未知样本判定为该类型。如何选择一个最佳的K值取决于数据。一般情况下，在分类时较大的K值能够减小噪声的影响，但会使类别之间的界限变得模糊。待测样本（绿色圆圈）既可能分到红色三角形类，也可能分到蓝色正方形类。如果k取3，从图可见，待测样本的3个邻居在实线的内圆里，按多数投票结果，它属于红色三角形类。但是如果k取5，那么待测样本的最邻近的5个样本在虚线的圆里，按表决法，它又属于蓝色正方形类。在实际应用中，K先取一个比较小的数值，再采用交叉验证法来逐步调整K值，最终选择适合该样本的最优的K值。

KNN算法实现

算法基本步骤：

1）计算待分类点与已知类别的点之间的距离

2）按照距离递增次序排序

3）选取与待分类点距离最小的k个点

4）确定前k个点所在类别的出现次数

5）返回前k个点出现次数最高的类别作为待分类点的预测分类

　　下面是一个按照算法基本步骤用python实现的简单例子，根据已分类的4个样本点来预测未知点(图中的灰点)的分类：

KNN算法与Kd树

from numpy import *  

# create a dataset which contains 4 samples with 2 classes  
def createDataSet():  
    # create a matrix: each row as a sample  
    group = array([[1.0, 0.9], [1.0, 1.0], [0.1, 0.2], [0.0, 0.1]])  
    labels = ['A', 'A', 'B', 'B'] # four samples and two classes  
    return group, labels

# classify using kNN (k Nearest Neighbors )  
# Input:      newInput: 1 x N
#             dataSet:  M x N (M samples N, features)
#             labels:   1 x M   
#             k: number of neighbors to use for comparison  
# Output:     the most popular class label   
def kNNClassify(newInput, dataSet, labels, k):  
    numSamples = dataSet.shape[0] # shape[0] stands for the num of row  
  
    ## step 1: calculate Euclidean distance  
    # tile(A, reps): Construct an array by repeating A reps times  
    # the following copy numSamples rows for dataSet  
    diff = tile(newInput, (numSamples, 1)) - dataSet # Subtract element-wise  
    squaredDiff = diff ** 2 # squared for the subtract  
    squaredDist = sum(squaredDiff, axis = 1) # sum is performed by row  
    distance = squaredDist ** 0.5  
  
    ## step 2: sort the distance  
    # argsort() returns the indices that would sort an array in a ascending order  
    sortedDistIndices = argsort(distance)  
  
    classCount = {} # define a dictionary (can be append element)  
    for i in xrange(k):  
        ## step 3: choose the min k distance  
        voteLabel = labels[sortedDistIndices[i]]  
  
        ## step 4: count the times labels occur  
        # when the key voteLabel is not in dictionary classCount, get()  
        # will return 0  
        classCount[voteLabel] = classCount.get(voteLabel, 0) + 1  
  
    ## step 5: the max voted class will return  
    maxCount = 0  
    for key, value in classCount.items():  
        if value > maxCount:  
            maxCount = value  
            maxIndex = key  
  
    return maxIndex   
    
    
if __name__== "__main__":    
    dataSet, labels = createDataSet()  
      
    testX = array([1.2, 1.0])  
    k = 3  
    outputLabel = kNNClassify(testX, dataSet, labels, 3)  
    print "Your input is:", testX, "and classified to class: ", outputLabel  
      
    testX = array([0.1, 0.3])  
    outputLabel = kNNClassify(testX, dataSet, labels, 3)  
    print "Your input is:", testX, "and classified to class: ", outputLabel

　　结果如下：
Your input is: [ 1.2 1. ] and classified to class: A
Your input is: [ 0.1 0.3] and classified to class: B

　　OpenCV中也提供了机器学习的相关算法，其中KNN算法的最基本例子如下

import numpy as np
import matplotlib.pyplot as plt
import cv2

# Feature set containing (x,y) values of 25 known/training data
trainData = np.random.randint(0,100,(25,2)).astype(np.float32)

# Labels each one either Red or Blue with numbers 0 and 1
responses = np.random.randint(0,2,(25,1)).astype(np.float32)

# Take Red families and plot them
red = trainData[responses.ravel()==0]
plt.scatter(red[:,0],red[:,1],80,'r','^')

# Take Blue families and plot them
blue = trainData[responses.ravel()==1]
plt.scatter(blue[:,0],blue[:,1],80,'b','s')

# Testing data
newcomer = np.random.randint(0,100,(1,2)).astype(np.float32)
plt.scatter(newcomer[:,0],newcomer[:,1],80,'g','o')

knn = cv2.KNearest()
knn.train(trainData,responses) # Trains the model
# Finds the neighbors and predicts responses for input vectors.
ret, results, neighbours ,dist = knn.find_nearest(newcomer, 3)

print "result: ", results,"\n"
print "neighbours: ", neighbours,"\n"
print "distance: ", dist
plt.show()

View Code