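CS231n: Image Classification (KNN Implementation)

Image Classification

Goal: given a fixed set of category labels, pick the one label from that set that best describes an input image and assign it to that image.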

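The Image Classification Pipeline

Input: a set of N images, each labeled with one of K category labels. This set is called the training set.
Learning: use the training set to learn what each class looks like. This step is usually called training a classifier or learning a model.
Evaluation: have the classifier predict labels for images it has never seen, then compare the predicted labels with the true labels to judge the quality of the classifier.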
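The evaluation step can be sketched in a few lines. This is a minimal illustration rather than part of the course code: the helper name `accuracy` and the toy label arrays are made up for the example.

```python
import numpy as np

def accuracy(y_pred, y_true):
    """Fraction of predictions that match the true labels."""
    y_pred = np.asarray(y_pred)
    y_true = np.asarray(y_true)
    return float(np.mean(y_pred == y_true))

# Toy labels (hypothetical): 4 of the 5 predictions are correct.
y_true = np.array([0, 1, 2, 1, 0])
y_pred = np.array([0, 1, 2, 2, 0])
print(accuracy(y_pred, y_true))  # 0.8
```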

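Nearest Neighbor Classifier

Dataset: CIFAR-10. This is a very popular image classification dataset of 60,000 small 32x32 images, each labeled with one of 10 classes. The 60,000 images are split into a training set of 50,000 images and a test set of 10,000 images.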


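The idea behind Nearest Neighbor image classification: compare the test image with every image in the training set, and assign it the label of the training image it is most similar to.
How do we compare two images?
In this example, that means comparing two 32x32x3 blocks of pixels. The simplest approach is to compare the images pixel by pixel and add up all the differences. In other words, flatten the two images into vectors I_1 and I_2 and compute the L1 distance between them:
$$d_1(I_1, I_2) = \sum_p \left| I_1^p - I_2^p \right|$$
Here the sum runs over all pixels. The whole comparison is illustrated in the figure below: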

[Figure: pixel-by-pixel L1 distance computed between two example images]
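The Nearest Neighbor idea with the L1 distance can be sketched as follows. This is an illustrative sketch, not the assignment code: the class name `NearestNeighborL1` and the four-pixel toy "images" are assumptions made for the example.

```python
import numpy as np

class NearestNeighborL1:
    """1-nearest-neighbor classifier using the L1 distance (illustrative sketch)."""

    def train(self, X, y):
        # "Training" only memorizes the data. Casting to float avoids
        # wrap-around when subtracting uint8 pixel values.
        self.X_train = np.asarray(X, dtype=np.float64)
        self.y_train = np.asarray(y)

    def predict(self, X):
        X = np.asarray(X, dtype=np.float64)
        y_pred = np.empty(X.shape[0], dtype=self.y_train.dtype)
        for i in range(X.shape[0]):
            # L1 distance from test row i to every training row (broadcasting).
            dists = np.sum(np.abs(self.X_train - X[i]), axis=1)
            # Copy the label of the closest training row.
            y_pred[i] = self.y_train[np.argmin(dists)]
        return y_pred

# Tiny made-up "images" with 4 pixels each.
X_train = np.array([[0, 0, 0, 0], [10, 10, 10, 10], [255, 255, 255, 255]])
y_train = np.array([0, 1, 2])
nn = NearestNeighborL1()
nn.train(X_train, y_train)
print(nn.predict(np.array([[9, 12, 8, 11], [250, 240, 255, 251]])))  # [1 2]
```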
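There are many ways to compute the distance between vectors. Another common choice is the L2 distance, which geometrically is the Euclidean distance between the two vectors. The L2 distance is:
$$d_2(I_1, I_2) = \sqrt{\sum_p \left( I_1^p - I_2^p \right)^2}$$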

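L1 vs. L2: it is interesting to compare the two metrics. When it comes to differences between two vectors, L2 is much less forgiving than L1: it prefers many medium-sized differences to one large one. L1 and L2 are the most commonly used special cases of the p-norm.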
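A small worked example makes the difference concrete. The vectors below are made up so that both have the same L1 distance from the origin; the helper names `l1_distance` and `l2_distance` are hypothetical.

```python
import numpy as np

def l1_distance(a, b):
    """Sum of absolute element-wise differences."""
    return float(np.sum(np.abs(a - b)))

def l2_distance(a, b):
    """Euclidean distance."""
    return float(np.sqrt(np.sum((a - b) ** 2)))

base = np.zeros(4)
one_big = np.array([8.0, 0.0, 0.0, 0.0])      # one large difference
many_medium = np.array([2.0, 2.0, 2.0, 2.0])  # four moderate differences

print(l1_distance(base, one_big), l1_distance(base, many_medium))  # 8.0 8.0
print(l2_distance(base, one_big), l2_distance(base, many_medium))  # 8.0 4.0
```

L1 treats the two cases identically, while L2 assigns the single large difference twice the distance of the four moderate ones.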

k-Nearest Neighbor Classifier (KNN)

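The idea behind KNN image classification: instead of using only the label of the single closest image, find the k most similar training images and let them vote on the label of the test image; the label with the most votes becomes the prediction.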
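The voting step for a single test image might look like the sketch below. The function name `knn_vote` and the distances and labels are invented for illustration; labels are assumed to be non-negative integers, which `np.bincount` requires.

```python
import numpy as np

def knn_vote(dists, y_train, k):
    """Label chosen by majority vote among the k closest training points."""
    nearest = np.argsort(dists)[:k]        # indices of the k smallest distances
    votes = np.bincount(y_train[nearest])  # votes[c] = number of neighbors with label c
    return int(np.argmax(votes))           # ties go to the smaller label

# Distances from one test image to 5 training images, and their labels.
dists = np.array([0.9, 0.2, 0.4, 0.1, 0.7])
y_train = np.array([2, 1, 1, 0, 2])
print(knn_vote(dists, y_train, k=1))  # 0
print(knn_vote(dists, y_train, k=3))  # 1
```

With k=1 the single closest neighbor (label 0) decides; with k=3 the two neighbors labeled 1 outvote it.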
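How do we choose k?
Cross-validation: suppose the training set has 1,000 images. Split it evenly into 5 folds, using 4 for training and 1 for validation. Then rotate through the folds so that each one serves as the validation fold once, and average the 5 validation results to get the validation score of the algorithm.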

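[Figure: 5-fold cross-validation accuracy for different values of k]
This is an example of tuning k with 5-fold cross-validation. For each value of k we get 5 accuracy results, take their mean, and connect the means for the different values of k with a line. In this example the algorithm performs best at k=10 (the accuracy peak in the figure). Splitting the training set into more folds generally gives a smoother (less noisy) curve.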
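The fold rotation described above can be sketched like this. It is a minimal sketch under stated assumptions: the helpers `knn_predict` and `cross_validate` are hypothetical, and the data is a synthetic two-cluster stand-in, not CIFAR-10.

```python
import numpy as np

def knn_predict(X_train, y_train, X_test, k):
    """Plain k-NN with L2 distance and majority voting."""
    # Squared distances via broadcasting: shape (num_test, num_train).
    d2 = np.sum((X_test[:, None, :] - X_train[None, :, :]) ** 2, axis=2)
    nearest = np.argsort(d2, axis=1)[:, :k]
    return np.array([np.argmax(np.bincount(y_train[row])) for row in nearest])

def cross_validate(X, y, k_choices, num_folds=5):
    """Mean validation accuracy for each k over num_folds folds."""
    X_folds = np.array_split(X, num_folds)
    y_folds = np.array_split(y, num_folds)
    results = {}
    for k in k_choices:
        accs = []
        for i in range(num_folds):
            # Fold i is held out for validation; the rest is used for training.
            X_val, y_val = X_folds[i], y_folds[i]
            X_tr = np.concatenate(X_folds[:i] + X_folds[i + 1:])
            y_tr = np.concatenate(y_folds[:i] + y_folds[i + 1:])
            accs.append(np.mean(knn_predict(X_tr, y_tr, X_val, k) == y_val))
        results[k] = float(np.mean(accs))
    return results

# Synthetic stand-in data: two well-separated 2-D clusters.
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(0.0, 0.5, (50, 2)), rng.normal(5.0, 0.5, (50, 2))])
y = np.concatenate([np.zeros(50, dtype=int), np.ones(50, dtype=int)])
order = rng.permutation(len(X))  # shuffle so every fold sees both classes
X, y = X[order], y[order]

results = cross_validate(X, y, k_choices=[1, 3, 5])
best_k = max(results, key=results.get)
print(results, best_k)
```

On real data the accuracies would differ between values of k, and `best_k` would be the value to carry forward to the test set.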

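Pros and Cons of the k-Nearest Neighbor Classifier

Pros:

The idea is clear, easy to understand, and simple to implement;
Training takes essentially no time, because training only stores the training data.

Cons: testing is computationally expensive, because every test image must be compared with all of the stored training images.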

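Applying k-NN in Practice

If you want to apply a k-NN classifier in practice (preferably not on images, although that is acceptable as an exercise), proceed as follows: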

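Preprocess your data: normalize the features in your data so that they have zero mean and unit variance. We will discuss the details in later sections. This section skips normalization because the pixels in an image are homogeneous and do not show widely different distributions, so there is little need to standardize them.
If the data is high-dimensional, consider a dimensionality reduction technique such as PCA or random projections.
Split the data randomly into a training set and a validation set. As a rule of thumb, 70%-90% of the data goes into the training set. The exact ratio depends on how many hyperparameters the algorithm has and how much influence they are expected to have. If there are many hyperparameters to estimate, use a larger validation set to estimate them reliably. If you are worried that the validation set is too small, try cross-validation. If you have the compute budget, cross-validation is always the safer choice (more folds are better, but also more expensive).
Tune on the validation set: try many values of k and both the L1 and L2 distances.
If the classifier runs too slowly, try an Approximate Nearest Neighbor library (such as FLANN) to speed it up, at the cost of some accuracy.
Record the best hyperparameters. Having done so, should the algorithm with the best hyperparameters be retrained on the full training set? The question arises because folding the validation set back into the training set (which makes the training set larger) may shift the optimal hyperparameters. In practice, do not do this. Never use the validation data in the final classifier, since doing so would undermine the estimate of the best hyperparameters. Evaluate the model configured with the best hyperparameters directly on the test set, and report the resulting test accuracy as the performance of your kNN classifier on this data.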
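The first and third steps above (standardizing features and making a random train/validation split) could be sketched as follows. The helper names `train_val_split` and `standardize`, the 80% training fraction, and the random data are assumptions chosen for illustration.

```python
import numpy as np

def train_val_split(X, y, train_fraction=0.8, seed=0):
    """Randomly split (X, y) into training and validation parts."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(X.shape[0])
    num_train = int(train_fraction * X.shape[0])
    train_idx, val_idx = order[:num_train], order[num_train:]
    return X[train_idx], y[train_idx], X[val_idx], y[val_idx]

def standardize(X_train, X_other, eps=1e-8):
    """Scale features to zero mean and unit variance using training statistics only."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0) + eps  # eps guards against constant features
    return (X_train - mean) / std, (X_other - mean) / std

# Made-up data: 100 samples, 3 features on very different scales.
rng = np.random.default_rng(1)
X = rng.normal(loc=[0.0, 50.0, -3.0], scale=[1.0, 20.0, 0.1], size=(100, 3))
y = rng.integers(0, 2, size=100)

X_tr, y_tr, X_val, y_val = train_val_split(X, y, train_fraction=0.8)
X_tr, X_val = standardize(X_tr, X_val)
print(X_tr.shape, X_val.shape)  # (80, 3) (20, 3)
```

The mean and standard deviation are computed on the training part only and then reused for the validation part, so no information from the validation data leaks into preprocessing.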

Course Assignment

KNN implementation code:

import numpy as np

#http://blog.csdn.net/geekmanong/article/details/51524402
#http://www.cnblogs.com/daihengchen/p/5754383.html
#http://blog.csdn.net/han784851198/article/details/53331104
class KNearestNeighbor(object):
  """ a kNN classifier with L2 distance """

  def __init__(self):
    pass

  def train(self, X, y):
    """
    Train the classifier. For k-nearest neighbors this is just
    memorizing the training data.

    Inputs:
    - X: A numpy array of shape (num_train, D) containing the training data
      consisting of num_train samples each of dimension D.
    - y: A numpy array of shape (num_train,) containing the training labels,
      where y[i] is the label for X[i].
    """
    self.X_train = X
    self.y_train = y

  def predict(self, X, k=1, num_loops=0):
    """
    Predict labels for test data using this classifier.

    Inputs:
    - X: A numpy array of shape (num_test, D) containing test data consisting
         of num_test samples each of dimension D.
    - k: The number of nearest neighbors that vote for the predicted labels.
    - num_loops: Determines which implementation to use to compute distances
      between training points and testing points.

    Returns:
    - y: A numpy array of shape (num_test,) containing predicted labels for the
      test data, where y[i] is the predicted label for the test point X[i].
    """
    if num_loops == 0:
      dists = self.compute_distances_no_loops(X)
    elif num_loops == 1:
      dists = self.compute_distances_one_loop(X)
    elif num_loops == 2:
      dists = self.compute_distances_two_loops(X)
    else:
      raise ValueError('Invalid value %d for num_loops' % num_loops)

    return self.predict_labels(dists, k=k)

  def compute_distances_two_loops(self, X):
    """
    Compute the distance between each test point in X and each training point
    in self.X_train using a nested loop over both the training data and the
    test data.

    Inputs:
    - X: A numpy array of shape (num_test, D) containing test data.

    Returns:
    - dists: A numpy array of shape (num_test, num_train) where dists[i, j]
      is the Euclidean distance between the ith test point and the jth training
      point.
    """
    num_test = X.shape[0]
    num_train = self.X_train.shape[0]
    dists = np.zeros((num_test, num_train))
    for i in range(num_test):
      for j in range(num_train):
        #####################################################################
        # TODO:                                                             #
        # Compute the l2 distance between the ith test point and the jth    #
        # training point, and store the result in dists[i, j]. You should   #
        # not use a loop over dimension.                                    #
        #####################################################################
        test_row = X[i, :]
        train_row = self.X_train[j, :]
        dists[i, j] = np.sqrt(np.sum((test_row - train_row)**2))
        #####################################################################
        #                       END OF YOUR CODE                            #
        #####################################################################
    return dists

  def compute_distances_one_loop(self, X):
    """
    Compute the distance between each test point in X and each training point
    in self.X_train using a single loop over the test data.

    Input / Output: Same as compute_distances_two_loops
    """
    num_test = X.shape[0]
    num_train = self.X_train.shape[0]
    dists = np.zeros((num_test, num_train))
    for i in range(num_test):
      #######################################################################
      # TODO:                                                               #
      # Compute the l2 distance between the ith test point and all training #
      # points, and store the result in dists[i, :].                        #
      #######################################################################
      test_row = X[i, :]
      # Broadcasting subtracts test_row from every training row. Square the
      # differences first, then sum over the feature axis (axis=1) so that
      # each training point gets its own distance.
      dists[i, :] = np.sqrt(np.sum((test_row - self.X_train)**2, axis=1))
      #######################################################################
      #                         END OF YOUR CODE                            #
      #######################################################################
    return dists

  def compute_distances_no_loops(self, X):
    """
    Compute the distance between each test point in X and each training point
    in self.X_train using no explicit loops.

    Input / Output: Same as compute_distances_two_loops
    """
    num_test = X.shape[0]
    num_train = self.X_train.shape[0]
    dists = np.zeros((num_test, num_train))
    #########################################################################
    # TODO:                                                                 #
    # Compute the l2 distance between all test points and all training      #
    # points without using any explicit loops, and store the result in      #
    # dists.                                                                #
    #                                                                       #
    # You should implement this function using only basic array operations; #
    # in particular you should not use functions from scipy.                #
    #                                                                       #
    # HINT: Try to formulate the l2 distance using matrix multiplication    #
    #       and two broadcast sums.                                         #
    #########################################################################
    # ||x - y||^2 = ||x||^2 - 2 x.y + ||y||^2, evaluated for all pairs at once
    X_sq = np.square(X).sum(axis=1)                   # shape (num_test,)
    X_train_sq = np.square(self.X_train).sum(axis=1)  # shape (num_train,)
    sq_dists = -2 * np.dot(X, self.X_train.T) + X_train_sq + X_sq[:, np.newaxis]
    # Rounding can leave tiny negative values; clamp before the square root
    dists = np.sqrt(np.maximum(sq_dists, 0))
    #########################################################################
    #                         END OF YOUR CODE                              #
    #########################################################################
    return dists

  def predict_labels(self, dists, k=1):
    """
    Given a matrix of distances between test points and training points,
    predict a label for each test point.

    Inputs:
    - dists: A numpy array of shape (num_test, num_train) where dists[i, j]
      gives the distance between the ith test point and the jth training point.

    Returns:
    - y: A numpy array of shape (num_test,) containing predicted labels for the
      test data, where y[i] is the predicted label for the test point X[i].
    """
    num_test = dists.shape[0]
    y_pred = np.zeros(num_test)
    for i in range(num_test):
      # A list of length k storing the labels of the k nearest neighbors to
      # the ith test point.
      closest_y = []
      #########################################################################
      # TODO:                                                                 #
      # Use the distance matrix to find the k nearest neighbors of the ith    #
      # testing point, and use self.y_train to find the labels of these       #
      # neighbors. Store these labels in closest_y.                           #
      # Hint: Look up the function numpy.argsort.                             #
      # numpy.argsort returns the indices that sort the array in ascending    #
      # order.                                                                #
      #########################################################################
      closest_y = self.y_train[np.argsort(dists[i, :])[:k]]
      #########################################################################
      # TODO:                                                                 #
      # Now that you have found the labels of the k nearest neighbors, you    #
      # need to find the most common label in the list closest_y of labels.   #
      # Store this label in y_pred[i]. Break ties by choosing the smaller     #
      # label.                                                                #
      #########################################################################
      # np.bincount counts how often each non-negative integer label occurs;
      # np.argmax returns the first maximum, so ties go to the smaller label.
      y_pred[i] = np.argmax(np.bincount(closest_y))
      #########################################################################
      #                           END OF YOUR CODE                            #
      #########################################################################

    return y_pred

k_nearest_neighbor.py
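The loop-free distance computation relies on the identity ||x - y||^2 = ||x||^2 - 2 x·y + ||y||^2. A quick way to gain confidence in it is to compare it against the straightforward double loop on random data. The standalone functions below (`l2_two_loops`, `l2_no_loops`) are hypothetical restatements written for this check, not the class methods themselves.

```python
import numpy as np

def l2_two_loops(X_test, X_train):
    """Reference implementation: one distance at a time."""
    dists = np.zeros((X_test.shape[0], X_train.shape[0]))
    for i in range(X_test.shape[0]):
        for j in range(X_train.shape[0]):
            dists[i, j] = np.sqrt(np.sum((X_test[i] - X_train[j]) ** 2))
    return dists

def l2_no_loops(X_test, X_train):
    """Vectorized version using one matrix product and two broadcast sums."""
    test_sq = np.sum(X_test ** 2, axis=1)[:, np.newaxis]    # (num_test, 1)
    train_sq = np.sum(X_train ** 2, axis=1)[np.newaxis, :]  # (1, num_train)
    cross = X_test @ X_train.T                              # (num_test, num_train)
    d2 = test_sq - 2.0 * cross + train_sq
    return np.sqrt(np.maximum(d2, 0.0))  # clamp rounding noise below zero

rng = np.random.default_rng(0)
X_train = rng.normal(size=(20, 8))
X_test = rng.normal(size=(5, 8))
diff = np.linalg.norm(l2_two_loops(X_test, X_train) - l2_no_loops(X_test, X_train))
print(diff < 1e-8)  # True
```

The two results agree up to floating-point rounding, while the vectorized version avoids Python-level loops entirely.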