Chapter 3 Classification
This chapter mainly covers choosing performance measures for classification problems.
- SGDClassifier is well suited to large datasets and online learning. This classifier has the advantage of being capable of handling very large datasets efficiently, in part because SGD deals with training instances independently, one at a time (which also makes SGD well suited for online learning).
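A minimal sketch of that online-learning style with scikit-learn's `SGDClassifier.partial_fit`, feeding synthetic data in mini-batches as if it arrived from a stream (the dataset and batch size are made up for illustration):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Synthetic data standing in for a stream of training instances.
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

clf = SGDClassifier(loss="hinge", random_state=42)
classes = np.unique(y)  # partial_fit must be told all classes up front

# Feed the data in mini-batches; each call updates the model incrementally.
for start in range(0, len(X), 100):
    clf.partial_fit(X[start:start + 100], y[start:start + 100], classes=classes)

print(clf.score(X, y))  # training accuracy after one pass over the stream
```

Because each `partial_fit` call only touches one batch, the full dataset never needs to fit in memory at once.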
Performance measures:
- Accuracy is usually not a good performance measure for classifiers, especially on skewed (class-imbalanced) datasets. For example, given labels [0,0,0,0,1], a classifier that always predicts 0 already gets 80% accuracy.
- Confusion matrix: A much better way to evaluate the performance of a classifier is to look at the confusion matrix.
- Precision and recall: $\text{precision}=\frac{TP}{TP+FP}$, $\text{recall}=\frac{TP}{TP+FN}$
- $F_1$ score: a quick way to compare two classifiers. It is often convenient to combine precision and recall into a single metric called the $F_1$ score, in particular if you need a simple way to compare two classifiers. The $F_1$ score is the harmonic mean of precision and recall (Equation 3-3): $F_1 = 2\times\frac{\text{precision}\times\text{recall}}{\text{precision}+\text{recall}}$. Whereas the regular mean treats all values equally, the harmonic mean gives much more weight to low values. As a result, the classifier will only get a high $F_1$ score if both recall and precision are high.
- A larger area under the PR curve is better.
- A larger area under the ROC curve is better.
  True positive rate (sensitivity, recall): $TPR=\frac{TP}{TP+FN}$
  False positive rate: $FPR=\frac{FP}{FP+TN}$
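The measures above can all be computed with `sklearn.metrics`. A minimal sketch on the skewed [0,0,0,0,1] example from earlier, with a degenerate classifier that always predicts the majority class (the data is made up for illustration):

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score, f1_score)

y_true = np.array([0, 0, 0, 0, 1])
y_pred = np.zeros(5, dtype=int)  # "always predict 0"

print(accuracy_score(y_true, y_pred))   # 0.8 despite learning nothing
print(confusion_matrix(y_true, y_pred))  # [[4 0] [1 0]]: TN=4, FN=1, no TP/FP
# zero_division=0 avoids a warning when there are no positive predictions
print(precision_score(y_true, y_pred, zero_division=0))  # 0.0
print(recall_score(y_true, y_pred))  # 0.0: the single positive is missed
print(f1_score(y_true, y_pred))      # 0.0: harmonic mean punishes the zeros
```

The high accuracy next to zero precision/recall/$F_1$ is exactly why accuracy alone is misleading on skewed data.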
Precision/Recall Tradeoff
- Unfortunately, you can’t have it both ways: increasing precision reduces recall, and vice versa. This is called the precision/recall tradeoff. Precision and recall are a pair of conflicting measures: in general, when precision is high, recall tends to be low, and vice versa. For example, if you want to pick out as many good melons as possible, you can simply select more melons; if you select every melon, all the good ones are certainly included, but precision will be low. Conversely, if you want the proportion of good melons among those selected to be as high as possible, you can pick only the melons you are most confident about, but then many good melons will inevitably be missed, so recall will be low. Usually only in simple tasks can both be high.
- sklearn classifiers can output a "confidence score" for each instance, via either decision_function() or predict_proba(). A threshold then splits them: instances scoring above the threshold are classified positive, those below it negative. Raising the threshold increases precision and lowers the false positive rate, but reduces recall (sensitivity, the true positive rate); lowering it does the opposite.
- When should you use ROC, and when Precision-Recall?
  As a rule of thumb, Precision-Recall is preferable when the classes are imbalanced; otherwise, ROC is the more common choice.
- Why does precision jitter as it approaches 1?
  Answer: As the threshold is gradually raised, precision increases overall, but it can drop locally. For example, in the figure, moving the threshold one instance to the right drops precision from 4/5 to 3/4.
  (Figure: Precision and recall versus the decision threshold — PR curve)
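The non-monotonic precision can be reproduced with `sklearn.metrics.precision_recall_curve`. The scores and labels below are hand-crafted (not from the book) so that a negative instance sits among the high-scoring ones, which causes exactly the 4/5 → 3/4 dip described above:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Hand-crafted example: labels ordered by ascending classifier score.
# The 0 at score 0.5 is a negative instance ranked above some positives.
y_true = np.array([0, 0, 1, 1, 0, 1, 1, 1])
scores = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8])

precisions, recalls, thresholds = precision_recall_curve(y_true, scores)
print(precisions)  # rises overall but dips from 4/5 to 3/4 mid-sweep
print(recalls)     # monotonically non-increasing as the threshold rises
```

Recall can only fall as the threshold rises (positives can only be lost), but precision depends on the mix of instances remaining above the threshold, so it can temporarily drop, which is why the precision curve is bumpier than the recall curve.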