Chapter 3 Classification
This chapter mainly covers choosing performance measures for classification problems.
- SGDClassifier is well suited to large datasets and online learning. This classifier has the advantage of being capable of handling very large datasets efficiently, in part because SGD deals with training instances independently, one at a time (which also makes SGD well suited for online learning).
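A minimal sketch of that online-learning style with scikit-learn's `SGDClassifier.partial_fit`, feeding synthetic data in mini-batches as if it arrived from a stream (the dataset and batch size are made up for illustration):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Synthetic data standing in for a stream of training instances.
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

clf = SGDClassifier(loss="hinge", random_state=42)
classes = np.unique(y)  # partial_fit must be told all classes up front

# Feed the data in mini-batches; each call updates the model incrementally.
for start in range(0, len(X), 100):
    clf.partial_fit(X[start:start + 100], y[start:start + 100], classes=classes)

print(clf.score(X, y))  # training accuracy after one pass over the stream
```

Because each `partial_fit` call only touches one batch, the full dataset never needs to fit in memory at once.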
Performance measures:
- Accuracy is usually not a good performance measure for classifiers, especially on skewed (class-imbalanced) datasets. For example, given labels [0,0,0,0,1], a classifier that always predicts 0 already gets 80% accuracy.
- Confusion matrix: A much better way to evaluate the performance of a classifier is to look at the confusion matrix.
- Precision and recall: $\text{precision}=\frac{TP}{TP+FP}$, $\text{recall}=\frac{TP}{TP+FN}$
- $F_1$ score: a quick way to compare two classifiers. It is often convenient to combine precision and recall into a single metric called the $F_1$ score, in particular if you need a simple way to compare two classifiers. The $F_1$ score is the harmonic mean of precision and recall (Equation 3-3): $F_1 = 2\times\frac{\text{precision}\times\text{recall}}{\text{precision}+\text{recall}}$. Whereas the regular mean treats all values equally, the harmonic mean gives much more weight to low values. As a result, the classifier will only get a high $F_1$ score if both recall and precision are high.
- A larger area under the PR curve is better.
- A larger area under the ROC curve is better.
  True positive rate (sensitivity, recall): $TPR=\frac{TP}{TP+FN}$
  False positive rate: $FPR=\frac{FP}{FP+TN}$
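The measures above can all be computed with `sklearn.metrics`. A minimal sketch on the skewed [0,0,0,0,1] example from earlier, with a degenerate classifier that always predicts the majority class (the data is made up for illustration):

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score, f1_score)

y_true = np.array([0, 0, 0, 0, 1])
y_pred = np.zeros(5, dtype=int)  # "always predict 0"

print(accuracy_score(y_true, y_pred))   # 0.8 despite learning nothing
print(confusion_matrix(y_true, y_pred))  # [[4 0] [1 0]]: TN=4, FN=1, no TP/FP
# zero_division=0 avoids a warning when there are no positive predictions
print(precision_score(y_true, y_pred, zero_division=0))  # 0.0
print(recall_score(y_true, y_pred))  # 0.0: the single positive is missed
print(f1_score(y_true, y_pred))      # 0.0: harmonic mean punishes the zeros
```

The high accuracy next to zero precision/recall/$F_1$ is exactly why accuracy alone is misleading on skewed data.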
Precision/Recall Tradeoff
- Unfortunately, you can’t have it both ways: increasing precision reduces recall, and vice versa. This is called the precision/recall tradeoff. Precision and recall are a pair of conflicting measures: in general, when precision is high, recall tends to be low, and vice versa. For example, if you want to pick out as many good melons as possible, you can simply select more melons; if you select every melon, all the good ones are certainly included, but precision will be low. Conversely, if you want the proportion of good melons among those selected to be as high as possible, you can pick only the melons you are most confident about, but then many good melons will inevitably be missed, so recall will be low. Usually only in simple tasks can both be high.
- sklearn classifiers can output a "confidence score" for each instance, via either decision_function() or predict_proba(). A threshold then splits them: instances scoring above the threshold are classified positive, those below it negative. Raising the threshold increases precision and lowers the false positive rate, but reduces recall (sensitivity, the true positive rate); lowering it does the opposite.
- When should you use ROC, and when Precision-Recall?
  As a rule of thumb, Precision-Recall is preferable when the classes are imbalanced; otherwise, ROC is the more common choice.
- Why does precision jitter as it approaches 1?
  Answer: As the threshold is gradually raised, precision increases overall, but it can drop locally. For example, in the figure, moving the threshold one instance to the right drops precision from 4/5 to 3/4.
  (Figure: Precision and recall versus the decision threshold — PR curve)
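The non-monotonic precision can be reproduced with `sklearn.metrics.precision_recall_curve`. The scores and labels below are hand-crafted (not from the book) so that a negative instance sits among the high-scoring ones, which causes exactly the 4/5 → 3/4 dip described above:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Hand-crafted example: labels ordered by ascending classifier score.
# The 0 at score 0.5 is a negative instance ranked above some positives.
y_true = np.array([0, 0, 1, 1, 0, 1, 1, 1])
scores = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8])

precisions, recalls, thresholds = precision_recall_curve(y_true, scores)
print(precisions)  # rises overall but dips from 4/5 to 3/4 mid-sweep
print(recalls)     # monotonically non-increasing as the threshold rises
```

Recall can only fall as the threshold rises (positives can only be lost), but precision depends on the mix of instances remaining above the threshold, so it can temporarily drop, which is why the precision curve is bumpier than the recall curve.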