Chapter 3 Classification

These notes mainly cover how to choose performance measures for classification problems.

  1. SGDClassifier is well suited to large datasets and online learning. This classifier has the advantage of being capable of handling very large datasets efficiently. This is in part because SGD deals with training instances independently, one at a time (which also makes SGD well suited for online learning).
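
To make "one instance at a time" concrete, here is a minimal pure-Python sketch of an online linear classifier trained by SGD on the hinge loss. It is not scikit-learn's implementation: the class `OnlineLinearClassifier`, its single-instance `partial_fit`, the learning rate, and the toy stream are all assumptions made for illustration (scikit-learn's `SGDClassifier.partial_fit` takes mini-batches and also applies regularization).

```python
class OnlineLinearClassifier:
    """Hypothetical minimal linear classifier trained by SGD on the hinge loss."""

    def __init__(self, n_features, learning_rate=0.1):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.lr = learning_rate

    def decision_function(self, x):
        # Signed score: positive means "positive class", larger means more confident.
        return sum(wi * xi for wi, xi in zip(self.w, x)) + self.b

    def partial_fit(self, x, y):
        """Update the model from a single instance; y must be +1 or -1."""
        if y * self.decision_function(x) < 1:  # misclassified or inside the margin
            self.w = [wi + self.lr * y * xi for wi, xi in zip(self.w, x)]
            self.b += self.lr * y

    def predict(self, x):
        return 1 if self.decision_function(x) >= 0 else -1


# Toy, linearly separable stream of (features, label) pairs (made-up data).
stream = [([2.0, 1.0], 1), ([1.5, 2.0], 1), ([-1.0, -1.5], -1), ([-2.0, -0.5], -1)]

clf = OnlineLinearClassifier(n_features=2)
for _ in range(10):            # a few passes over the stream
    for x, y in stream:        # each update sees exactly one instance
        clf.partial_fit(x, y)

predictions = [clf.predict(x) for x, _ in stream]
print(predictions)
```

Because each update touches only the current instance, the full dataset never has to fit in memory, and new instances can be fed in as they arrive.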

Performance measures:

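  • accuracy: usually not a good measure for classification problems, especially when the classes are imbalanced. For example, with labels [0,0,0,0,1], a classifier that always predicts 0 already reaches 80% accuracy.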
  • confusion matrix: A much better way to evaluate the performance of a classifier is to look at the confusion matrix.
  • Precision and recall: $\text{precision} = \frac{TP}{TP+FP}$, $\text{recall} = \frac{TP}{TP+FN}$
  • $F_1$ score: useful for quickly comparing two classifiers. It is often convenient to combine precision and recall into a single metric called the F1 score, in particular if you need a simple way to compare two classifiers. The F1 score is the harmonic mean of precision and recall (Equation 3-3): $F_1 = 2 \times \frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}}$. Whereas the regular mean treats all values equally, the harmonic mean gives much more weight to low values. As a result, the classifier will only get a high F1 score if both recall and precision are high.
  • PR curve: the larger the area under the curve, the better.
  • ROC curve: the larger the area under the curve, the better.
    True positive rate (sensitivity, recall): $TPR = \frac{TP}{TP+FN}$
    False positive rate: $FPR = \frac{FP}{FP+TN}$
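
The measures above can all be derived from the four confusion-matrix counts. The following is a minimal pure-Python sketch, not a library API: the helper names `confusion_counts`, `ratio`, and `metrics` are made up for illustration, and returning 0.0 when a denominator is zero is an assumed convention. In practice, `confusion_matrix`, `precision_score`, `recall_score`, and `f1_score` from `sklearn.metrics` compute the same quantities.

```python
def confusion_counts(y_true, y_pred):
    """Count (TP, FP, FN, TN) for binary labels where 1 is the positive class."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == 1 and truth == 1:
            tp += 1
        elif pred == 1 and truth == 0:
            fp += 1
        elif pred == 0 and truth == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def ratio(numerator, denominator):
    """Divide, returning 0.0 when the denominator is 0 (an assumed convention)."""
    return numerator / denominator if denominator else 0.0


def metrics(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)  # also the true positive rate (TPR)
    return {
        "accuracy": ratio(tp + tn, tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1": ratio(2 * precision * recall, precision + recall),
        "fpr": ratio(fp, fp + tn),
    }


# The imbalanced example from the text: a classifier that always predicts 0.
always_zero = metrics([0, 0, 0, 0, 1], [0, 0, 0, 0, 0])
print(always_zero)

# A hypothetical cautious classifier: perfect precision, poor recall.
cautious = metrics([1, 1, 1, 1, 1, 0, 0, 0, 0, 0],
                   [1, 0, 0, 0, 0, 0, 0, 0, 0, 0])
print(cautious)
```

In the first case accuracy is 0.8 even though recall and F1 are both 0, which is why accuracy misleads on imbalanced data. In the second case the arithmetic mean of precision (1.0) and recall (0.2) would be 0.6, while F1 is only about 0.33, showing how the harmonic mean is pulled toward the lower value.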

Precision/Recall Tradeoff

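  • Unfortunately, you can’t have it both ways: increasing precision reduces recall, and vice versa. This is called the precision/recall tradeoff. Precision and recall are conflicting measures: in general, when precision is high, recall tends to be low, and vice versa. For example, if you want to pick out as many good melons as possible, you can simply pick more melons; if you pick every melon, all the good ones are certainly included, but precision will be low. If instead you want the share of good melons among those you pick to be as high as possible, you can pick only the melons you are most sure about, but then you will inevitably miss many good melons, so recall will be low. Usually both can be high only on simple tasks.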
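  • sklearn classifiers can output a confidence score for each instance, some through decision_function() and others through predict_proba(). A threshold is then applied: instances scoring above the threshold are classified as positive, and those below it as negative. Raising the threshold lowers recall (sensitivity, the true positive rate) and also lowers the false positive rate, while precision generally rises. Lowering the threshold does the opposite.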
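  • When should you use the ROC curve, and when the precision/recall curve?
    As a rule of thumb, prefer the precision/recall curve when the classes are imbalanced, because with few positives the ROC curve can look overly optimistic (the false positive rate is diluted by the many true negatives). Otherwise, the ROC curve is the more common choice.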
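  • Why is precision bumpy as it approaches 1?
    [Figure: Precision and recall versus the decision threshold]
    Answer: as the threshold is raised step by step, precision rises overall but can drop locally. For example, with instances ranked by score, moving a threshold in the middle of the ranking one instance to the right can take precision from 4/5 down to 3/4, because the instance that was just excluded was a true positive. Recall, by contrast, can only fall or stay flat as the threshold rises.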
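
The effect of the threshold can be checked with a small sweep. The sketch below uses a made-up ranking of 12 instances, chosen so that one step reproduces the 4/5 to 3/4 drop in precision described above; the data, the integer "scores", and the helper `rates_at` are all assumptions for illustration. With a real model, `precision_recall_curve` and `roc_curve` from `sklearn.metrics` perform this kind of sweep over the scores returned by decision_function() or predict_proba().

```python
# Toy data (hypothetical): instances sorted by classifier score, lowest first.
# True marks a positive instance; the score is simply the rank position.
labels = [False, False, False, False, True, False,
          True, True, False, True, True, True]
scores = list(range(len(labels)))


def rates_at(threshold):
    """Return (precision, recall, fpr) when score >= threshold means positive."""
    predicted = [s >= threshold for s in scores]
    tp = sum(p and y for p, y in zip(predicted, labels))
    fp = sum(p and not y for p, y in zip(predicted, labels))
    fn = sum((not p) and y for p, y in zip(predicted, labels))
    tn = sum((not p) and (not y) for p, y in zip(predicted, labels))
    # Assumed convention: precision is 1.0 when nothing is predicted positive.
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn)
    fpr = fp / (fp + tn)
    return precision, recall, fpr


for threshold in range(4, 10):
    precision, recall, fpr = rates_at(threshold)
    print(f"threshold={threshold}  precision={precision:.2f}  "
          f"recall={recall:.2f}  FPR={fpr:.2f}")
```

On this toy ranking, recall and the false positive rate never increase as the threshold rises, while precision moves up and down: it is 4/5 at threshold 7, drops to 3/4 at threshold 8, and reaches 1 at threshold 9.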
