Constructing a confusion matrix from data without sklearn (answers)

【Question title】: Constructing a confusion matrix from data without sklearn
【Posted】: 2020-07-26 08:12:15
【Question】:

I am trying to build a confusion matrix without using the sklearn library, but I can't get the matrix to come out right. Here is my code:

def comp_confmat():
    currentDataClass = [1,3,3,2,5,5,3,2,1,4,3,2,1,1,2]    
    predictedClass = [1,2,3,4,2,3,3,2,1,2,3,1,5,1,1]
    cm = []
    classes = int(max(currentDataClass) - min(currentDataClass)) + 1 #find number of classes

    for c1 in range(1,classes+1):#for every true class
        counts = []
        for c2 in range(1,classes+1):#for every predicted class
            count = 0
            for p in range(len(currentDataClass)):
                if currentDataClass[p] == predictedClass[p]:
                    count += 1
            counts.append(count)
        cm.append(counts)
    print(np.reshape(cm,(classes,classes)))

But this returns:

[[7 7 7 7 7]
[7 7 7 7 7]
[7 7 7 7 7]
[7 7 7 7 7]
[7 7 7 7 7]]

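I don't understand why every iteration produces 7, given that I reset count each time and the loops run over different values.

This is what I should get (using sklearn's confusion_matrix function):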

[[3 0 0 0 1]
[2 1 0 1 0]
[0 1 3 0 0]
[0 1 0 0 0]
[0 1 1 0 0]]

【Comments on the question】:

Tags: python machine-learning confusion-matrix

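【Solution 1】:

You can derive the confusion matrix by counting the number of instances in each combination of actual and predicted class, as follows: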

import numpy as np

def comp_confmat(actual, predicted):

    # convert to arrays so the element-wise comparisons below are explicit
    actual = np.asarray(actual)
    predicted = np.asarray(predicted)

    # extract the different classes (only labels that occur in `actual`)
    classes = np.unique(actual)

    # initialize the confusion matrix (float zeros, hence the "3." in the output)
    confmat = np.zeros((len(classes), len(classes)))

    # loop across the different combinations of actual / predicted classes
    for i in range(len(classes)):
        for j in range(len(classes)):

            # count the number of instances in each combination of actual / predicted classes
            confmat[i, j] = np.sum((actual == classes[i]) & (predicted == classes[j]))

    return confmat

# sample data
actual = [1, 3, 3, 2, 5, 5, 3, 2, 1, 4, 3, 2, 1, 1, 2]
predicted = [1, 2, 3, 4, 2, 3, 3, 2, 1, 2, 3, 1, 5, 1, 1]

# confusion matrix
print(comp_confmat(actual, predicted))
# [[3. 0. 0. 0. 1.]
#  [2. 1. 0. 1. 0.]
#  [0. 1. 3. 0. 0.]
#  [0. 1. 0. 0. 0.]
#  [0. 1. 1. 0. 0.]]
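
As a variation on the same counting idea, the double loop can be replaced by one accumulation step. The following is only a sketch: the function name comp_confmat_vectorized is made up for illustration, and it assumes the labels are sortable values that np.unique can handle. Unlike the version above, it also covers labels that appear only in the predictions and returns integer counts.

```python
import numpy as np

def comp_confmat_vectorized(actual, predicted):
    # hypothetical helper name, not part of the original answer
    actual = np.asarray(actual)
    predicted = np.asarray(predicted)

    # sorted labels seen in either array; `inverse` maps every value
    # to its position in `classes`
    classes, inverse = np.unique(
        np.concatenate([actual, predicted]), return_inverse=True
    )
    rows = inverse[:len(actual)]   # row index = actual class
    cols = inverse[len(actual):]   # column index = predicted class

    confmat = np.zeros((len(classes), len(classes)), dtype=int)
    # unbuffered add, so repeated (row, col) pairs are all counted
    np.add.at(confmat, (rows, cols), 1)
    return classes, confmat

actual = [1, 3, 3, 2, 5, 5, 3, 2, 1, 4, 3, 2, 1, 1, 2]
predicted = [1, 2, 3, 4, 2, 3, 3, 2, 1, 2, 3, 1, 5, 1, 1]

classes, confmat = comp_confmat_vectorized(actual, predicted)
print(classes)
print(confmat)
```

Returning classes alongside the matrix makes it clear which label each row and column refers to.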

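【Comments】:

【Solution 2】:

Your innermost loop needs a case distinction. At the moment it counts every position where the true and predicted class agree, regardless of c1 and c2; there are 7 such positions in your data, which is why every cell is 7. What you want to count for cell (c1, c2) is the positions where the true class is c1 and the predicted class is c2.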
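
Applied to the code from the question, that fix might look like the sketch below. Only the condition in the innermost loop is changed; the numpy import (which the question's snippet does not show) and the return value are added here so the function can be run and checked on its own.

```python
import numpy as np

def comp_confmat():
    currentDataClass = [1, 3, 3, 2, 5, 5, 3, 2, 1, 4, 3, 2, 1, 1, 2]
    predictedClass = [1, 2, 3, 4, 2, 3, 3, 2, 1, 2, 3, 1, 5, 1, 1]
    cm = []
    # number of classes, assuming labels are consecutive integers starting at 1
    classes = int(max(currentDataClass) - min(currentDataClass)) + 1

    for c1 in range(1, classes + 1):        # for every true class
        counts = []
        for c2 in range(1, classes + 1):    # for every predicted class
            count = 0
            for p in range(len(currentDataClass)):
                # count only samples whose true class is c1 AND predicted class is c2
                if currentDataClass[p] == c1 and predictedClass[p] == c2:
                    count += 1
            counts.append(count)
        cm.append(counts)
    return np.array(cm)

print(comp_confmat())
# [[3 0 0 0 1]
#  [2 1 0 1 0]
#  [0 1 3 0 0]
#  [0 1 0 0 0]
#  [0 1 1 0 0]]
```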

Here is another way, using nested list comprehensions:

currentDataClass = [1,3,3,2,5,5,3,2,1,4,3,2,1,1,2]    
predictedClass = [1,2,3,4,2,3,3,2,1,2,3,1,5,1,1]

classes = int(max(currentDataClass) - min(currentDataClass)) + 1 #find number of classes

counts = [[sum([(currentDataClass[i] == true_class) and (predictedClass[i] == pred_class) 
                for i in range(len(currentDataClass))])
           for pred_class in range(1, classes + 1)] 
           for true_class in range(1, classes + 1)]
counts

[[3, 0, 0, 0, 1],
 [2, 1, 0, 1, 0],
 [0, 1, 3, 0, 0],
 [0, 1, 0, 0, 0],
 [0, 1, 1, 0, 0]]
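
Two properties of the comprehension above are worth keeping in mind: it rescans the whole data set once per cell, and max - min + 1 only gives the right number of classes when the labels are consecutive integers starting at 1. A single-pass alternative with collections.Counter could look like the sketch below. The function name confusion_counts is made up for illustration, and the labels are assumed to be hashable and sortable.

```python
from collections import Counter

def confusion_counts(true_labels, pred_labels):
    # hypothetical helper name, not part of the original answer
    # every label seen in either list, in sorted order
    labels = sorted(set(true_labels) | set(pred_labels))
    # one pass over the data: count each (true, predicted) pair
    pair_counts = Counter(zip(true_labels, pred_labels))
    # a Counter returns 0 for pairs that never occurred
    matrix = [[pair_counts[(t, p)] for p in labels] for t in labels]
    return labels, matrix

currentDataClass = [1, 3, 3, 2, 5, 5, 3, 2, 1, 4, 3, 2, 1, 1, 2]
predictedClass = [1, 2, 3, 4, 2, 3, 3, 2, 1, 2, 3, 1, 5, 1, 1]

labels, counts = confusion_counts(currentDataClass, predictedClass)
print(labels)
print(counts)
```

Because the labels are taken from the data rather than from a numeric range, this also works for string labels such as "cat" and "dog".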

【Comments】:

【Solution 3】:

Here is my solution using numpy and pandas:

import numpy as np
import pandas as pd

true_classes = [1, 3, 3, 2, 5, 5, 3, 2, 1, 4, 3, 2, 1, 1, 2]
predicted_classes = [1, 2, 3, 4, 2, 3, 3, 2, 1, 2, 3, 1, 5, 1, 1]

classes = sorted(set(true_classes))  # sorted list: deterministic row/column order
number_of_classes = len(classes)

conf_matrix = pd.DataFrame(
    np.zeros((number_of_classes, number_of_classes),dtype=int),
    index=classes,
    columns=classes)

for true_label, prediction in zip(true_classes ,predicted_classes):
    # Each pair of (true_label, prediction) is a position in the confusion matrix (row, column)
    # Basically here we are counting how many times we have each pair.
    # The counting will be placed at the matrix index (true_label/row, prediction/column)
 
    conf_matrix.loc[true_label, prediction] += 1

print(conf_matrix.values)

[[3 0 0 0 1]
 [2 1 0 1 0]
 [0 1 3 0 0]
 [0 1 0 0 0]
 [0 1 1 0 0]]
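
Since pandas is already in use here, the counting loop can also be handed to pandas.crosstab, which tabulates how often each (row value, column value) pair occurs. This is a sketch of an alternative, not part of the original answer. The reindex step is an added precaution so that the table stays square even if some label never occurs as a prediction or as a true class.

```python
import pandas as pd

true_classes = [1, 3, 3, 2, 5, 5, 3, 2, 1, 4, 3, 2, 1, 1, 2]
predicted_classes = [1, 2, 3, 4, 2, 3, 3, 2, 1, 2, 3, 1, 5, 1, 1]

# every label seen in either list, in sorted order
labels = sorted(set(true_classes) | set(predicted_classes))

# rows = true class, columns = predicted class
conf_matrix = pd.crosstab(
    pd.Series(true_classes, name="actual"),
    pd.Series(predicted_classes, name="predicted"),
)

# make sure every label has both a row and a column, filling gaps with 0
conf_matrix = conf_matrix.reindex(index=labels, columns=labels, fill_value=0)

print(conf_matrix.values)
```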

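【Comments】:

Awesome, could you explain how the for loop part works?
Hi @DarkstarDream, I updated the answer with more descriptive variable names and some comments in the for loop. Tell me if you understand...
Yes, that makes sense, thanks for helping me.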