(Data Mining - Introduction - 6) Ten-Fold Cross-Validation and K-Nearest Neighbors

Main contents:

1. Ten-fold cross-validation

2. Confusion matrix

3. K-nearest neighbors

4. Python implementation

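I. Ten-Fold Cross-Validation

Earlier posts described splitting a dataset into a training set, used to fit the model, and a test set, used to judge how good the model is. But can a single test run measure a model's performance well?

No. A result from a single test set depends heavily on which samples happened to land in it. This post therefore introduces a more reliable way to evaluate a model (a classifier, for example): ten-fold cross-validation (10-fold cross validation).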

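What is ten-fold cross-validation?

Suppose we have a dataset and need to build a classifier. How do we verify the classifier's performance?

Randomly split the dataset into 10 equal parts. Take each part in turn as the test set and the other 9 as the training set, classify the test set with the model trained on those 9 parts, and record the results. After repeating this for all 10 parts, combine the classification results to get a fairly stable estimate of performance. (Because the split is random, the exact numbers still vary from run to run.)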

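Note: any k can be used (k-fold cross-validation). The extreme case is leave-one-out cross-validation, where each round holds out a single sample as the test set. A dataset of n samples then requires training n models, which is expensive for large datasets.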
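The splitting procedure described above can be sketched in a few lines. This is only an illustration: the function name `k_fold_splits` and its `seed` parameter are made up for this example and are not part of the code later in this post. Setting `k` equal to the number of samples gives leave-one-out cross-validation.

```python
import random

def k_fold_splits(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) once for each of the k folds.

    Hypothetical helper for illustration; indices refer to rows of a dataset.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # fold i takes every k-th shuffled index starting at position i,
    # so fold sizes differ by at most one sample
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

# 25 samples split into 10 folds: each sample is tested exactly once
splits = list(k_fold_splits(25, k=10))
```

Each sample appears in exactly one test set, so the combined results of the 10 rounds cover the whole dataset.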

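II. Confusion Matrix

English term: confusion matrix

Suppose there are n classes. The classification results can then be tallied in an n*n matrix, the confusion matrix, in which each cell counts the samples of one actual class that were assigned to one predicted class.

The entries on the diagonal are the numbers of correctly classified samples.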
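As a small illustration, a confusion matrix can be built from two parallel lists of labels. The function names and the toy labels below are invented for this sketch, and putting actual classes on the rows and predicted classes on the columns is an assumed convention (the transposed layout is just as valid).

```python
def confusion_matrix(actual, predicted, labels):
    """Return an n x n nested list: rows = actual class, columns = predicted class."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix

def accuracy(matrix):
    """Fraction of samples on the diagonal, i.e. classified correctly."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total if total else 0.0

# toy labels, made up for the example
actual = ["a", "a", "b", "b", "c", "c"]
predicted = ["a", "b", "b", "b", "c", "a"]
cm = confusion_matrix(actual, predicted, ["a", "b", "c"])
```

In ten-fold cross-validation, the counts from all 10 test folds would be accumulated into one such matrix before computing the accuracy.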

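III. K-Nearest Neighbors (KNN)

K-nearest neighbors already came up in the post on collaborative filtering: pick the K samples closest to a given sample, and use those K samples to decide that sample's value or class.

If the target is a continuous value, KNN works as a regression method: the value is estimated from a weighted combination of the K samples' values (for example, weighted by distance);

If the target is discrete, KNN works as a classification method: the class is decided by a majority vote among the K samples.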
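Both uses of KNN can be sketched as follows. This is a minimal version under stated assumptions: Euclidean distance, inverse-distance weights for regression, and a training set given as a list of (vector, target) pairs. All function names here are hypothetical and chosen only for the example.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def k_nearest(train, query, k):
    """Return the k (distance, target) pairs closest to query.

    train is a list of (vector, target) pairs.
    """
    pairs = [(euclidean(vec, query), target) for vec, target in train]
    return sorted(pairs, key=lambda pair: pair[0])[:k]

def knn_classify(train, query, k=3):
    """Classification: majority vote among the k nearest samples."""
    votes = Counter(target for _, target in k_nearest(train, query, k))
    return votes.most_common(1)[0][0]

def knn_regress(train, query, k=3):
    """Regression: inverse-distance weighted average of the k nearest targets."""
    neighbors = k_nearest(train, query, k)
    # the small constant avoids division by zero when query equals a training point
    weights = [1.0 / (dist + 1e-9) for dist, _ in neighbors]
    weighted_sum = sum(w * target for w, (_, target) in zip(weights, neighbors))
    return weighted_sum / sum(weights)

# toy data, made up for the example
points = [((0, 0), "x"), ((0, 1), "x"), ((5, 5), "y"), ((6, 5), "y"), ((5, 6), "y")]
label = knn_classify(points, (0.2, 0.1), k=3)

values = [((1.0,), 10.0), ((2.0,), 20.0), ((10.0,), 100.0)]
estimate = knn_regress(values, (1.0,), k=2)
```

In practice the features are usually normalized first, since a column with a large numeric range would otherwise dominate the distance.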

IV. Python Implementation

Datasets:

mpgData.zip

pimaSmall.zip

pima.zip

Code:

1. Splitting the data

# divide data into 10 buckets
import random

def buckets(filename, bucketName, separator, classColumn):
    """the original data is in the file named filename
    bucketName is the prefix for all the bucket names
    separator is the character that divides the columns
    (for ex., a tab or comma) and classColumn is the
    zero-based index of the column that indicates the class"""

    # put the data in 10 buckets
    numberOfBuckets = 10
    data = {}
    # first read in the data and divide by category
    with open(filename) as f:
        lines = f.readlines()
    for line in lines:
        # skip blank lines, which have no class column
        if not line.strip():
            continue
        # make sure every line ends with a newline, so that lines do not
        # run together when they are written to a bucket file
        line = line.rstrip('\n') + '\n'
        if separator != '\t':
            line = line.replace(separator, '\t')
        # first get the category
        category = line.rstrip('\n').split('\t')[classColumn].strip()
        data.setdefault(category, []).append(line)
    # initialize the buckets
    buckets = [[] for i in range(numberOfBuckets)]
    # now for each category put the data into the buckets, so that every
    # bucket gets roughly the same proportion of each class
    for k in data.keys():
        # randomize order of instances for each class
        random.shuffle(data[k])
        bNum = 0
        # divide into buckets
        for item in data[k]:
            buckets[bNum].append(item)
            bNum = (bNum + 1) % numberOfBuckets

    # write to file: bucketName-01 ... bucketName-10
    for bNum in range(numberOfBuckets):
        with open("%s-%02i" % (bucketName, bNum + 1), 'w') as f:
            for item in buckets[bNum]:
                f.write(item)

# example of how to use this code
buckets("pimaSmall.txt", 'pimaSmall', ',', 8)
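Once the bucket files exist, one round of cross-validation needs one bucket as the test set and the other nine as the training set. The helper below is a sketch of that step, not part of the original code: the name `load_fold` is hypothetical, and it assumes the tab-separated `prefix-01` ... `prefix-10` files written by `buckets` above. The demo writes throwaway bucket files to a temporary directory so that it runs on its own.

```python
import os
import tempfile

def load_fold(bucket_prefix, test_bucket, number_of_buckets=10):
    """Return (train_rows, test_rows) for one cross-validation round.

    Bucket files are expected to be named '<prefix>-01' ... '<prefix>-10'
    and to contain tab-separated columns.
    """
    train, test = [], []
    for i in range(1, number_of_buckets + 1):
        with open("%s-%02i" % (bucket_prefix, i)) as f:
            rows = [line.rstrip("\n").split("\t") for line in f if line.strip()]
        if i == test_bucket:
            test.extend(rows)
        else:
            train.extend(rows)
    return train, test

# demo with throwaway bucket files (one made-up row per bucket)
demo_dir = tempfile.mkdtemp()
demo_prefix = os.path.join(demo_dir, "demo")
for i in range(1, 11):
    with open("%s-%02i" % (demo_prefix, i), "w") as f:
        f.write("%d\t%d\tclass%d\n" % (i, i * 2, i % 2))

train_rows, test_rows = load_fold(demo_prefix, test_bucket=3)
```

Calling `load_fold` with `test_bucket` running from 1 to 10 gives the ten train/test splits.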
