Density-Based Clustering: DBSCAN

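DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a representative density-based clustering algorithm. Unlike partitioning and hierarchical methods, it defines a cluster as a maximal set of density-connected points. It can turn any region of sufficiently high density into a cluster, and it can discover clusters of arbitrary shape in spatial databases that contain noise.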

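Key definitions in DBSCAN:

ε-neighborhood: the region within radius ε of a given object is called the ε-neighborhood of that object.

Core object: if the ε-neighborhood of an object contains at least MinPts sample points (the object itself included), the object is a core object.

Directly density-reachable: for a sample set D, if point q lies in the ε-neighborhood of p and p is a core object, then q is directly density-reachable from p.

Density-reachable: for a sample set D, given a chain of points p1, p2, ..., pn with p = p1 and q = pn, if every pi is directly density-reachable from pi-1 (for i = 2, ..., n), then q is density-reachable from p.

Density-connected: if there is a point o in D from which both p and q are density-reachable, then p and q are density-connected.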

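Density-reachability is therefore the transitive closure of direct density-reachability, and it is asymmetric: a non-core point can be reached from a core object, but nothing is density-reachable from a non-core point. Density-connectivity, by contrast, is symmetric. The goal of DBSCAN is to find maximal sets of density-connected objects.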
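The definitions above translate almost directly into code. The following is a minimal sketch, assuming 2-D points stored as tuples and Euclidean distance. The function names (`eps_neighborhood`, `is_core`, `directly_reachable`) and the sample coordinates are illustrative and do not come from the original text.

```python
import math


def eps_neighborhood(points, p, eps):
    """Return all points within distance eps of p (p itself included)."""
    return [q for q in points if math.dist(p, q) <= eps]


def is_core(points, p, eps, min_pts):
    """A core object has at least min_pts points in its eps-neighborhood."""
    return len(eps_neighborhood(points, p, eps)) >= min_pts


def directly_reachable(points, q, p, eps, min_pts):
    """q is directly density-reachable from p if p is core and q is in p's eps-neighborhood."""
    return is_core(points, p, eps, min_pts) and math.dist(p, q) <= eps


# Hypothetical sample: three points close together and one point off to the side.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (2.0, 0.0)]
eps, min_pts = 1.0, 3
core = (1.0, 0.0)    # neighborhood: (0,0), (1,0), (2,0) -> 3 points, so it is core
border = (2.0, 0.0)  # neighborhood: (1,0), (2,0) -> 2 points, so it is not core

print(directly_reachable(points, border, core, eps, min_pts))  # True
print(directly_reachable(points, core, border, eps, min_pts))  # False
```

The two calls differ only in the order of the arguments, which shows the asymmetry: the relation holds from the core object to the border point, but not the other way round.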

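Example: let ε = 3 and MinPts = 3, and suppose the ε-neighborhoods are p: {m, p, p1, p2, o}, m: {m, q, p, m1, m2}, q: {q, m}, o: {o, p, s}, and s: {o, s, s1}.

Then the core objects are p, m, o and s. (q is not a core object, because its ε-neighborhood contains only 2 points, fewer than MinPts = 3.)

Point m is directly density-reachable from p, because m lies in the ε-neighborhood of p and p is a core object.

Point q is density-reachable from p, because q is directly density-reachable from m and m is directly density-reachable from p.

Points q and s are density-connected, because both are density-reachable from p (s via the chain p, o, s).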
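The example can be checked mechanically. The sketch below encodes the ε-neighborhoods exactly as listed and derives the same conclusions. It is only an illustration: the helper names are made up here, and because the neighborhoods of p1, p2, m1, m2 and s1 are not given, the code assumes they are not core objects.

```python
from collections import deque

MIN_PTS = 3
# Neighborhoods as listed in the example. p1, p2, m1, m2 and s1 have no entry,
# so they are treated as non-core (an assumption, not stated in the text).
neighborhood = {
    "p": {"m", "p", "p1", "p2", "o"},
    "m": {"m", "q", "p", "m1", "m2"},
    "q": {"q", "m"},
    "o": {"o", "p", "s"},
    "s": {"o", "s", "s1"},
}

core = {x for x, nb in neighborhood.items() if len(nb) >= MIN_PTS}


def reachable_from(start):
    """All objects density-reachable from start (empty if start is not core)."""
    if start not in core:
        return set()
    seen, queue = {start}, deque([start])
    while queue:
        current = queue.popleft()
        for nxt in neighborhood[current]:
            if nxt not in seen:
                seen.add(nxt)
                if nxt in core:  # only core objects extend the chain
                    queue.append(nxt)
    return seen - {start}


def density_connected(a, b):
    """True if some core object o reaches both a and b (o itself counts too)."""
    return any({a, b} <= (reachable_from(o) | {o}) for o in core)


print(sorted(core))                 # ['m', 'o', 'p', 's']
print("m" in neighborhood["p"])     # True: m is directly density-reachable from p
print("q" in reachable_from("p"))   # True
print(density_connected("q", "s"))  # True
print(reachable_from("q"))          # set(): q is not core, so nothing is reachable from it
```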

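DBSCAN algorithm:

Input: a database of n objects, the radius ε, and the minimum number of points MinPts.

Output: all clusters that satisfy the density requirement.

(1) REPEAT

(2) Take an unprocessed point from the database;

(3) IF the point is a core point THEN find all objects density-reachable from it and form a cluster;

(4) ELSE the point is a non-core point (a border point or noise): skip it and move on to the next point;

(5) UNTIL all points have been processed.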
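The steps above can be sketched in a few lines of Python. This is a minimal, brute-force illustration rather than a reference implementation: it assumes Euclidean distance, scans all points for every neighborhood query, and the names `dbscan`, `region_query` and `NOISE` are chosen here for readability.

```python
import math

NOISE = -1


def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, 2, ... for clusters, NOISE (-1) for noise."""
    n = len(points)
    labels = [None] * n  # None means "not processed yet"

    def region_query(i):
        # Indices of all points in the eps-neighborhood of point i (i included).
        return [j for j in range(n) if math.dist(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(n):                    # steps (1)-(2): next unprocessed point
        if labels[i] is not None:
            continue
        neighbors = region_query(i)
        if len(neighbors) < min_pts:      # step (4): not a core point, move on
            labels[i] = NOISE             # provisional: may turn out to be a border point
            continue
        cluster += 1                      # step (3): core point, start a new cluster
        labels[i] = cluster
        queue = list(neighbors)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:        # skipped earlier, but reachable: border point
                labels[j] = cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(j)
            if len(j_neighbors) >= min_pts:  # j is core too: keep expanding through it
                queue.extend(j_neighbors)
    return labels


# Hypothetical sample: two tight groups and one isolated point.
points = [(0.0, 0.0), (0.0, 0.1), (0.1, 0.0),
          (5.0, 5.0), (5.0, 5.1), (5.1, 5.0),
          (9.0, 9.0)]
print(dbscan(points, eps=0.3, min_pts=3))  # [0, 0, 0, 1, 1, 1, -1]
```

Labeling a non-core point as noise is only provisional here: if a later cluster reaches it, it is relabeled as a border point of that cluster.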

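DBSCAN is sensitive to its user-defined parameters: a small change in ε or MinPts can produce very different results. There is no general rule for choosing them, so in practice they are usually set by experience and experimentation.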
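A tiny experiment makes the sensitivity concrete. The sketch below is deliberately restricted to a special case, 1-D data with MinPts = 2 (the point itself counted), where a cluster is simply a run of sorted values whose consecutive gaps are all at most ε. The function name `count_clusters_1d` and the data are made up for illustration.

```python
def count_clusters_1d(values, eps):
    """Number of DBSCAN clusters for 1-D data when MinPts = 2 (the point itself counts).

    In this special case a cluster is a run of sorted values whose consecutive
    gaps are all <= eps and that contains at least two values.
    """
    values = sorted(values)
    clusters, run_length = 0, 1
    for previous, current in zip(values, values[1:]):
        if current - previous <= eps:
            run_length += 1
        else:
            clusters += run_length >= 2  # close the finished run
            run_length = 1
    clusters += run_length >= 2          # close the last run
    return clusters


data = [0, 10, 20, 31, 41, 51]
for eps in (9, 10, 11):
    print(eps, count_clusters_1d(data, eps))
# 9 0    every point is isolated, so everything is noise
# 10 2   two clusters: {0, 10, 20} and {31, 41, 51}
# 11 1   the gap of 11 is bridged and the two clusters merge
```

Three neighboring values of ε give zero, two, and one cluster on the same data.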

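Advantages

1. Unlike K-means, DBSCAN does not need to know the number of clusters in advance.

2. Unlike K-means, DBSCAN can find clusters of arbitrary shape.

3. DBSCAN can also identify noise points.

4. DBSCAN is largely insensitive to the order of the samples in the database, that is, the input order of the patterns has little effect on the result. The exception is a border sample lying between two clusters: its assignment can vary depending on which cluster is discovered first.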
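The order dependence mentioned in point 4 can be reproduced on a toy example. The sketch below works on precomputed ε-neighborhoods instead of coordinates. The object names, the neighborhoods and the function `cluster_in_order` are all hypothetical. Object x is a border point that lies in the neighborhoods of two core objects, a and c, which belong to different clusters.

```python
def cluster_in_order(neighborhood, min_pts, order):
    """Assign cluster ids while visiting objects in the given order.

    Objects missing from the returned dict are noise.
    """
    core = {x for x, nb in neighborhood.items() if len(nb) >= min_pts}
    label = {}
    next_id = 0
    for start in order:
        if start in label or start not in core:
            continue
        next_id += 1
        label[start] = next_id
        stack = [start]
        while stack:
            current = stack.pop()
            for nxt in neighborhood[current]:
                if nxt not in label:
                    label[nxt] = next_id  # the first cluster to reach nxt keeps it
                    if nxt in core:
                        stack.append(nxt)
    return label


# With min_pts = 4 only a and c are core; x (3 neighbors) is a shared border point.
neighborhood = {
    "a": {"a", "a1", "a2", "x"}, "a1": {"a1", "a", "a2"}, "a2": {"a2", "a", "a1"},
    "c": {"c", "c1", "c2", "x"}, "c1": {"c1", "c", "c2"}, "c2": {"c2", "c", "c1"},
    "x": {"x", "a", "c"},
}
order = ["a", "a1", "a2", "x", "c", "c1", "c2"]

first = cluster_in_order(neighborhood, 4, order)
second = cluster_in_order(neighborhood, 4, list(reversed(order)))
print(first["x"] == first["a"])    # True: x joins a's cluster when a is visited first
print(second["x"] == second["c"])  # True: x joins c's cluster when c is visited first
```

In both runs the two groups around a and c stay the same. Only the membership of x changes.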

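Disadvantages

1. DBSCAN does not cope well with high-dimensional data.

2. DBSCAN does not cope well with data sets whose density varies, since a single ε and MinPts apply to all clusters.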
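One common explanation for the first drawback is that Euclidean distances tend to concentrate as the dimension grows: the nearest and the farthest neighbor of a point end up at similar distances, so no radius ε separates dense from sparse regions well. The sketch below illustrates this effect under assumptions that are not from the original text, namely uniformly random data in the unit cube. The helper name `relative_contrast` is made up here.

```python
import math
import random


def relative_contrast(dim, n_points=200, seed=0):
    """(farthest - nearest) / nearest distance from one query point to random points."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    distances = [
        math.dist(query, [rng.random() for _ in range(dim)])
        for _ in range(n_points)
    ]
    return (max(distances) - min(distances)) / min(distances)


# The contrast shrinks as the dimension grows.
for dim in (2, 10, 100, 1000):
    print(dim, round(relative_contrast(dim), 2))
```

The exact numbers depend on the seed and the sample size. The point of the experiment is only the downward trend.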

Code

# coding=utf-8

from collections import defaultdict, deque

import matplotlib.pyplot as plt
import numpy as np


class DbScan(object):

    def show(self, data, color=None):
        # Helper (not called by db_scan): draw the raw data set in red and
        # overlay `data`, an N x 2 array-like, in `color`.
        if not color:
            color = 'green'
        group = np.array(self.createDataSet())  # list -> array so [:, 0] slicing works
        data = np.asarray(data)
        fig = plt.figure(1)
        axes = fig.add_subplot(111)
        axes.scatter(group[:, 0], group[:, 1], s=40, c='red')
        axes.scatter(data[:, 0], data[:, 1], s=50, c=color)
        plt.show()

    def createDataSet(self):
        group = [[1.0, 1.1], [1.0, 1.0],
                 [0, 0], [0, 0.1],
                 [2, 1.0], [2.1, 0.9],
                 [0.3, 0.0], [1.1, 0.9],
                 [2.2, 1.0], [2.1, 0.8],
                 [3.3, 3.5], [2.1, 0.9],
                 [2, 1.0], [2.1, 0.9],
                 [3.5, 3.4], [3.6, 3.5]
                 ]
        return group

    def dist(self, p1, p2):
        # Euclidean distance on the first two coordinates (x, y).
        return ((p1[0] - p2[0]) ** 2 + (p1[1] - p2[1]) ** 2) ** 0.5

    def db_scan(self):
        all_points = self.createDataSet()
        E = 0.3       # radius of the E-neighborhood
        minPts = 3    # minimum neighborhood size, the point itself included

        # Step 1: find the core points.
        other_points = []
        core_points = []     # core points
        plotted_points = []  # every non-noise point (core + border)
        for point in all_points:
            point.append(0)  # third field is the cluster label; 0 = unassigned
            total = 0
            for otherPoint in all_points:
                if self.dist(otherPoint, point) <= E:
                    total += 1  # size of the E-neighborhood of the current point

            if total >= minPts:
                core_points.append(point)
                plotted_points.append(point)
            else:
                other_points.append(point)

        # Step 2: find the border points, i.e. non-core points within E of a core point.
        border_points = []
        for other in other_points:
            if any(self.dist(core, other) <= E for core in core_points):
                border_points.append(other)
                plotted_points.append(other)

        # Step 3: grow one cluster from every core point that is still unlabeled.
        cluster_label = 0
        for point in core_points:
            if point[2] != 0:
                continue  # already absorbed into an earlier cluster
            cluster_label += 1
            point[2] = cluster_label
            queue = deque([point])
            while queue:
                current = queue.popleft()
                for point2 in plotted_points:
                    if point2[2] == 0 and self.dist(point2, current) <= E:
                        point2[2] = cluster_label
                        if point2 in core_points:
                            queue.append(point2)  # only core points keep expanding

        # Step 4: group the labeled points by cluster.
        # Each value holds two lists: the x coordinates and the y coordinates.
        cluster_list = defaultdict(lambda: [[], []])
        for point in plotted_points:
            cluster_list[point[2]][0].append(point[0])  # x
            cluster_list[point[2]][1].append(point[1])  # y

        markers = ['+', '*', '.', 'd', '^', 'v', '>', '<', 'p']

        # Plot the clusters, cycling through the markers.
        print(dict(cluster_list))
        for i, label in enumerate(cluster_list):
            cluster = cluster_list[label]
            plt.plot(cluster[0], cluster[1], markers[i % len(markers)])

        # Plot the noise points as well.
        noise_points = []
        for point in all_points:
            if point not in core_points and point not in border_points:
                noise_points.append(point)
        noisex = []
        noisey = []
        for point in noise_points:
            noisex.append(point[0])
            noisey.append(point[1])
        plt.plot(noisex, noisey, "x")
        # plt.title(str(len(cluster_list)) + " clusters created with E =" + str(E) + " Min Points=" + str(
        #     minPts) + " total points=" + str(len(all_points)) + " noise Points = " + str(len(noise_points)))
        plt.axis((-1, 5, -1, 5))
        plt.show()

    def start(self):
        self.db_scan()


if __name__ == '__main__':
    p = DbScan()
    p.start()

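On 2-D data DBSCAN often gives better results than K-means, though each has its trade-offs: DBSCAN takes density parameters (ε and MinPts) as input, whereas K-means takes the number of clusters.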