Unlike k-means and LVQ, which characterize cluster structure with prototype vectors, Mixture-of-Gaussians clustering expresses the cluster prototypes with a probabilistic model.

Definition of the multivariate Gaussian probability density function

$$p(x)=\frac{1}{(2\pi)^{\frac{n}{2}}|\Sigma|^{\frac{1}{2}}}e^{-\frac{1}{2}(x-\mu)^{T}\Sigma^{-1}(x-\mu)} \tag{1}$$

where $n$ is the dimension of the sample $x$, $\Sigma$ is the $n \times n$ covariance matrix, and $\mu$ is the $n$-dimensional mean vector. The multivariate Gaussian is fully determined by $\mu$ and $\Sigma$, so for convenience we will write the density as $p(x \mid \mu, \Sigma)$.
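As a quick sanity check on Eq. (1), the density can be evaluated directly with NumPy and compared against `scipy.stats.multivariate_normal` (a small sketch with made-up values for $\mu$, $\Sigma$, and $x$; it assumes SciPy is available):

```python
import numpy as np
from scipy.stats import multivariate_normal

def gaussian_pdf(x, mu, Sigma):
    # Density of Eq. (1); note |Sigma| is the determinant of the covariance
    n = x.shape[0]
    diff = x - mu
    norm = (2 * np.pi) ** (n / 2) * np.linalg.det(Sigma) ** 0.5
    return np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff) / norm

mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.3], [0.3, 1.0]])
x = np.array([0.5, 0.5])
print(gaussian_pdf(x, mu, Sigma))
print(multivariate_normal(mu, Sigma).pdf(x))  # SciPy agrees with Eq. (1)
```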

Gaussian mixture distribution

$$p_{M}(x)=\sum_{i=1}^{k}\alpha_{i}\,p(x\mid\mu_{i},\Sigma_{i}) \tag{2}$$

where $\alpha_i$ is the "mixing coefficient" of the $i$-th component. Note that $\alpha_i > 0$ and $\sum_{i=1}^{k}\alpha_i = 1$, which guarantees that the mixture density integrates to 1.
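To make Eq. (2) concrete, here is a small sketch of a hypothetical three-component mixture in 2-D (the coefficients, means, and covariances below are illustrative values, not from the text):

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical three-component mixture in 2-D
alphas = np.array([0.5, 0.3, 0.2])   # mixing coefficients, sum to 1
mus = [np.zeros(2), np.array([3.0, 3.0]), np.array([-3.0, 0.0])]
Sigmas = [np.eye(2), 0.5 * np.eye(2), 2.0 * np.eye(2)]

def p_M(x):
    # Eq. (2): weighted sum of the component densities
    return sum(a * multivariate_normal(m, S).pdf(x)
               for a, m, S in zip(alphas, mus, Sigmas))

print(p_M(np.zeros(2)))  # density of the mixture at the origin
```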

Suppose the sample set $D$ is generated by the Gaussian mixture distribution, and let $z_j \in \{1,2,\ldots,k\}$ denote the component that generated sample $x_j$. By Bayes' rule, the posterior probability of $z_j$ is
$$p_{M}(z_{j}=i\mid x_{j})=\frac{P(z_{j}=i)\,p_{M}(x_{j}\mid z_{j}=i)}{p_{M}(x_{j})}=\frac{\alpha_{i}\,p(x_{j}\mid\mu_{i},\Sigma_{i})}{\sum_{l=1}^{k}\alpha_{l}\,p(x_{j}\mid\mu_{l},\Sigma_{l})} \tag{3}$$

$p_{M}(z_{j}=i\mid x_{j})$ gives the probability that sample $x_j$ belongs to cluster $i$, so at prediction time each sample is assigned to the cluster with the largest posterior. For brevity, we write this posterior as $\gamma_{ji}$ ($i=1,2,\ldots,k$).
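Eq. (3) can be sketched directly: compute the numerator $\alpha_i\,p(x_j\mid\mu_i,\Sigma_i)$ for every component and normalize (again with illustrative parameter values):

```python
import numpy as np
from scipy.stats import multivariate_normal

alphas = np.array([0.6, 0.4])
mus = [np.array([0.0, 0.0]), np.array([4.0, 4.0])]
Sigmas = [np.eye(2), np.eye(2)]

def posteriors(x):
    # Eq. (3): alpha_i * p(x | mu_i, Sigma_i), normalized over components
    joint = np.array([a * multivariate_normal(m, S).pdf(x)
                      for a, m, S in zip(alphas, mus, Sigmas)])
    return joint / joint.sum()

gamma = posteriors(np.array([0.1, -0.2]))
print(gamma.argmax())  # 0: the point lies near the first mean
```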

Solving the model

We solve for the parameters $(\alpha_i,\mu_i,\Sigma_i \mid i=1,2,\ldots,k)$ of Eq. (2) by maximizing the log-likelihood:
$$LL(D)=\ln\Big(\prod_{j=1}^{m}p_{M}(x_{j})\Big)=\sum_{j=1}^{m}\ln\Big(\sum_{i=1}^{k}\alpha_{i}\,p(x_{j}\mid\mu_{i},\Sigma_{i})\Big) \tag{4}$$

(αi,μi,Σii=1,2,...k)(\alpha_{i} ,\mu _{i},\Sigma _{i}|i=1,2,...k)能使(4)最大化,则由LL(D)μi=0\frac{\partial LL(D)}{\partial \mu _{i}}=0可得:
$$\mu_{i}=\frac{\sum_{j=1}^{m}\gamma_{ji}x_{j}}{\sum_{j=1}^{m}\gamma_{ji}} \tag{5}$$
LL(D)Σi=0\frac{\partial LL(D)}{\partial \Sigma _{i}}=0可得:
$$\Sigma_{i}=\frac{\sum_{j=1}^{m}\gamma_{ji}(x_{j}-\mu_{i})(x_{j}-\mu_{i})^{T}}{\sum_{j=1}^{m}\gamma_{ji}} \tag{6}$$
For $\alpha_i$, we must maximize $LL(D)$ subject to the constraints $\alpha_i > 0$ and $\sum_{i=1}^{k}\alpha_i=1$, so we introduce a Lagrange multiplier:
$$LL(D)+\lambda\Big(\sum_{i=1}^{k}\alpha_{i}-1\Big)$$
Setting the derivative of this expression with respect to $\alpha_i$ to zero and multiplying both sides by $\alpha_i$ yields:
$$\sum_{j=1}^{m}\gamma_{ji}+\lambda\alpha_{i}=0$$
Summing over all $k$ mixture components gives:
$$\sum_{j=1}^{m}\sum_{i=1}^{k}\gamma_{ji}+\lambda\sum_{i=1}^{k}\alpha_{i}=0$$
Since $\sum_{i=1}^{k}\gamma_{ji}=1$ (the posteriors over components sum to 1) and $\sum_{i=1}^{k}\alpha_{i}=1$, we get
$$\lambda=-m$$
and therefore:
$$\alpha_{i}=\frac{1}{m}\sum_{j=1}^{m}\gamma_{ji} \tag{7}$$
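The three updates (5)-(7) can be written in vectorized NumPy form. In this sketch, `Gamma` stands for the $m \times k$ matrix of posteriors $\gamma_{ji}$ (here filled with random row-normalized values purely to exercise the formulas):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 200, 2, 3
X = rng.normal(size=(m, n))
Gamma = rng.random((m, k))
Gamma /= Gamma.sum(axis=1, keepdims=True)   # each row of posteriors sums to 1

Nk = Gamma.sum(axis=0)                      # effective sample count per component
mus = (Gamma.T @ X) / Nk[:, None]           # Eq. (5)
Sigmas = np.empty((k, n, n))
for i in range(k):
    d = X - mus[i]                          # centered data
    Sigmas[i] = (Gamma[:, i, None] * d).T @ d / Nk[i]   # Eq. (6)
alphas = Nk / m                             # Eq. (7)

print(alphas.sum())   # sums to 1 by construction
```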

With (5), (6), and (7) we can run the EM algorithm: in each iteration, first compute the posterior probability $\gamma_{ji}$ of each sample belonging to each component from the current parameters (E-step), then update the three parameters with these new posteriors (M-step), repeating until a stopping condition is met (maximum number of iterations reached, or the likelihood grows too slowly).
(Figure: the Gaussian mixture clustering algorithm, from *Machine Learning* by Zhou Zhihua)

Python 3.6 implementation:

# -*- coding: utf-8 -*-

import numpy as np
from sklearn.datasets import make_blobs, make_moons
import matplotlib.pyplot as plt


class MoG():
    def __init__(self, k=3, max_iter=20, e=0.0001):
        self.k = k
        self.max_iter = max_iter
        self.e = e
        self.ll = 0

    def dist(self, x1, x2):
        return np.linalg.norm(x1 - x2)

    def get_miu(self, X):
        # Farthest-point initialization: pick one random sample, then
        # repeatedly pick the sample farthest from the means chosen so far
        index = np.random.choice(X.shape[0], 1, replace=False)
        mius = []
        mius.append(X[index])

        for _ in range(self.k - 1):
            max_dist_index = 0
            max_distance = 0
            for j in range(X.shape[0]):
                min_dist_with_miu = np.inf

                for miu in mius:
                    dist_with_miu = self.dist(miu, X[j])
                    if min_dist_with_miu > dist_with_miu:
                        min_dist_with_miu = dist_with_miu

                if max_distance < min_dist_with_miu:
                    max_distance = min_dist_with_miu
                    max_dist_index = j
            mius.append(X[max_dist_index])

        mius_array = np.array([])
        for i in range(self.k):
            if i == 0:
                mius_array = mius[i]
            else:
                mius[i] = mius[i].reshape(mius[0].shape)
                mius_array = np.append(mius_array, mius[i], axis=0)

        return mius_array

    def p(self, x, label):
        # Multivariate Gaussian density of Eq. (1) for component `label`
        miu = self.mius_array[label].reshape(1, -1)

        covdet = np.linalg.det(self.Sigma[label])  # determinant |Sigma|
        covinv = np.linalg.inv(self.Sigma[label])  # inverse of Sigma

        if covdet < 1e-5:              # regularize a near-singular covariance
            covdet = np.linalg.det(
                self.Sigma[label] + np.eye(x.shape[0]) * 0.001)
            covinv = np.linalg.inv(
                self.Sigma[label] + np.eye(x.shape[0]) * 0.001)
        a = np.float_power(
            2 * np.pi, x.shape[0] / 2) * np.float_power(covdet, 0.5)
        b = -0.5 * float((x - miu) @ covinv @ (x - miu).T)
        return np.exp(b) / a

    def pM(self, x, label):
        # Posterior gamma of Eq. (3)
        pm = 0
        for i in range(self.k):
            pm += self.Alpha[i] * self.p(x, i)
        return self.Alpha[label] * self.p(x, label) / pm

    def update_miu(self, X, label):
        # M-step update of Eq. (5)
        a = 0
        b = 0
        for i in range(X.shape[0]):
            a += self.Gamma[i][label] * X[i]
            b += self.Gamma[i][label]
        if b == 0:
            b = 1e-10
        return a / b

    def update_sigma(self, X, label):
        # M-step update of Eq. (6)
        a = 0
        b = 0
        miu = self.mius_array[label].reshape(1, -1)
        for i in range(X.shape[0]):
            xi = X[i].reshape(1, -1)
            a += self.Gamma[i][label] * (xi - miu).T @ (xi - miu)
            b += self.Gamma[i][label]
        if b == 0:
            b = 1e-10
        return a / b

    def update_alpha(self, X, label):
        # M-step update of Eq. (7)
        a = 0
        for i in range(X.shape[0]):
            a += self.Gamma[i][label]
        return a / X.shape[0]

    def LL(self, X):
        # Log-likelihood of Eq. (4): sum of ln(sum_j alpha_j * p(x | mu_j, Sigma_j))
        ll = 0
        for i in range(X.shape[0]):
            before_ln = 0
            for j in range(self.k):
                before_ln += self.Alpha[j] * self.p(X[i], j)
            ll += np.log(before_ln)
        return ll

    def fit(self, X):
        self.Alpha = np.array([1 / self.k] * self.k)  # initialize alpha
        self.mius_array = self.get_miu(X)  # initialize miu
        self.Sigma = np.array(
            [np.eye(X.shape[1], dtype=float) * 0.1] * self.k)  # initialize sigma
        self.Gamma = np.zeros([X.shape[0], self.k])

        Y = np.zeros([X.shape[0]])
        iter = 0
        while iter < self.max_iter:
            old_ll = self.ll
            for i in range(X.shape[0]):
                for j in range(self.k):
                    self.Gamma[i][j] = self.pM(X[i], j)
            for i in range(self.k):
                self.mius_array[i] = self.update_miu(X, i)
                self.Sigma[i] = self.update_sigma(X, i)
                self.Alpha[i] = self.update_alpha(X, i)
            self.ll = self.LL(X)
            print(self.ll)
            if abs(self.ll - old_ll) < self.e:
                print('Converged after {} iterations'.format(iter))
                break
            iter += 1

        if iter == self.max_iter:
            print("Reached the maximum of {} iterations, stopping".format(
                self.max_iter))
        for i in range(X.shape[0]):
            tmp_y = -1
            tmp_gamma = -1
            for j in range(self.k):
                if tmp_gamma < self.Gamma[i][j]:
                    tmp_gamma = self.Gamma[i][j]
                    tmp_y = j
            Y[i] = tmp_y
        return Y


if __name__ == '__main__':

    fig = plt.figure(1)

    plt.subplot(221)
    center = [[1, 1], [-1, -1], [1, -1]]
    cluster_std = 0.35
    X1, Y1 = make_blobs(n_samples=1000, centers=center,
                        n_features=2, cluster_std=cluster_std, random_state=1)
    plt.scatter(X1[:, 0], X1[:, 1], marker='o', c=Y1)

    plt.subplot(222)
    km1 = MoG(k=3, max_iter=20)
    km_Y1 = km1.fit(X1)
    mius = km1.mius_array
    plt.scatter(X1[:, 0], X1[:, 1], marker='o', c=km_Y1)
    plt.scatter(mius[:, 0], mius[:, 1], marker='^', c='r')

    plt.subplot(223)
    X2, Y2 = make_moons(n_samples=1000, noise=0.1)
    plt.scatter(X2[:, 0], X2[:, 1], marker='o', c=Y2)

    plt.subplot(224)
    km2 = MoG(k=2, max_iter=20)
    km_Y2 = km2.fit(X2)
    mius = km2.mius_array
    plt.scatter(X2[:, 0], X2[:, 1], marker='o', c=km_Y2)
    plt.scatter(mius[:, 0], mius[:, 1], marker='^', c='r')

    plt.show()

The resulting clustering plots:
(Figure: clustering results on the blobs dataset, top row, and the moons dataset, bottom row)
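For comparison, the same model can be fitted with scikit-learn's `GaussianMixture`, which implements this same EM procedure (this is a separate sketch, not part of the script above):

```python
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=1000, centers=[[1, 1], [-1, -1], [1, -1]],
                  cluster_std=0.35, random_state=1)
gm = GaussianMixture(n_components=3, max_iter=20, random_state=1).fit(X)
labels = gm.predict(X)
print(gm.weights_)   # mixing coefficients alpha
print(gm.means_)     # component means mu
```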

References

*Machine Learning*, Zhou Zhihua
