Background

  • Variance: $var(X) = E((X - E(X))^2)$
  • Covariance: $cov(X, Y) = E((X - E(X))(Y - E(Y)))$
  • Correlation coefficient: $\rho(X, Y) = \frac{cov(X, Y)}{\sqrt{var(X)var(Y)}}$
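These definitions can be checked numerically; a minimal NumPy sketch with synthetic data:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = 0.5 * x + rng.normal(size=1000)

var_x = np.mean((x - x.mean()) ** 2)               # var(X) = E((X - E(X))^2)
var_y = np.mean((y - y.mean()) ** 2)
cov_xy = np.mean((x - x.mean()) * (y - y.mean()))  # cov(X, Y)
rho = cov_xy / np.sqrt(var_x * var_y)              # correlation coefficient

print(abs(rho) <= 1.0)  # rho always lies in [-1, 1]
```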

1. Idea: principal component analysis is a form of dimensionality reduction, mapping the coordinates $(x_1, x_2)$ down to the coordinate $y_1$ of $(y_1, y_2)$. Concretely, the idea is to maximize $OA'^2 + OB'^2 + OC'^2$ (the variance) in the figure below: the three original points are projected onto the line $y = x$, so in effect we look for the line that maximizes this variance.

(Figure: the points projected onto the line $y = x$)
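The projection idea can be sketched in NumPy (hypothetical 2D points; the direction of $y = x$ is used as the candidate line):

```python
import numpy as np

# hypothetical 2D points, centered before projecting
pts = np.array([[3.0, 2.5], [1.0, 1.2], [2.2, 2.9], [0.1, 0.3]])
pts = pts - pts.mean(axis=0)

u = np.array([1.0, 1.0]) / np.sqrt(2.0)  # unit vector along the line y = x
proj = pts @ u                           # signed lengths OA', OB', OC', ...

print(np.sum(proj ** 2))                 # the quantity PCA maximizes over u
```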

2. Population principal components

Random vector: $\vec{x} = (x_1, \cdots, x_m)^T$

Mean vector: $\vec{\mu} = (\mu_1, \cdots, \mu_m)^T$

Covariance matrix: $\Sigma = cov(\vec{x}, \vec{x}) = E((\vec{x} - \vec{\mu})(\vec{x} - \vec{\mu})^T)$

Map $\vec{x}$ to $\vec{y} = (y_1, y_2, \cdots, y_m)^T$ via

$$
\vec{\alpha}_i = (\alpha_{1i}, \cdots, \alpha_{mi})^T, \qquad
y_i = \vec{\alpha}_i^T\vec{x} = \alpha_{1i}x_1 + \cdots + \alpha_{mi}x_m
$$

This gives the following statistics of $\vec{y}$:

$$
\begin{cases}
E(y_i) = \vec{\alpha}_i^T \vec{\mu}\\
var(y_i) = \vec{\alpha}_i^T \Sigma \vec{\alpha}_i\\
cov(y_i, y_j) = \vec{\alpha}_i^T \Sigma \vec{\alpha}_j
\end{cases}
$$
Definition of principal components: the linear transformation above satisfies the following conditions

(1) each $\vec{\alpha}_i$ is a unit vector: $\vec{\alpha}_i^T\vec{\alpha}_i = 1, i = 1, 2, \cdots, m$;

(2) $y_i$ and $y_j$ are uncorrelated, i.e. $cov(y_i, y_j) = 0$ for $i \neq j$;

(3) $y_1$ has the largest variance among all such linear transformations of $\vec{x}$, i.e. $var(y_1)$ is maximal; $y_2$ is uncorrelated with $y_1$ and has the largest variance among the remaining transformations, and so on.

3. Computing the principal components

Let $\Sigma$ be the covariance matrix of $\vec{x}$,

with eigenvalues $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_m \geq 0$

and corresponding unit eigenvectors $\vec{\alpha}_1, \cdots, \vec{\alpha}_m$.

Then the k-th principal component is $y_k = \vec{\alpha}_k^T \vec{x}$.

Corollary: the components of $\vec{y} = (y_1, \cdots, y_m)^T$ are, in order, the first through m-th principal components of $\vec{x}$ if and only if

(1) $\vec{y} = A^T \vec{x}$ with $A = (\vec{\alpha}_1, \cdots, \vec{\alpha}_m)$, where $\vec{\alpha}_k$ is the unit eigenvector corresponding to $\lambda_k$;

(2) $cov(\vec{y}) = diag(\lambda_1, \cdots, \lambda_m)$ with $\lambda_1 \geq \cdots \geq \lambda_m$.
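The corollary can be verified numerically: projecting onto the eigenvectors of the covariance matrix yields uncorrelated components whose variances are the eigenvalues. A minimal NumPy sketch with synthetic data:

```python
import numpy as np

rng = np.random.default_rng(1)
# synthetic correlated data: 500 samples, 3 variables
X = rng.normal(size=(500, 3)) @ np.array([[2.0, 0, 0], [0.5, 1.0, 0], [0, 0, 0.3]])
Sigma = np.cov(X, rowvar=False)          # covariance matrix of x

lam, A = np.linalg.eigh(Sigma)           # eigh returns ascending eigenvalues
lam, A = lam[::-1], A[:, ::-1]           # reorder so lambda_1 >= ... >= lambda_m
Y = (X - X.mean(axis=0)) @ A             # y = A^T x for every sample

# cov(y) equals diag(lambda_1, ..., lambda_m) up to floating-point error
print(np.allclose(np.cov(Y, rowvar=False), np.diag(lam)))
```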

4. Properties of the population principal components $\vec{y}$

(1) $cov(\vec{y}) = \Lambda = diag(\lambda_1, \cdots, \lambda_m)$

(2) $\sum^m_{i = 1}\lambda_i = \sum^m_{i = 1}\sigma_{ii}$, where $\sigma_{ii}$ is the variance of $x_i$, i.e. the i-th diagonal element of $\Sigma$

(3) the correlation coefficient $\rho(y_k, x_i)$ is called the factor loading:

$$
\rho(y_k, x_i) = \frac{\sqrt{\lambda_k}\,\alpha_{ik}}{\sqrt{\sigma_{ii}}}
$$

(4) $\sum^m_{i = 1}\sigma_{ii}\rho^2(y_k, x_i) = \lambda_k$

(5) $\sum^m_{k = 1}\rho^2(y_k, x_i) = 1$

(6) contribution rate: $\eta_k = \frac{\lambda_k}{\sum^m_{i = 1}\lambda_i}$

The cumulative contribution rate of the first k principal components $y_1, \cdots, y_k$ is

$$
\sum^k_{i = 1}\eta_i = \frac{\sum^k_{i = 1}\lambda_i}{\sum^m_{i = 1}\lambda_i}
$$
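Contribution rates follow directly from the eigenvalues; a small sketch with hypothetical eigenvalues:

```python
import numpy as np

lam = np.array([4.0, 2.0, 1.0, 0.5, 0.5])  # hypothetical eigenvalues, descending
eta = lam / lam.sum()                      # contribution rate of each component
cum = np.cumsum(eta)                       # cumulative contribution rate

print(eta[0])   # 0.5  : first component explains half the total variance
print(cum[1])   # 0.75 : first two components together explain 75%
```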

5. Number of principal components: to keep the first q principal components, use $\vec{y} = A_q^T \vec{x}$, where $A_q$ consists of the first q columns of the orthogonal matrix $A$ from the corollary in _Computing the principal components_.

6. Standardized variables

Idea: to remove differences in variance caused by differing units, standardize each variable as $x^*_i = \frac{x_i - E(x_i)}{\sqrt{var(x_i)}}$. The covariance matrix $\Sigma^*$ of $\vec{x}^*$ is then the correlation matrix $R$. Let $\vec{e}_1^*, \cdots, \vec{e}_m^*$ be the unit eigenvectors of $R$.

(1) $cov(\vec{y}^*) = \Lambda^* = diag(\lambda_1^*, \cdots, \lambda^*_m)$

(2) $\sum^m_{k = 1}\lambda_k^* = m$

(3) $\rho(y_k^*, x_i^*) = \sqrt{\lambda_k^*}\,e^*_{ik}$, where $e_{ik}^*$ is the i-th component of the unit eigenvector $\vec{e}_k^* = (e_{1k}^*, \cdots, e_{mk}^*)^T$

(4) $\sum^m_{i = 1}\rho^2(y_k^*, x_i^*) = \sum^m_{i = 1}\lambda_k^*e_{ik}^{*2} = \lambda_k^*$

(5) $\sum^m_{k = 1}\rho^2(y_k^*, x_i^*) = 1$
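Property (2) can be checked numerically: after standardization the covariance matrix is the correlation matrix, whose eigenvalues sum to m. A sketch with synthetic data on very different scales:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 4)) * np.array([1.0, 10.0, 0.1, 5.0])  # mixed scales

Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)  # x* = (x - E(x)) / sqrt(var(x))
R = np.cov(Xs, rowvar=False)                       # covariance of x* is the correlation matrix
lam_star = np.linalg.eigvalsh(R)

print(np.isclose(lam_star.sum(), 4.0))             # eigenvalues of R sum to m = 4
```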

7. Sample principal component analysis

$X = (\vec{x}_1, \cdots, \vec{x}_n)$ consists of n samples, each m-dimensional.

Sample covariance matrix: $S = [s_{ij}]_{m \times m}$, $s_{ij} = \frac{1}{n - 1} \sum^n_{k = 1}(x_{ik} - \bar{x}_i)(x_{jk} - \bar{x}_j)$,

with means $\bar{x}_i = \frac{1}{n}\sum^n_{k = 1}x_{ik}$, $\bar{x}_j = \frac{1}{n} \sum^n_{k = 1}x_{jk}$.

Sample correlation matrix: $R = [r_{ij}]_{m \times m}$, $r_{ij} = \frac{s_{ij}}{\sqrt{s_{ii}s_{jj}}}$

As before, the principal components are computed as

$$
\begin{aligned}
&\vec{y} = (y_1, \cdots, y_m)^T = A^T\vec{x}\\
&y_i = \vec{\alpha}_i^T\vec{x}\\
&A = (\vec{\alpha}_1, \cdots, \vec{\alpha}_m)
\end{aligned}
$$

giving the sample variance $var(y_i) = \frac{1}{n - 1}\sum^n_{j = 1}(\vec{\alpha}_i^T\vec{x}_j - \vec{\alpha}_i^T\bar{\vec{x}})^2 = \vec{\alpha}_i^T S\vec{\alpha}_i$

and covariance $cov(y_i, y_k) = \vec{\alpha}_i^T S \vec{\alpha}_k$.

Standardized samples: $x^*_{ij} = \frac{x_{ij} - \bar{x}_i}{\sqrt{s_{ii}}}$

For standardized samples, the covariance matrix equals the correlation matrix: $S = R = \frac{1}{n - 1}XX^T$.
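A quick NumPy check of this identity (synthetic data; here X is m × n with variables in rows, standardized with the sample standard deviation):

```python
import numpy as np

rng = np.random.default_rng(3)
n, m = 100, 3
X = rng.normal(size=(m, n))                   # m variables in rows, n samples in columns
X = X - X.mean(axis=1, keepdims=True)         # center each row
X = X / X.std(axis=1, ddof=1, keepdims=True)  # each row gets unit sample variance

R = X @ X.T / (n - 1)                         # S = R = (1/(n-1)) X X^T
print(np.allclose(np.diag(R), 1.0))           # standardized rows give 1s on the diagonal
```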

8. Eigendecomposition algorithm on the correlation matrix (given a target contribution rate g)

(1) Standardize the data (for brevity, the star on standardized variables is omitted below).

(2) Compute the correlation matrix $R = [r_{ij}]_{m \times m} = \frac{1}{n - 1} XX^T$.

(3) Solve the characteristic equation $|R - \lambda I| = 0$ for the eigenvalues $\lambda_1 \geq \cdots \geq \lambda_m$.

Keep the k principal components whose cumulative contribution rate reaches g,

and find the corresponding unit eigenvectors $\vec{\alpha}_i = (\alpha_{1i}, \cdots, \alpha_{mi})^T, i = 1, 2, \cdots, k$.

(4) The i-th sample principal component is $y_i = \vec{\alpha}_i^T\vec{x}$.

:普遍的理解是,我们先确定贡献率然后再确定到底保留多少个主成分,但是再sklearn实际应用的过程中是确定保留多少个主成分,然后才能用explained_variance_ratio_看这些主成分的贡献率是多少

而且在实际使用的过程中,哪怕是高维的向量,一般前几个主成分就可以达到99%的贡献率,贡献率并不能完全反映对信息的保留程度
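The sklearn workflow described above can be sketched as follows (synthetic data; scikit-learn assumed available):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(4)
X = rng.normal(size=(300, 10)) @ rng.normal(size=(10, 10))  # synthetic correlated data

pca = PCA(n_components=3).fit(X)            # fix the number of components first ...
print(pca.explained_variance_ratio_)        # ... then inspect their contribution rates
print(pca.explained_variance_ratio_.sum())  # cumulative contribution rate of the 3 kept
```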

9. SVD-based principal component analysis algorithm

Idea: the sample covariance matrix satisfies $S = \frac{1}{n - 1}XX^T = X'^TX'$, where $X' = \frac{1}{\sqrt{n - 1}}X^T$.

It is easy to see that if $X' = U\Sigma V^T$, the column vectors of $V$ are unit eigenvectors of $S$.

Input: the $m\times n$ sample matrix $X$, each of whose rows has mean 0 (already standardized), and the number of principal components k.

Output: the $k \times n$ sample principal component matrix $Y$.

(1) Construct the new $n \times m$ matrix $X' = \frac{1}{\sqrt{n - 1}}X^T$.

(2) Compute a truncated singular value decomposition of $X'$, giving $X' = U_k\Sigma_k V^T_k$.

(3) The sample principal component matrix is $Y = V_k^TX$.
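The three steps can be sketched in NumPy; `pca_svd` is a hypothetical helper name, and `np.linalg.svd` stands in for a truncated SVD by simply keeping the top k right singular vectors:

```python
import numpy as np

def pca_svd(X, k):
    """First k sample principal components of the m x n matrix X
    (rows already mean-zero) via SVD of X' = X^T / sqrt(n - 1)."""
    m, n = X.shape
    Xp = X.T / np.sqrt(n - 1)                          # step (1): the n x m matrix X'
    U, s, Vt = np.linalg.svd(Xp, full_matrices=False)  # step (2): SVD, keep top k below
    Vk = Vt[:k].T                                      # columns of V: unit eigenvectors of S
    return Vk.T @ X                                    # step (3): Y = V_k^T X

rng = np.random.default_rng(5)
X = rng.normal(size=(5, 200))
X = X - X.mean(axis=1, keepdims=True)  # center rows, as the algorithm assumes
Y = pca_svd(X, 2)
print(Y.shape)  # (2, 200)
```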
