从具有均值 μ 和协方差矩阵 S 的多元正态分布中生成 200 个数据点，其中 [关闭]答案

【问题标题】：Generate 200 data points drawn from a multivariate normal distribution with mean μ and covariance matrix S, where [closed]从具有均值 μ 和协方差矩阵 S 的多元正态分布中生成 200 个数据点，其中 [关闭]
【发布时间】：2022-01-12 17:30:30
【问题描述】：

import numpy as np
   from numpy import sin, cos, pi
   from matplotlib.pyplot import *
   rng = np.random.default_rng(42) 
   N = 200
   center = 10, 15
   sigmas = 10, 2
   theta = 20 / 180 * pi  
   # covariance matrix
   rotmat = np.array([[cos(theta), -sin(theta)],[sin(theta), cos(theta)]])
   diagmat = np.diagflat(sigmas)
   mean =np.array([−1,−2,−3])
   # covar = rotmat @ diagmat @ rotmat.T
   covar= np.array([[2, 2 ,0],[2 ,3, 1],[0, 1 ,19]])
   print('covariance matrix:')
   print(covar)`enter code here`
   eigval, eigvec = np.linalg.eigh(covar)
   print(f'eigenvalues: {eigval}\neigenvectors:\n{eigvec}')
   print('angle of eigvector corresponding to larger eigenvalue:',
      180 /pi * np.arctan2(eigvec[1,1], eigvec[0,1]))

    # PCA
mean = data.mean(axis=0)
print('mean:', mean)
# S1: explicit sum
S1 = np.zeros((2,2), dtype=float)
print(len(data))
for i in range(len(data)):
  S1 += np.outer(data[i] - mean, data[i] - mean)
S1 /= len(data)
print(f'S1= (explicit sum)\n{S1}')
# S2: 
S2 = np.cov(data, rowvar=False, bias=True)
print(f'S2= (np.cov)\n{S2}')
# PCA:
lambdas, u = np.linalg.eigh(S2)
print(f'\nPCA\nlambda={lambdas}\nu=\n{u}')
u1 = u[:,1] # largest
print('u1=\n',u1)
print(f'first principal component angle: {180/pi*np.arctan2(u1[1], u1[0])}')

之后，我需要对上述数据执行 PCA 到一个主成分和两个主成分。这两个中的分数解释方差是多少案例

【问题讨论】：

您将center 传递给rng.multivariate_normal(...)。那应该是mean，而不是center。
请将您的相关代码（带有正确缩进）和错误信息以 text 形式发布，not 以图像形式发布。
相关代码从 import NumPy 开始滚动到下面，以 print(f'fractional Explained variance: {var_explained}') 结束，现在，我想将数据的前两个主成分绘制在二维散点图。什么是轴？从数值上验证重建点与原始数据之间的均方距离（测量的剩余方差）接近数据总方差和前两个主成分的解释方差比的预期值。
再次，请在此处以 text 的形式明确发布您的相关代码（print(f'fractional explained variance: {var_explained}') 无处可见 - rng.multivariate_normal 来自哪里？）；由于您使用的是随机生成的数据，因此这是 minimal reproducible example 的完美案例。请不要使用 cmets 空间来添加信息和说明 - 如有必要，请编辑和更新您的帖子。
# PCA mu = data.mean(axis=0) print('mean:', mu) # S1: 显式总和 S1 = np.zeros((2,2), dtype=float) print(len(data)) for i in range(len(data)): S1 += np.outer(data[i] - mu, data[i] - mu) S1 /= len(data) print(f' S1= (explicit sum)\n{S1}') # S2: S2 = np.cov(data, rowvar=False, bias=True) print(f'S2= (np.cov)\n{S2}') # PCA: lambdas, u = np.linalg.eigh(S2) print(f'\nPCA\nlambda={lambdas}\nu=\n{u}') u1 = u[:,1] # 最大打印(' u1=\n',u1) print(f'第一主成分角度：{180/pi*np.arctan2(u1[1], u1[0])}')

标签： python numpy machine-learning pca

【解决方案1】：

要生成数据，您需要两个技巧：

使用特征值-特征向量分解计算协方差矩阵 S 的“平方根”
使用标准公式生成具有给定均值和协方差的随机正态。使用 Numpy，它适用于向量（引用自 help(np.random.randn)）：

For random samples from :math:`N(\mu, \sigma^2)`, use:

``sigma * np.random.randn(...) + mu``

例子：

import numpy as np
import matplotlib.pyplot as plt

# Part 1: generating random normal data with the given mean and covariance
N = 200

# covariance matrix
S = np.array([[2, 2, 0], [2, 3, 1], [0, 1, 19]])

# mean
mu = np.array([[-1, -2, -3]]).T

# get "square root" of covariance matrix via eigenfactorization
w, v = np.linalg.eig(S)
sigma = np.sqrt(w) * v

# ready, set, go!
A = sigma @ np.random.randn(3, N) + mu

print(f'sample covariance:\n{np.cov(A)}')

# sample covariance:
# [[ 1.70899164  1.74288639  0.21190326]
#  [ 1.74288639  2.59595547  1.2822817 ]
#  [ 0.21190326  1.2822817  22.04077608]]

print(f'sample mean:\n{A.mean(axis=1)}')

# sample mean:
# [-1.02385787 -1.87783415 -2.96077204]

# --------------------------------------------
# Part 2: principal component analysis on random data A

# estimate the sample covariance
R = np.cov(A)
# do the PCA
lam, u = np.linalg.eig(R)

# fractional explained variance is the relative magnitude of
# the accumulated eigenvalues

# reorder the eigenvalues & vectors with hottest eigenvalues first
col_order = np.argsort(lam)[::-1]
lam = lam[col_order]
u = u[:, col_order]

print(f'eigenvalues: {lam}')
# eigenvalues: [22.13020272  3.87946467  0.3360558 ]

var_explained = lam.cumsum() / lam.sum()
print(f'fractional explained variance: {var_explained}')
# fractional explained variance: [0.83999223 0.98724439 1.        ]
#                                  ^^ 84% in first dimension alone,
#                                     99% in first two dimensions

# do the projection
B = u.T @ A

# now the variance in B is concentrated in the first two dimensions
covariance after PCA projection:
[[ 2.21302027e+01 -2.68545720e-15 -1.60675493e-15]
 [-2.68545720e-15  3.87946467e+00 -1.19613978e-15]
 [-1.60675493e-15 -1.19613978e-15  3.36055802e-01]]

# scatter plot
plt.plot(B[0], B[1], '.')
plt.axis('equal')
plt.grid('on')
plt.xlabel('principal axis 0')
plt.ylabel('principal axis 1')
plt.title('Random data projected onto two principal axes')


# project back using ONLY a two dimensional subspace of B
#  i.e. drop the last eigenvector
A_approx = u[:,:2] @ B[:2,:]

# error analysis
err3 = A - A_approx
mse = (err3**2).sum(axis=0).mean()

print(f'predicted error variance: {lam[-1]}')
print(f'measured error variance: {mse}')
# predicted error variance: 0.3360558019705344
# measured error variance: 0.41137559916273914

【讨论】：

谢谢，@AbbeGily 如果我想在二维散点图中绘制数据的前两个主要成分怎么办。什么是轴？从数值上验证重建点与原始数据之间的均方距离（测量的剩余方差）接近数据总方差和前两个主成分的解释方差比的预期值。
@Mohammed 响应已更新，包括投影（带散点图）、二维子空间的反向投影和错误分析
这真的很酷，谢谢