【Posted】: 2020-12-04 10:11:10
【Question】:
Following the post GMMs and Maximum Likelihood Optimization Using NumPy, I implemented the EM algorithm for a GMM, without success, as follows:
import numpy as np

eps = 1e-8  # not defined in the code as posted; a small constant is assumed here

def PDF(data, means, variances):
    return 1/(np.sqrt(2 * np.pi * variances) + eps) * np.exp(-1/2 * (np.square(data - means) / (variances + eps)))

def EM_GMM(data, k, iterations):
    weights = np.ones((k, 1)) / k # shape=(k, 1)
    means = np.random.choice(data, k)[:, np.newaxis] # shape=(k, 1)
    variances = np.random.random_sample(size=k)[:, np.newaxis] # shape=(k, 1)

    data = np.repeat(data[np.newaxis, :], k, 0) # shape=(k, n)

    for step in range(iterations):
        # Expectation step
        likelihood = PDF(data, means, np.sqrt(variances)) # shape=(k, n)

        # Maximization step
        b = likelihood * weights # shape=(k, n)
        b /= np.sum(b, axis=1)[:, np.newaxis] + eps

        # update means, variances, and weights
        means = np.sum(b * data, axis=1)[:, np.newaxis] / (np.sum(b, axis=1)[:, np.newaxis] + eps)
        variances = np.sum(b * np.square(data - means), axis=1)[:, np.newaxis] / (np.sum(b, axis=1)[:, np.newaxis] + eps)
        weights = np.mean(b, axis=1)[:, np.newaxis]

    return means, variances
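For comparison, here is a minimal sketch of 1-D EM with the textbook updates. It is not the linked post's code. The names `em_gmm_1d` and `gaussian_pdf`, the constant `EPS = 1e-8`, and the quantile-based initialization are all assumptions made for illustration. It differs from the code above in three places: the density function receives variances rather than their square roots, the responsibilities are normalized across components (axis 0) rather than across data points, and the weights are the mean responsibility per component.

```python
import numpy as np

EPS = 1e-8  # assumed small constant; not specified in the original post


def gaussian_pdf(x, means, variances):
    """Univariate normal density, broadcast to shape (k, n)."""
    var = variances + EPS
    return np.exp(-0.5 * np.square(x - means) / var) / np.sqrt(2.0 * np.pi * var)


def em_gmm_1d(data, k, iterations):
    """Hypothetical helper: fit a k-component 1-D GMM with plain EM."""
    x = np.asarray(data, dtype=float)[np.newaxis, :]            # shape (1, n)
    n = x.shape[1]

    # Assumed initialization: equal weights, means spread over the data
    # quantiles, and one wide variance shared by every component.
    weights = np.full((k, 1), 1.0 / k)                          # shape (k, 1)
    means = np.quantile(x[0], (np.arange(k) + 0.5) / k)[:, np.newaxis]
    variances = np.full((k, 1), x.var())

    for _ in range(iterations):
        # E-step: responsibilities, normalized over components (axis=0)
        resp = weights * gaussian_pdf(x, means, variances)      # shape (k, n)
        resp /= resp.sum(axis=0, keepdims=True) + EPS

        # M-step: weighted mean, weighted variance, mixing weights
        nk = resp.sum(axis=1, keepdims=True) + EPS              # shape (k, 1)
        means = (resp * x).sum(axis=1, keepdims=True) / nk
        variances = (resp * np.square(x - means)).sum(axis=1, keepdims=True) / nk
        weights = nk / n

    return means.ravel(), variances.ravel(), weights.ravel()
```

Called as `em_gmm_1d(data, 3, 100)`, this returns three vectors of length `k` (means, variances, weights) rather than a `(k, n)` matrix.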
When I run the algorithm on a one-dimensional time-series dataset with k equal to 3, it returns output like the following:
array([[0.00000000e+000, 0.00000000e+000, 0.00000000e+000,
3.05053810e-003, 2.36989898e-025, 2.36989898e-025,
1.32797395e-136, 6.91134950e-031, 5.47347807e-001,
1.44637007e+000, 1.44637007e+000, 1.44637007e+000,
1.44637007e+000, 1.44637007e+000, 1.44637007e+000,
1.44637007e+000, 1.44637007e+000, 1.44637007e+000,
1.44637007e+000, 1.44637007e+000, 1.44637007e+000,
1.44637007e+000, 2.25849208e-064, 0.00000000e+000,
1.61228562e-303, 0.00000000e+000, 0.00000000e+000,
0.00000000e+000, 0.00000000e+000, 3.94387272e-242,
1.13078186e+000, 2.53108878e-001, 5.33548114e-001,
9.14920432e-001, 2.07015697e-013, 4.45250680e-038,
1.43000602e+000, 1.28781615e+000, 1.44821615e+000,
1.18186109e+000, 3.21610659e-002, 3.21610659e-002,
3.21610659e-002, 3.21610659e-002, 3.21610659e-002,
2.47382844e-039, 0.00000000e+000, 2.09150855e-200,
0.00000000e+000, 0.00000000e+000],
[5.93203066e-002, 1.01647068e+000, 5.99299162e-001,
0.00000000e+000, 0.00000000e+000, 0.00000000e+000,
0.00000000e+000, 0.00000000e+000, 0.00000000e+000,
0.00000000e+000, 0.00000000e+000, 0.00000000e+000,
0.00000000e+000, 0.00000000e+000, 0.00000000e+000,
0.00000000e+000, 0.00000000e+000, 0.00000000e+000,
0.00000000e+000, 0.00000000e+000, 0.00000000e+000,
0.00000000e+000, 0.00000000e+000, 2.14690238e-010,
2.49337135e-191, 5.10499986e-001, 9.32658804e-001,
1.21148135e+000, 1.13315278e+000, 2.50324069e-237,
0.00000000e+000, 0.00000000e+000, 0.00000000e+000,
0.00000000e+000, 0.00000000e+000, 0.00000000e+000,
0.00000000e+000, 0.00000000e+000, 0.00000000e+000,
0.00000000e+000, 0.00000000e+000, 0.00000000e+000,
0.00000000e+000, 0.00000000e+000, 0.00000000e+000,
0.00000000e+000, 1.73966953e-125, 2.53559290e-275,
1.42960975e-065, 7.57552338e-001],
[0.00000000e+000, 0.00000000e+000, 0.00000000e+000,
3.05053810e-003, 2.36989898e-025, 2.36989898e-025,
1.32797395e-136, 6.91134950e-031, 5.47347807e-001,
1.44637007e+000, 1.44637007e+000, 1.44637007e+000,
1.44637007e+000, 1.44637007e+000, 1.44637007e+000,
1.44637007e+000, 1.44637007e+000, 1.44637007e+000,
1.44637007e+000, 1.44637007e+000, 1.44637007e+000,
1.44637007e+000, 2.25849208e-064, 0.00000000e+000,
1.61228562e-303, 0.00000000e+000, 0.00000000e+000,
0.00000000e+000, 0.00000000e+000, 3.94387272e-242,
1.13078186e+000, 2.53108878e-001, 5.33548114e-001,
9.14920432e-001, 2.07015697e-013, 4.45250680e-038,
1.43000602e+000, 1.28781615e+000, 1.44821615e+000,
1.18186109e+000, 3.21610659e-002, 3.21610659e-002,
3.21610659e-002, 3.21610659e-002, 3.21610659e-002,
2.47382844e-039, 0.00000000e+000, 2.09150855e-200,
0.00000000e+000, 0.00000000e+000]])
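I believe this is wrong, because the output should be two vectors, one holding the means values and the other the variances values. What makes me doubt the implementation is that it returns 0.00000000e+000 for most of the output, as can be seen above, so there is no real need to visualize it. By the way, the input data is time-series data. I have checked everything and traced through the code several times, but no bug shows up.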
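One plausible source of exact zeros is floating-point underflow: when a point lies far from a component's mean relative to that component's variance, `np.exp` of the large negative exponent rounds to 0 in float64. A common remedy is to compute the responsibilities in log space. The sketch below assumes a hypothetical helper name, `log_responsibilities`, and is not taken from the linked post.

```python
import numpy as np


def log_responsibilities(x, means, variances, weights):
    """Hypothetical helper: log of the (k, n) responsibility matrix."""
    x = np.asarray(x, dtype=float)[np.newaxis, :]                   # (1, n)
    means = np.asarray(means, dtype=float)[:, np.newaxis]           # (k, 1)
    variances = np.asarray(variances, dtype=float)[:, np.newaxis]   # (k, 1)
    weights = np.asarray(weights, dtype=float)[:, np.newaxis]       # (k, 1)

    # log(weight_j * N(x_i | mean_j, var_j)), evaluated without calling exp
    log_joint = (
        np.log(weights)
        - 0.5 * np.log(2.0 * np.pi * variances)
        - 0.5 * np.square(x - means) / variances
    )                                                               # (k, n)

    # log-sum-exp over components: subtract the column maximum first so
    # that the largest term becomes exp(0) = 1 and cannot underflow
    top = log_joint.max(axis=0, keepdims=True)
    log_norm = top + np.log(np.exp(log_joint - top).sum(axis=0, keepdims=True))
    return log_joint - log_norm
```

Exponentiating the result gives responsibilities whose columns sum to 1, even when every raw density would have underflowed.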
Here is my input data:
[25.31 , 24.31 , 24.12 , 43.46 , 41.48666667,
41.48666667, 37.54 , 41.175 , 44.81 , 44.44571429,
44.44571429, 44.44571429, 44.44571429, 44.44571429, 44.44571429,
44.44571429, 44.44571429, 44.44571429, 44.44571429, 44.44571429,
44.44571429, 44.44571429, 39.71 , 26.69 , 34.15 ,
24.94 , 24.75 , 24.56 , 24.38 , 35.25 ,
44.62 , 44.94 , 44.815 , 44.69 , 42.31 ,
40.81 , 44.38 , 44.56 , 44.44 , 44.25 ,
43.66666667, 43.66666667, 43.66666667, 43.66666667, 43.66666667,
40.75 , 32.31 , 36.08 , 30.135 , 24.19 ]
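I would like to know whether there is an elegant way to implement this with numpy or SciKit-learn. Any help would be appreciated.
Update: below are the current output and the expected output:
【Comments】: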
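On the scikit-learn side, `sklearn.mixture.GaussianMixture` fits a GMM with EM directly. The sketch below uses the series from the question. The choices `n_components=3`, `covariance_type="full"`, and `random_state=0` are assumptions for illustration, not settings from the post.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# The 50-point series from the question
data = np.array([
    25.31, 24.31, 24.12, 43.46, 41.48666667,
    41.48666667, 37.54, 41.175, 44.81, 44.44571429,
    44.44571429, 44.44571429, 44.44571429, 44.44571429, 44.44571429,
    44.44571429, 44.44571429, 44.44571429, 44.44571429, 44.44571429,
    44.44571429, 44.44571429, 39.71, 26.69, 34.15,
    24.94, 24.75, 24.56, 24.38, 35.25,
    44.62, 44.94, 44.815, 44.69, 42.31,
    40.81, 44.38, 44.56, 44.44, 44.25,
    43.66666667, 43.66666667, 43.66666667, 43.66666667, 43.66666667,
    40.75, 32.31, 36.08, 30.135, 24.19,
])

# scikit-learn expects shape (n_samples, n_features)
X = data.reshape(-1, 1)

gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=0)
gmm.fit(X)

means = gmm.means_.ravel()              # one mean per component, shape (3,)
variances = gmm.covariances_.ravel()    # (3, 1, 1) flattened to shape (3,)
weights = gmm.weights_                  # mixing proportions, shape (3,)
labels = gmm.predict(X)                 # most likely component for each point
```

Because many points in this series are identical, a component can become very narrow. The estimator's `reg_covar` parameter adds a small value to each variance so it stays positive.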
-
Could you share why you think this is wrong? A visualization might help, but even without one, some explanation seems to be missing.
-
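@dia The output should be two vectors, one holding the means values and the other the variances values. What makes me doubt the implementation is that it returns 0.00000000e+000 for most of the output, as can be seen, so there is no real need to visualize it. By the way, the input data is time-series data.
-
It can be visualized. I have a linked post about the means. Why don't you refer to it?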
-
@dia Do you mean this post or GMM/EM on time series cluster? Do you have any suggestion or solution for getting the right results, or for interpreting my current ones?
-
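At first, when I got comments like these, I thought they were inappropriate, but now I seem to understand them. Let me rephrase: you have to make it easier for others to understand and solve your problem, otherwise you will have to solve it yourself.
Tags: python numpy machine-learning scikit-learn gmm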