【Posted】: 2018-07-16 23:46:32
【Question】:
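I am using NumPy and trying to compute the mean and covariance of a large matrix (300000 x 70000). I have 32 GB of memory available. What is the best practice for this task in terms of computational efficiency and ease of implementation?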
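For scale, here is a rough size check. It assumes float64 elements (8 bytes each); the actual dtype of the matrix is not stated, so treat the numbers as an estimate under that assumption.

```python
# Rough memory footprint, assuming float64 elements (an assumption; the
# actual dtype of the matrix is not stated in the question).
rows, cols = 300000, 70000
bytes_per_elem = 8

full_matrix_gb = rows * cols * bytes_per_elem / 1e9  # the data itself
cov_matrix_gb = cols * cols * bytes_per_elem / 1e9   # dense cols x cols covariance

print(full_matrix_gb)  # 168.0
print(cov_matrix_gb)   # 39.2
```

Under that assumption neither the data nor a dense full covariance matrix fits in 32 GB, so the data has to be read in chunks, and a full covariance would have to be produced block by block or stored in a smaller dtype.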
My current implementation is as follows:
import numpy as np

def compute_mean_variance(mat, chunk_size):
    row_count = mat.row_count
    col_count = mat.col_count
    # maintain the `x_sum`, `x2_sum` arrays
    # mean(x) = x_sum / row_count
    # var(x) = x2_sum / row_count - mean(x)**2
    x_sum = np.zeros([1, col_count])
    x2_sum = np.zeros([1, col_count])
    for i in range(0, row_count, chunk_size):
        sub_mat = mat[i:i+chunk_size, :]
        # in-memory sub_mat of size chunk_size x num_cols
        sub_mat = sub_mat.read().val
        x_sum += np.sum(sub_mat, 0)
        # accumulate the sum of squares once per chunk
        x2_sum += np.sum(sub_mat**2, 0)
    x_mean = x_sum / row_count
    x_var = x2_sum / row_count - x_mean ** 2
    return x_mean, x_var
Any suggestions for improvement?
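One possible direction, offered as a sketch rather than a definitive answer: the formula var(x) = E[x^2] - mean(x)^2 can lose precision when the mean is large relative to the spread. A commonly used alternative keeps a running mean and a running sum of squared deviations per column and merges each row chunk into them. The helper names `merge_stats` and `chunked_mean_var` are hypothetical, and plain NumPy arrays stand in for the `mat[...].read().val` chunks.

```python
import numpy as np

def merge_stats(n_a, mean_a, m2_a, chunk):
    """Fold one in-memory chunk (rows x cols) into running column statistics.

    n_a    : number of rows seen so far
    mean_a : running column means
    m2_a   : running sum of squared deviations from the mean, per column
    """
    n_b = chunk.shape[0]
    mean_b = chunk.mean(axis=0)
    m2_b = ((chunk - mean_b) ** 2).sum(axis=0)

    n = n_a + n_b
    w = n_b / float(n)            # weight of the new chunk
    delta = mean_b - mean_a
    mean = mean_a + delta * w
    m2 = m2_a + m2_b + delta ** 2 * (n_a * w)
    return n, mean, m2

def chunked_mean_var(chunks, col_count):
    """Column means and population variances over an iterable of row chunks."""
    n, mean, m2 = 0, np.zeros(col_count), np.zeros(col_count)
    for chunk in chunks:
        chunk = np.asarray(chunk, dtype=np.float64)
        n, mean, m2 = merge_stats(n, mean, m2, chunk)
    return mean, m2 / n  # divide by n, matching np.var's default (ddof=0)

# Usage on synthetic data with a large offset relative to its spread.
rng = np.random.default_rng(0)
data = rng.normal(loc=1e6, scale=1.0, size=(1000, 5))
row_chunks = (data[i:i + 128] for i in range(0, len(data), 128))
mean, var = chunked_mean_var(row_chunks, data.shape[1])
```

Only one chunk is held in memory at a time, so `chunk_size` can be chosen to fit the available RAM.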
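I found that the implementation below is easier to understand. It also uses NumPy to compute the mean and standard deviation of each column chunk, so it should be more efficient and numerically stable.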
def compute_mean_std(mat, chunk_size):
    row_count = mat.row_count
    col_count = mat.col_count
    mean = np.zeros(col_count)
    std = np.zeros(col_count)
    for i in range(0, col_count, chunk_size):
        sub_mat = mat[:, i : i + chunk_size]
        # num_samples x chunk_size
        sub_mat = sub_mat.read().val
        mean[i : i + chunk_size] = np.mean(sub_mat, axis=0)
        std[i : i + chunk_size] = np.std(sub_mat, axis=0)
    return mean, std
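The column-chunk approach above could in principle be extended to the covariance as well, one block at a time. The following is a minimal sketch under the assumption that two column chunks fit in memory together; `cov_block` is a hypothetical helper, and plain NumPy arrays stand in for the chunks read via `mat[...].read().val`.

```python
import numpy as np

def cov_block(cols_a, cols_b):
    """Covariance between the columns of cols_a and the columns of cols_b.

    Both inputs are in-memory arrays of shape (num_samples, chunk_size),
    i.e. two column chunks covering all rows.
    Returns an array of shape (cols_a.shape[1], cols_b.shape[1]).
    """
    n = cols_a.shape[0]
    a = cols_a - cols_a.mean(axis=0)  # center each column
    b = cols_b - cols_b.mean(axis=0)
    return a.T.dot(b) / (n - 1.0)     # sample covariance, like np.cov's default

# Usage on a small synthetic matrix: one off-diagonal block of the covariance.
rng = np.random.default_rng(0)
data = rng.normal(size=(50, 6))
block = cov_block(data[:, :2], data[:, 2:6])
```

Each call yields one block of the full covariance matrix, so blocks can be written to disk as they are produced instead of holding the whole matrix in memory.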
【Comments】:
- Is this Python 2 or Python 3?
Tags: python numpy matrix linear-algebra