Memory-efficient storage of large distance matrices: answers

【Question title】: Memory-efficient storage of large distance matrices
【Posted】: 2018-05-04 00:14:17
【Question description】:

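I have to create a data structure that stores the distance from each point to every other point in a very large array of 2D coordinates. This is easy to implement for small arrays, but beyond about 50,000 points I start running into memory issues. That is not surprising, given that I am creating an n x n matrix.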
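For scale, here is a rough back-of-the-envelope estimate of the dense matrix size. It assumes float64 entries (8 bytes each, which is what cdist returns); the helper name dense_matrix_gib is made up for illustration.

```python
def dense_matrix_gib(n, bytes_per_entry=8):
    """Approximate size in GiB of a dense n x n matrix."""
    return n * n * bytes_per_entry / 2**30

# Memory grows quadratically with the number of points.
for n in (2000, 50000, 200000, 500000):
    print(f"{n:>7} points -> {dense_matrix_gib(n):10.2f} GiB")
```

At 50,000 points this already comes to roughly 18.6 GiB, which is consistent with the memory errors described above.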

Here is a simple example that works fine:

import numpy as np
from scipy.spatial import distance 

n = 2000
arr = np.random.rand(n,2)
d = distance.cdist(arr,arr)

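cdist is fast, but it is inefficient in storage because the matrix is mirrored across the diagonal (i.e. d[i][j] == d[j][i]). I can use np.triu(d) to convert it to an upper triangular matrix, but the resulting square matrix still takes up the same amount of memory. I also don't need distances beyond a certain cutoff, which should help. The next step is to convert to a sparse matrix to save memory: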

from scipy import sparse

max_dist = 5
dist = np.array([[0,1,3,6], [1,0,8,7], [3,8,0,4], [6,7,4,0]])
print(dist)

array([[0, 1, 3, 6],
       [1, 0, 8, 7],
       [3, 8, 0, 4],
       [6, 7, 4, 0]])

# zero out distances at or beyond the cutoff, then keep only the upper triangle
dist[dist>=max_dist] = 0
dist = np.triu(dist)
print(dist)

array([[0, 1, 3, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 4],
       [0, 0, 0, 0]])

# the sparse matrix stores only the nonzero entries
sdist = sparse.lil_matrix(dist)
print(sdist)

(0, 1)        1
(2, 3)        4
(0, 2)        3

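For a very large dataset, the problem is getting to that sparse matrix quickly. To reiterate, building a square matrix with cdist is the fastest way I know of to compute the distances between points, but the intermediate square matrix runs out of memory. I could break it down into more manageable blocks of rows, but that slows things down considerably. I feel like I am missing some obvious, simple way to go directly from cdist to a sparse matrix.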
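The row-block idea mentioned above could look roughly like the sketch below. It is not the asker's code: the function name sparse_dist_blocks and the block_size parameter are assumptions made for illustration. Each intermediate dense block holds at most block_size x n float64 values, so block_size would need to be chosen to fit the available memory.

```python
import numpy as np
from scipy import sparse
from scipy.spatial import distance

def sparse_dist_blocks(points, max_dist, block_size=1000):
    """Upper-triangle distances below max_dist, built one row block at a time."""
    n = len(points)
    rows, cols, vals = [], [], []
    for start in range(0, n, block_size):
        stop = min(start + block_size, n)
        # Only columns >= start can hold upper-triangle entries for these rows.
        block = distance.cdist(points[start:stop], points[start:])
        i, j = np.nonzero(block < max_dist)
        # Shift block-local indices to global ones and keep strictly i < j.
        gi, gj = i + start, j + start
        keep = gi < gj
        rows.append(gi[keep])
        cols.append(gj[keep])
        vals.append(block[i[keep], j[keep]])
    return sparse.coo_matrix(
        (np.concatenate(vals), (np.concatenate(rows), np.concatenate(cols))),
        shape=(n, n),
    )

rng = np.random.default_rng(0)
pts = rng.random((3000, 2))          # mock data
sdist = sparse_dist_blocks(pts, max_dist=0.05)
print(sdist.shape, sdist.nnz)
```

This keeps cdist as the workhorse but never materialises the full n x n matrix; the Python-level loop runs once per block rather than once per point.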

【Question comments】:

Well, as a first step: docs.scipy.org/doc/scipy/reference/generated/…
Also, docs.scipy.org/doc/scipy/reference/generated/…

Tags: python numpy

【Solution 1】:

Here is an approach using KDTree:

>>> import numpy as np
>>> from scipy import sparse
>>> from scipy.spatial import cKDTree as KDTree
>>>
>>> # mock data
>>> a = np.random.random((50000, 2))
>>>
>>> # make tree
>>> A = KDTree(a)
>>>
>>> # list all pairs within 0.05 of each other in 2-norm
>>> # format: (i, j, v) - i, j are indices, v is distance
>>> D = A.sparse_distance_matrix(A, 0.05, p=2.0, output_type='ndarray')
>>>
>>> # only keep upper triangle
>>> DU = D[D['i'] < D['j']]
>>>
>>> # make sparse matrix
>>> result = sparse.coo_matrix((DU['v'], (DU['i'], DU['j'])), (50000, 50000))
>>> result
<50000x50000 sparse matrix of type '<class 'numpy.float64'>'
        with 9412560 stored elements in COOrdinate format>
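As a possible follow-up, the sketch below shows one way to query such a result. It is an illustration rather than part of the original answer: it uses smaller mock data, and the helper name neighbours is made up. COO format is convenient for construction but not for indexing, so the matrix is converted to CSR, and the transpose is added back so that each row lists all neighbours of a point, not only those with a larger index. Note that this doubles the number of stored entries.

```python
import numpy as np
from scipy import sparse
from scipy.spatial import cKDTree as KDTree

rng = np.random.default_rng(0)
a = rng.random((5000, 2))            # smaller mock data than in the answer
tree = KDTree(a)
D = tree.sparse_distance_matrix(tree, 0.05, p=2.0, output_type='ndarray')
DU = D[D['i'] < D['j']]
upper = sparse.coo_matrix((DU['v'], (DU['i'], DU['j'])), shape=(5000, 5000))

# Restore symmetry and switch to CSR for fast row access.
sym = (upper + upper.T).tocsr()

def neighbours(k):
    """Indices and distances of all stored neighbours of point k."""
    lo, hi = sym.indptr[k], sym.indptr[k + 1]
    return sym.indices[lo:hi], sym.data[lo:hi]

idx, dist = neighbours(0)
print(len(idx), "neighbours within 0.05 of point 0")
```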

【Comments】:

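Since I had to look it up: output_type='ndarray' does not store explicit zeros for pairs that are too far apart, so this should indeed reduce the peak memory requirement.
@Paul: Thanks, cKDTree is a great idea; it helped extend my maximum matrix size considerably. Unfortunately, my machine still hits memory errors above 200,000 points, and my real data files often exceed 500,000 points. My working solution is to look up each node's distances on demand with distance.cdist(node, nodes), which avoids running out of memory but is slow because it sits inside a for loop. I think your solution would be faster on a machine with enough memory.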
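For the on-demand lookup described in the last comment, one alternative to calling cdist against all nodes is to ask the tree for only the points inside the cutoff radius. The sketch below is not from the thread: the cutoff value, the mock data, and the helper name neighbours_within are assumptions. It relies on cKDTree.query_ball_point, which returns the indices of all points within a given radius of a query point.

```python
import numpy as np
from scipy.spatial import cKDTree as KDTree

rng = np.random.default_rng(0)
nodes = rng.random((100000, 2))      # mock data
tree = KDTree(nodes)                 # built once, reused for every lookup
cutoff = 0.005                       # assumed cutoff distance

def neighbours_within(k, r=cutoff):
    """Indices and distances of the points within r of node k (excluding k)."""
    idx = np.asarray(tree.query_ball_point(nodes[k], r), dtype=np.intp)
    idx = idx[idx != k]
    # Distances are computed only for the few points inside the radius.
    dist = np.linalg.norm(nodes[idx] - nodes[k], axis=1)
    return idx, dist

idx, dist = neighbours_within(0)
print(len(idx), "neighbours of node 0 within", cutoff)
```

Memory use stays proportional to the number of points plus the size of one neighbourhood, since no distance matrix is ever stored.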