Memory management: building sparse matrix by efficiently iterating through other sparse matrix

【Question title】: Memory management: building sparse matrix by efficiently iterating through other sparse matrix
【Posted】: 2017-12-04 06:15:14
【Question】:

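I'm trying to build a pointwise mutual information (PMI) matrix. I have a 60k x 60k scipy sparse matrix of word co-occurrences, and I want to convert it into another sparse matrix in which entry i,j is log( p(i,j) / (p(i)*p(j)) ) for words i and j. I set negative values to zero to get the PPMI matrix. I'm looking for an efficient way to iterate over the first matrix to produce the second without using too much memory.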

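I tried two approaches: iterating over a copy of the first matrix and modifying it in place, and building a new CSR matrix row by row, using vstack on 2 sparse matrices to append each new row. Both processes were killed because of memory errors. What is the best way to build this matrix and then save it for later reuse?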

import time
import numpy as np
from scipy import sparse
from scipy.sparse import vstack

# The function name and signature are assumed; the post showed only the body.
def build_pmi(ctxt_matrix, word_probas, total_words, inplace=True, cutoff_0=True):
    start = time.time()
    if inplace:
        for i in range(ctxt_matrix.shape[0]):  # row-wise operation
            # for each row (word vector), reweigh this in 3 steps:
            # 1. get the probability of this context, instead of the raw count (divide by total words)
            # 2. divide this probability by the probability of this row/context occurring together
            #    randomly (multiply the probability of word i by that of every other word,
            #    then do element-wise division)
            # 3. take the log of this division, and reassign the row to this.
            row_pmi = np.log(np.divide((ctxt_matrix[i].toarray().T/total_words), (word_probas*word_probas[i]))).T
            if cutoff_0:
                row_pmi[row_pmi<0] = 0  # 0 cutoff
            ctxt_matrix[i, :] = row_pmi
        print('PMI matrix building took:', time.time()-start)
        return ctxt_matrix

    else:
        # same as above, but on a new matrix, using vstack.
        # note: this empty seed row remains as an all-zero first row of the result
        pmi_matrix = sparse.csr_matrix((1, ctxt_matrix.shape[1]))
        for i in range(ctxt_matrix.shape[0]):  # row-wise operation
            row_pmi = sparse.csr_matrix(np.log(np.divide(((ctxt_matrix[i].toarray().T)/total_words), word_probas*word_probas[i])).T)
            if cutoff_0:
                row_pmi[row_pmi<0] = 0  # 0 cutoff
            # scipy.sparse.vstack, not the top-level scipy.vstack
            pmi_matrix = vstack((pmi_matrix, row_pmi))
            del row_pmi
        print('PMI matrix building took:', time.time()-start)
        return pmi_matrix

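TL;DR - I need to perform a row-wise operation that creates a sparse matrix by iterating over another matrix. Here is some simplified code that shows what I'm doing: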

import time
import numpy as np
from scipy import sparse

start = time.time()
ctxt_matrix = sparse.csr_matrix(sparse.rand(5000, 5000))
for i in range(ctxt_matrix.shape[0]):
    row_pmi = np.log(ctxt_matrix[i,:].toarray().T/500) #some row-wise operation on the other matrix
    row_pmi[row_pmi<0] = 0 # don't store negatives in memory
    ctxt_matrix[i,:] = sparse.csr_matrix(row_pmi).T
# run once on the whole matrix; calling it on ctxt_matrix[i, :] only touches a temporary copy
ctxt_matrix.eliminate_zeros()
print('PMI matrix building took:', time.time()-start)

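【Comments】:

vstack uses bmat, which joins the coo attributes of the blocks to create a new matrix. I would call vstack once on a list of matrices rather than calling it iteratively.
I'm trying to follow what you are doing in the row_pmi calculation. It looks like the result has the same sparsity (nonzero entry locations) as the original; otherwise you would get an efficiency warning. I wonder whether this could be done with the data attribute of the lil format.
It would help if you set up a small test case. It won't test the memory limits, but it will make it easier to test and suggest alternatives.
Unfortunately, I don't think I can use a list of matrices: I think a list of 64k 1x64k matrices would give me a memory error. For row_pmi, word_probas is a 1x64k np array holding each word's probability: word_probas[i] = p(i). For the numerator, for every entry j in row i, ctxt_matrix[i]/total_words = p(i,j).
I added a simplified test case that anyone can use.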

Tags: python memory-management scipy sparse-matrix

【Solution 1】:

I tried a few variations on your code:

import numpy as np
from scipy.sparse import vstack
from scipy import sparse

n, m = 10, 50000
source = sparse.random(n,m, 0.2, format='csr')*5000
print(repr(source))

ctxt_matrix = source.copy()
for i in range(ctxt_matrix.shape[0]):
    print(ctxt_matrix[i,:].nnz, end=' ')
    row_pmi = np.log(ctxt_matrix[i,:].toarray().T/500) #some row-wise operation on the other matrix
    row_pmi[row_pmi<0] = 0 # don't store negatives in memory
    temp = sparse.csr_matrix(row_pmi).T
    print(temp.nnz)
    ctxt_matrix[i,:] = temp
ctxt_matrix.eliminate_zeros()
print(repr(ctxt_matrix))

print('\nrow lil')
ctxt_matrix = source.tolil()
for i in range(ctxt_matrix.shape[0]):
    print(ctxt_matrix[i,:].nnz, end=' ')
    row_pmi = np.log(ctxt_matrix[i,:].toarray().T/500) #some row-wise operation on the other matrix
    row_pmi[row_pmi<0] = 0 # don't store negatives in memory
    temp = sparse.lil_matrix(row_pmi).T
    print(temp.nnz)
    ctxt_matrix[i,:] = temp
print(repr(ctxt_matrix))

print('\nrow lil data')
ctxt_matrix = source.tolil()
for i in range(ctxt_matrix.shape[0]):
    data = np.array(ctxt_matrix.data[i])
    print(len(data))
    data = np.log(data/500) #some row-wise operation on the other matrix
    data[data<0] = 0 # don't store negatives in memory
    ctxt_matrix.data[i][:] = data
#print(repr(ctxt_matrix))
ctxt_matrix = ctxt_matrix.tocsr()
ctxt_matrix.eliminate_zeros()
print(repr(ctxt_matrix))

print('\nwhole csr data')
ctxt_matrix = source.copy()
data = ctxt_matrix.data
data = np.log(data/500)
data[data<0] = 0
ctxt_matrix.data[:] = data
ctxt_matrix.eliminate_zeros()
print(repr(ctxt_matrix))

Results

1407:~/mypy$ python3 stack47615473.py 
<10x50000 sparse matrix of type '<class 'numpy.float64'>'
    with 100000 stored elements in Compressed Sparse Row format>
stack47615473.py:12: RuntimeWarning: divide by zero encountered in log
  row_pmi = np.log(ctxt_matrix[i,:].toarray().T/500) #some row-wise operation on the other matrix
10069 9081
9931 8943
10159 9134
10069 9043
9940 8924
9961 9009
9941 8939
9935 8923
9943 8983
10052 9072
<10x50000 sparse matrix of type '<class 'numpy.float64'>'
    with 90051 stored elements in Compressed Sparse Row format>

row lil
stack47615473.py:24: RuntimeWarning: divide by zero encountered in log
  row_pmi = np.log(ctxt_matrix[i,:].toarray().T/500) #some row-wise operation on the other matrix
10069 9081
9931 8943
10159 9134
10069 9043
9940 8924
9961 9009
9941 8939
9935 8923
9943 8983
10052 9072
<10x50000 sparse matrix of type '<class 'numpy.float64'>'
    with 90051 stored elements in LInked List format>

row lil data
10069
9931
10159
10069
9940
9961
9941
9935
9943
10052
<10x50000 sparse matrix of type '<class 'numpy.float64'>'
    with 90051 stored elements in Compressed Sparse Row format>

whole csr data
<10x50000 sparse matrix of type '<class 'numpy.float64'>'
    with 90051 stored elements in Compressed Sparse Row format>

Row iteration on the lil matrix is slower than on the csr matrix.

The lil and csr operations that work directly on data are nearly instantaneous.

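There is also a way to iterate directly over the data of the csr format. It requires indexing data with values taken from the indptr attribute. This has been discussed in earlier SO questions (you can look one up).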
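A minimal sketch of that indptr-based iteration, assuming the same toy transform as in the variations above (log(x/500), with negatives set to zero). The helper name `transform_rows_inplace` is made up for illustration.

```python
import numpy as np
from scipy import sparse

def transform_rows_inplace(mat, scale=500.0):
    """Apply log(x/scale), clipped at 0, to the stored values of each CSR row."""
    for i in range(mat.shape[0]):
        # indptr[i]:indptr[i+1] is the slice of mat.data (and mat.indices) for row i
        start, stop = mat.indptr[i], mat.indptr[i + 1]
        row = np.log(mat.data[start:stop] / scale)
        row[row < 0] = 0
        mat.data[start:stop] = row
    mat.eliminate_zeros()  # drop the entries that were clipped to zero
    return mat

source = sparse.random(10, 1000, density=0.2, format='csr', random_state=0) * 5000
result = transform_rows_inplace(source.copy())
print(repr(result))
```

No per-row csr matrix and no dense row is created here; each iteration only touches a slice of the existing data array.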

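csr row iteration is somewhat slow because it has to construct a new csr matrix each time. The toarray step is also somewhat slow. If you can operate on only the nonzero data values of a row, or of the whole matrix, it will be faster.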
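Applied to the PPMI problem from the question, working only on the nonzero values might look like the sketch below. The function name `ppmi`, the toy counts, and the estimate of p(i) from row sums are assumptions for illustration, not the asker's exact definitions.

```python
import numpy as np
from scipy import sparse

def ppmi(counts, word_probas, total_words):
    """PPMI computed only on the stored (nonzero) co-occurrence counts."""
    coo = counts.tocoo()
    p_ij = coo.data / total_words
    # p(i) * p(j) looked up per stored entry; no dense row is ever built
    expected = word_probas[coo.row] * word_probas[coo.col]
    pmi = np.log(p_ij / expected)
    keep = pmi > 0  # PPMI: drop non-positive values instead of storing zeros
    return sparse.csr_matrix(
        (pmi[keep], (coo.row[keep], coo.col[keep])), shape=counts.shape
    )

# toy 3-word example with made-up counts
counts = sparse.csr_matrix(np.array([[2, 4, 0],
                                     [4, 0, 1],
                                     [0, 1, 0]], dtype=float))
total_words = counts.sum()
word_probas = np.asarray(counts.sum(axis=1)).ravel() / total_words
print(ppmi(counts, word_probas, total_words).toarray())
```

The temporary arrays all have length nnz, so peak memory scales with the number of stored entries rather than with the 60k x 60k shape.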

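This does not address the high memory usage. I would expect in-place changes to the matrix to use less memory, and repeated vstack calls to use a lot. I wonder: is the matrix so large that merely constructing a copy of it produces a memory error?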
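If a separate result matrix is needed, one way to avoid the repeated vstack is to collect the sparse rows in a list and stack them once at the end. A sketch with the same toy transform, assuming the transformed rows are sparse enough for the list to fit in memory:

```python
import numpy as np
from scipy import sparse

source = sparse.random(10, 1000, density=0.2, format='csr', random_state=0) * 5000

rows = []
for i in range(source.shape[0]):
    row = source[i, :].copy()          # 1 x m CSR row
    row.data = np.log(row.data / 500)  # transform the stored values only
    row.data[row.data < 0] = 0
    row.eliminate_zeros()
    rows.append(row)

# a single vstack call, instead of re-copying the growing matrix on every iteration
pmi_matrix = sparse.vstack(rows, format='csr')
print(repr(pmi_matrix))
```

Whether this fits in memory at 60k rows depends on how many nonzeros survive the cutoff; the list holds roughly the same data as the final matrix, plus per-row object overhead.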

【Discussion】:

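For future readers: I ended up using some of the code above, but the most efficient approach was to iterate over the individual elements of the matrix rather than row by row, using a,b,c in zip(matrix.row, matrix.col, matrix.data).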
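The row, col and data attributes mentioned here presumably belong to the COO format. A minimal sketch of such an element-wise loop, with made-up counts and probabilities estimated from row sums (both assumptions for illustration):

```python
import math
import numpy as np
from scipy import sparse

counts = sparse.coo_matrix(np.array([[2, 4, 0],
                                     [4, 0, 1],
                                     [0, 1, 0]], dtype=float))
total_words = counts.sum()
word_probas = np.asarray(counts.sum(axis=1)).ravel() / total_words

rows, cols, vals = [], [], []
for i, j, c in zip(counts.row, counts.col, counts.data):
    pmi = math.log((c / total_words) / (word_probas[i] * word_probas[j]))
    if pmi > 0:  # keep only positive PMI values
        rows.append(i)
        cols.append(j)
        vals.append(pmi)

ppmi_matrix = sparse.csr_matrix((vals, (rows, cols)), shape=counts.shape)
print(ppmi_matrix.toarray())
```

The same computation can also be written as NumPy array operations on counts.data, which avoids the Python-level loop.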