Which format of scipy.sparse is best for this type of matrix generation and use?

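【Question title】: Which format of scipy.sparse is best for this type of matrix generation and use?
【Posted】: 2015-09-05 00:35:14
【Question】:

I have a data file that encodes information about the nonzero elements of a large sparse boolean matrix. The matrix has no particular structure, i.e. it is not diagonal, block, etc. Each line of the file determines one element. Right now I use the following loop to populate the matrix: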

import numpy as np
from scipy.sparse import dok_matrix

nRows = 30000
nCols = 600000

data = dok_matrix((nRows,nCols), dtype=np.int8)

with open('input.txt','r') as fraw:
    for line in fraw:
        ## Figure out iRow and iCol to set to 1 from line
        data[iRow,iCol] = 1

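This works, but it is slow. Is there a different type of scipy.sparse matrix that is more optimal?

'Optimal' here means both the generation speed of the matrix and the access speed of row and column blocks of the matrix, e.g. vector operations like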

someRows = data[rowIndex1:rowIndex2, :]
someColumns = data[:, colIndex1:colIndex2]

Does the answer change if memory is more important than speed?

Thanks

【Question discussion】:

Tags: python scipy vectorization sparse-matrix

【Solution 1】:

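For incremental additions like this, dok is as good as it gets. It is really a dictionary that stores each value under a tuple key: (iRow,iCol). So storing and fetching depend on basic Python dictionary efficiency.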

The only other format that is good for incremental additions is lil, which stores the data as 2 lists of lists.
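To make the "2 lists of lists" description concrete, here is a minimal sketch on a tiny matrix. The shape and the entries are made up for illustration. It inspects the `rows` and `data` attributes of a `lil_matrix`, which hold the column indices and the values for each row.

```python
import numpy as np
from scipy.sparse import lil_matrix

# Tiny illustrative matrix; shape and entries are arbitrary.
m = lil_matrix((3, 5), dtype=np.int8)
m[0, 4] = 1
m[0, 1] = 1
m[2, 3] = 1

# m.rows: one list per row with the column indices of its nonzeros (kept sorted).
# m.data: one list per row with the matching values.
row_cols = [list(r) for r in m.rows]
row_vals = [list(d) for d in m.data]

print(row_cols)
print(row_vals)
```

Because each row is a pair of plain Python lists, setting one element only touches that row's lists, which is why this format tolerates incremental updates.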

Another approach is to collect the data in 3 lists and build the matrix once at the end. Start with coo and its (data,(i,j)) input style.
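A rough sketch of the three-list approach follows. The `pairs` list is a hypothetical stand-in for whatever (row, column) values the file parsing produces, and the shape is shrunk for illustration. Converting to CSR for row blocks and CSC for column blocks is my own suggestion for the slicing part of the question, not something stated in the answer.

```python
import numpy as np
from scipy.sparse import coo_matrix

n_rows, n_cols = 5, 8  # small illustrative shape, not the real 30000 x 600000

# Hypothetical parsed (row, col) pairs; note the repeated (0, 1).
pairs = [(0, 1), (2, 7), (4, 3), (2, 0), (0, 1)]

rows, cols, vals = [], [], []
for i_row, i_col in pairs:
    rows.append(i_row)
    cols.append(i_col)
    vals.append(1)

# Build once, using the (data, (i, j)) input style.
coo = coo_matrix((vals, (rows, cols)), shape=(n_rows, n_cols), dtype=np.int8)

# Converting to CSR sums duplicate entries, so reset the stored values to 1
# to keep the matrix boolean-like.
csr = coo.tocsr()
csr.data[:] = 1
csc = csr.tocsc()

some_rows = csr[1:3, :]   # row blocks are cheap to slice in CSR
some_cols = csc[:, 0:2]   # column blocks are cheap to slice in CSC
```

If memory matters more than speed, keeping only one of the two compressed copies is an option, at the cost of slower slicing along the other axis.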

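Dense numpy arrays are loaded from files with genfromtxt or loadtxt. Both read the file line by line, collect the values in a list of lists, and create the array only at the end.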

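What is the speed like if you just read the file and parse the values, without saving anything to the dok? That will give you an idea of how much time is actually spent adding data to the matrix.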
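One way to run that comparison is sketched below. The question does not show the file format, so the one "row col" pair per line layout, the `parse` helper, the in-memory `io.StringIO` input and the reduced matrix shape are all assumptions made for illustration.

```python
import io
import time

import numpy as np
from scipy.sparse import dok_matrix

# Hypothetical input: one "row col" pair per line, generated in memory.
rng = np.random.default_rng(0)
text = "\n".join(
    f"{r} {c}"
    for r, c in zip(rng.integers(0, 300, 5000), rng.integers(0, 6000, 5000))
)

def parse(line):
    """Hypothetical parser: split a 'row col' line into two ints."""
    i_row, i_col = line.split()
    return int(i_row), int(i_col)

# Pass 1: read and parse only.
t0 = time.perf_counter()
parsed = [parse(line) for line in io.StringIO(text)]
t_parse = time.perf_counter() - t0

# Pass 2: read, parse, and assign into the dok.
data = dok_matrix((300, 6000), dtype=np.int8)
t0 = time.perf_counter()
for line in io.StringIO(text):
    i_row, i_col = parse(line)
    data[i_row, i_col] = 1
t_fill = time.perf_counter() - t0

print(f"parse only: {t_parse:.4f}s, parse + dok assignment: {t_fill:.4f}s")
```

The difference between the two timings approximates the cost of the matrix assignments themselves.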

Another possibility is to store the values directly in a plain dictionary, and use that to create the dok.

In [60]: adict=dict()

In [61]: for i in np.random.randint(1000,size=(2000,)):
   ....:     adict[(i,i)]=1
   ....:

In [62]: dd=sparse.dok_matrix((1000,1000),dtype=np.int8)

In [63]: dd.update(adict)

In [64]: dd.A
Out[64]: 
array([[1, 0, 0, ..., 0, 0, 0],
       [0, 1, 0, ..., 0, 0, 0],
       [0, 0, 1, ..., 0, 0, 0],
       ..., 
       [0, 0, 0, ..., 1, 0, 0],
       [0, 0, 0, ..., 0, 1, 0],
       [0, 0, 0, ..., 0, 0, 1]], dtype=int8)

This is a lot faster than updating the dok directly.

In [66]: %%timeit
   ....: for i in np.random.randint(1000,size=(2000,)):
   ....:     adict[(i,i)]=1
   ....: dd.update(adict)
   ....:
1000 loops, best of 3: 1.32 ms per loop

In [67]: %%timeit
   ....: for i in np.random.randint(1000,size=(2000,)):
   ....:     dd[i,i]=1
   ....:
10 loops, best of 3: 35.6 ms per loop

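Updating the dok element by element must carry some overhead that I had not taken into account.

I just realized that I have suggested this update method once before: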

https://stackoverflow.com/a/27771335/901925 Why are lil_matrix and dok_matrix so slow compared to common dict of dicts?

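【Discussion】:

I think, though I have not verified it, that most of the time is spent on the data[iRow,iCol] = 1 assignment. Computing iRow and iCol involves looking up the indices of elements in lists of strings.
I added some notes about putting the values into a plain dictionary and then using update to create the dok.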