Scipy: Sparse indicator matrix from array(s)

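【Question title】: Scipy: Sparse indicator matrix from array(s)
【Posted】: 2019-07-10 07:34:03
【Question】:

What is the most efficient way to compute a sparse boolean matrix I from one or two arrays a, b, with I[i,j]==True where a[i]==b[j]? The following is fast but memory-inefficient: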

I = a[:,None]==b

The following is slow during creation and is still memory-inefficient:

I = csr((a[:,None]==b),shape=(len(a),len(b)))

The following at least yields the rows and cols for a better csr_matrix initialization, but it still creates the full dense matrix and is just as slow:

z = np.argwhere((a[:,None]==b))

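Any ideas?

【Comments】:

I suppose I could go the whole way by argsorting a, detecting the indices where the sorted array changes, and computing the combinations for each partition found this way, then undoing the sort again... but I was hoping there would be a simple numpy or scipy function at hand...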
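
The sort-based idea in the comment above can be sketched roughly as follows. This is only one possible variation and not code from the thread: it sorts b rather than a and uses np.searchsorted to find, for each a[i], the run of equal values in the sorted b. The function name indicator_by_sorting is hypothetical.

```python
import numpy as np
from scipy import sparse

def indicator_by_sorting(a, b):
    """Sparse boolean matrix I with I[i, j] == True where a[i] == b[j]."""
    a = np.asarray(a)
    b = np.asarray(b)
    order_b = np.argsort(b, kind="stable")    # indices that sort b
    b_sorted = b[order_b]
    # For each a[i], b_sorted[lo[i]:hi[i]] holds exactly the values equal to a[i].
    lo = np.searchsorted(b_sorted, a, side="left")
    hi = np.searchsorted(b_sorted, a, side="right")
    counts = hi - lo                           # number of matches per row
    rows = np.repeat(np.arange(len(a)), counts)
    # Offset of each match inside its row's run: 0, 1, ..., counts[i] - 1
    starts = np.cumsum(counts) - counts
    offsets = np.arange(counts.sum()) - np.repeat(starts, counts)
    # Translate positions in the sorted b back to positions in the original b.
    cols = order_b[np.repeat(lo, counts) + offsets]
    data = np.ones(len(rows), dtype=bool)
    return sparse.csr_matrix((data, (rows, cols)), shape=(len(a), len(b)))

a = np.random.randint(0, 10, size=(400,))
b = np.random.randint(0, 10, size=(300,))
I = indicator_by_sorting(a, b)
```

The intermediate arrays here have one entry per True element of the result (plus a few of length len(a) or len(b)), so the full dense len(a) x len(b) matrix is never built.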

Tags: python numpy scipy sparse-matrix indicator

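【Solution 1】:

One way is to first identify all distinct elements that a and b have in common, using sets. This should work well as long as a and b do not contain too many distinct values. One then only has to loop over the distinct values (the variable values below) and use np.argwhere to find the indices in a and b where each value occurs. The 2D indices of the sparse matrix can then be constructed with np.repeat and np.tile: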

import numpy as np
from scipy import sparse

a = np.random.randint(0, 10, size=(400,))
b = np.random.randint(0, 10, size=(300,))

## matrix generation after OP
I1 = sparse.csr_matrix((a[:,None]==b),shape=(len(a),len(b)))

##identifying all values that occur both in a and b:
values = set(np.unique(a)) & set(np.unique(b))

##here we collect the indices in a and b where the respective values are the same:
rows, cols = [], []

##looping over the common values, finding their indices in a and b, and
##generating the 2D indices of the sparse matrix with np.repeat and np.tile
for value in values:
    x = np.argwhere(a==value).ravel()
    y = np.argwhere(b==value).ravel()
    ##every index in x pairs with every index in y, giving len(x)*len(y) entries
    rows.append(np.repeat(x, len(y)))
    cols.append(np.tile(y, len(x)))

##concatenating the indices for different values and generating a 1D vector
##of True values for final matrix generation
rows = np.hstack(rows)
cols = np.hstack(cols)
data = np.ones(len(rows),dtype=bool)

##generating sparse matrix
I3 = sparse.csr_matrix( (data,(rows,cols)), shape=(len(a),len(b)) )

##checking that the matrix was generated correctly:
print((I1 != I3).nnz==0)

The syntax for generating the csr matrix is taken from the documentation. The test for equality of sparse matrices is taken from this post.
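
As a small illustration of how np.repeat and np.tile enumerate every (row, column) pair for a single shared value: each index in a has to be repeated once per matching index in b, and the block of b indices has to be tiled once per matching index in a. The index arrays x and y below are made up for the example.

```python
import numpy as np

x = np.array([0, 3])      # hypothetical positions in a that hold some value
y = np.array([1, 4, 5])   # hypothetical positions in b that hold the same value

rows = np.repeat(x, len(y))   # each row index once per matching column
cols = np.tile(y, len(x))     # the column block once per matching row
pairs = list(zip(rows.tolist(), cols.tolist()))
print(pairs)
# [(0, 1), (0, 4), (0, 5), (3, 1), (3, 4), (3, 5)]
```

Both index arrays end up with len(x) * len(y) entries, which is what the (data, (rows, cols)) form of csr_matrix requires.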

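Old answer:

I don't know about the performance, but at least you can avoid constructing the full dense matrix by using a simple generator expression. Here is some code that uses two 1D arrays of random integers, first generating the sparse matrix the way the OP posted and then using a generator expression to test all pairs of elements for equality: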

import numpy as np
from scipy import sparse

a = np.random.randint(0, 10, size=(400,))
b = np.random.randint(0, 10, size=(300,))

## matrix generation after OP
I1 = sparse.csr_matrix((a[:,None]==b),shape=(len(a),len(b)))

## matrix generation using generator
data, rows, cols = zip(
    *((True, i, j) for i,A in enumerate(a) for j,B in enumerate(b) if A==B)
)
I2 = sparse.csr_matrix((data, (rows, cols)), shape=(len(a), len(b)))

##testing that matrices are equal
## from https://stackoverflow.com/a/30685839/2454357
print((I1 != I2).nnz==0)  ## --> True

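I don't think there is a way around the double loop. Ideally it would be pushed down into numpy, but at least with the generator the loop is somewhat optimized...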
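
One possible way to push the pairing into compiled code, which is not taken from this answer, is to one-hot encode both arrays as sparse matrices over a shared set of value codes and multiply them. The sketch below assumes that approach; the function name indicator_by_onehot is hypothetical.

```python
import numpy as np
from scipy import sparse

def indicator_by_onehot(a, b):
    """Sparse boolean matrix I with I[i, j] == True where a[i] == b[j]."""
    a = np.asarray(a)
    b = np.asarray(b)
    # Map every value to an integer code 0..k-1 that is shared by both arrays.
    values, codes = np.unique(np.concatenate([a, b]), return_inverse=True)
    codes = codes.ravel()
    codes_a, codes_b = codes[:len(a)], codes[len(a):]
    k = len(values)
    # One-hot encodings: exactly one nonzero entry per row.
    A = sparse.csr_matrix(
        (np.ones(len(a), dtype=np.int32), (np.arange(len(a)), codes_a)),
        shape=(len(a), k),
    )
    B = sparse.csr_matrix(
        (np.ones(len(b), dtype=np.int32), (np.arange(len(b)), codes_b)),
        shape=(len(b), k),
    )
    # (A @ B.T)[i, j] is 1 when a[i] and b[j] share a code and is not stored otherwise.
    return (A @ B.T).astype(bool)

a = np.random.randint(0, 10, size=(400,))
b = np.random.randint(0, 10, size=(300,))
I4 = indicator_by_onehot(a, b)
```

Whether this beats the loop over distinct values will depend on the data; it has not been benchmarked here.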

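【Comments】:

Unfortunately, this is far too slow :(
@RadioControlled How big are your matrices?
I tried 15000 x 15000 with 10 distinct values, but of course it mostly depends on the sparsity of the equality matrix, which in turn depends on the distribution of the distinct values.
@RadioControlled Sure, but my suggestion is slow because you have to loop over all i and all j of the 1D arrays. Maybe it could indeed be sped up with the sorting you suggested, but I don't think there is anything built in...
This looks like a good solution! I think the first suggestion can be removed. Thanks!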

【Solution 2】:

You can use numpy.isclose with a small tolerance:

np.isclose(a,b)

Or pandas.DataFrame.eq:

a.eq(b)

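Note that this returns an array of True/False values.

【Comments】:

It seems to me that np.isclose(a[:,None],b) also returns a dense array. The same goes for pandas, which also adds another dependency...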
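
The point made in this comment can be checked quickly. The small arrays below are made up for the example; the broadcast comparison comes back as an ordinary dense ndarray with one element per (i, j) pair.

```python
import numpy as np

a = np.arange(5)
b = np.arange(4)

dense = np.isclose(a[:, None], b)
print(type(dense).__name__, dense.shape, dense.dtype)
# ndarray (5, 4) bool
```

For two arrays of 15000 elements each, as mentioned earlier in the thread, such a result would hold 15000 * 15000 = 225,000,000 booleans, which is about 225 MB at one byte per element.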