Finding the y index of a sparse 2D matrix by its value in Python

【Question title】: Finding the y index of a sparse 2D matrix by its value in Python
【Posted】: 2016-04-01 22:08:20
【Question】:

I have a 2D sparse matrix "unknown_tfidf" of size (1000, 10000), whose type is:

<class 'scipy.sparse.csr.csr_matrix'>

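I need to get the y indices of this matrix where the value is '1'. I am trying the following (not sure whether it is the best way, or even a correct one!), but I get an error: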

y=[row.index(1.0) for index, row in enumerate(unknown_tfidf) if int(1.0) in row]

The error is:

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all().

My question is: how can I get all the y indices of such a matrix where the matrix value is 1?

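【Comments】:

The ValueError is caused by trying to use a boolean array in a scalar if context, i.e. your if <a boolean array> clause.
I tried many forms of that if statement, but I still get the same error! Is there any method, converter, or other effective way to apply this kind of if statement?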

Tags: python matrix scipy sparse-matrix

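【Solution 1】:

A compressed sparse row (CSR) matrix keeps the column indices of its stored entries in its .indices attribute. Applied to the boolean matrix arr == 1, that attribute gives the columns where arr equals 1: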

import numpy as np
import scipy.sparse as sparse
np.random.seed(2016)

arr = np.round(10*sparse.rand(10, 10, density=0.8, format='csr'))
# arr.A
# array([[  5.,   0.,   7.,   7.,   8.,   7.,   0.,   2.,   4.,   2.],
#        [  4.,   0.,   9.,   2.,   4.,   8.,   4.,   2.,   5.,   9.],
#        [  7.,   4.,   4.,   2.,   4.,   0.,   0.,   0.,   6.,   0.],
#        [  8.,   0.,   0.,   7.,   0.,   6.,   5.,   8.,   0.,   3.],
#        [  3.,   5.,   1.,   0.,   0.,   7.,   3.,   8.,   3.,   0.],
#        [  8.,   6.,   7.,   0.,   8.,   2.,   7.,   0.,   1.,   1.],
#        [  4.,   6.,   3.,   1.,   8.,   7.,   8.,   6.,   0.,   2.],
#        [  7.,   7.,   0.,  10.,   6.,   2.,   4.,   2.,   1.,  10.],
#        [ 10.,   0.,   4.,   8.,   1.,   1.,   3.,   1.,   9.,   1.],
#        [  0.,   4.,   0.,   0.,   7.,   2.,  10.,   1.,   9.,   0.]])

condition = (arr == 1)
print(condition.indices)

which prints

[2 8 9 3 8 4 5 7 9 7]
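As a minimal sketch of why reading .indices works, the snippet below uses a small hand-built matrix (the names m, mask, cols and counts are illustrative, not from the answer). It assumes, as current SciPy does for a comparison with a nonzero scalar, that the boolean result stores only its True entries.

```python
import numpy as np
import scipy.sparse as sparse

# Small hand-made matrix so the result is easy to check by eye.
m = sparse.csr_matrix(np.array([[0, 1, 3],
                                [0, 0, 1],
                                [1, 1, 0]]))

mask = (m == 1)                 # sparse boolean matrix, True where m equals 1
cols = mask.indices             # column index of every stored True, row by row
counts = np.diff(mask.indptr)   # number of matches in each row

print(cols.tolist())    # [1, 2, 0, 1]
print(counts.tolist())  # [1, 1, 2]
```

The .indices array alone does not say which row each column belongs to; .indptr carries that information, which is why the counts per row come from its differences.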

The fastest way to find both the row and column indices where arr equals 1 is to convert condition to a COO matrix and then read its row and col attributes:

coo = condition.tocoo()
print(coo.row)
print(coo.col)

which prints

[4 5 5 6 7 8 8 8 8 9]
[2 8 9 3 8 4 5 7 9 7]
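A short sketch of how the two COO arrays line up, again on a small hand-built matrix (m, pairs and unique_cols are illustrative names): entry k of row and entry k of col together locate one match.

```python
import numpy as np
import scipy.sparse as sparse

m = sparse.csr_matrix(np.array([[0, 1, 3],
                                [0, 0, 1],
                                [1, 1, 0]]))

coo = (m == 1).tocoo()

# Pair each row index with its column index: one (row, col) per match.
pairs = list(zip(coo.row.tolist(), coo.col.tolist()))

# If only the distinct column (y) indices matter, deduplicate them.
unique_cols = np.unique(coo.col).tolist()

print(pairs)        # [(0, 1), (1, 2), (2, 0), (2, 1)]
print(unique_cols)  # [0, 1, 2]
```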

【Comments】:

The sparse nonzero method does essentially this (it converts to coo), with an added test in case .data contains explicit 0s.
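The point about explicit zeros can be illustrated roughly as follows. This is a sketch of the idea, not SciPy's actual source; the matrix and the names keep and manual are made up for the example.

```python
import numpy as np
import scipy.sparse as sparse

m = sparse.csr_matrix(np.array([[0, 2, 0],
                                [3, 0, 4]]))
m.data[0] = 0          # the stored 2 becomes an explicit zero; it stays stored

# Manual version: convert to COO, then drop entries whose stored value is 0.
coo = m.tocoo()
keep = coo.data != 0
manual = (coo.row[keep], coo.col[keep])

# Built-in version.
rows, cols = m.nonzero()

print(m.nnz)                         # 3, the explicit zero still counts as stored
print(rows.tolist(), cols.tolist())  # [1, 1] [0, 2]
```

Without the data != 0 filter, the position (0, 1) would be reported even though its value is zero.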

【Solution 2】:

Your list comprehension works with a nested list:

In [100]: xl=[[0,1,3],[0,0,1],[1,1,0]]
In [101]: [row.index(1) for index, row in enumerate(xl) if 1 in row]
Out[101]: [1, 2, 0]

(Note that index returns only the first match in the third row.)

But it does not work with a numpy.array:

In [102]: xa=np.array(xl)
In [103]: [row.index(1) for index, row in enumerate(xa) if 1 in row]
...
AttributeError: 'numpy.ndarray' object has no attribute 'index'

Nor with a sparse matrix:

In [104]: xs=sparse.csr_matrix(xl)
In [105]: xs
Out[105]: 
<3x3 sparse matrix of type '<class 'numpy.int32'>'
    with 5 stored elements in Compressed Sparse Row format>
In [106]: [row.index(1) for index, row in enumerate(xs) if 1 in row]
...
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all().

If I remove the if test, I get a different error, a variant of the dense array error:

In [108]: [row.index(1) for index, row in enumerate(xs)]
...
AttributeError: index not found

Look at what enumerate gives us to work with:

In [109]: [(index,row) for index, row in enumerate(xs)]
Out[109]: 
[(0, <1x3 sparse matrix of type '<class 'numpy.int32'>'
    with 2 stored elements in Compressed Sparse Row format>),
 (1, <1x3 sparse matrix of type '<class 'numpy.int32'>'
    with 1 stored elements in Compressed Sparse Row format>),
 (2, <1x3 sparse matrix of type '<class 'numpy.int32'>'
    with 2 stored elements in Compressed Sparse Row format>)]

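row is another sparse matrix, the same as xs[0] and so on. So the 1 in row and row.index(1) expressions have to work with an array or matrix, or else they raise an error.

We have already seen that neither has an index method. That is a list method; with an array or sparse matrix you have to use something else. Your comprehension includes the if clause because list index raises an error when the item is not found. In that sense, if in and index go together.

in works with an array, but raises a ValueError with a sparse matrix: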

In [114]: 1 in xa[0]
Out[114]: True
In [115]: 1 in xs[0]
....
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all().

More commonly, this ValueError is produced by the equivalent of:

In [117]: if np.array([True, False, True]):'yes'
...
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

That is, by giving if a boolean array. In your case the failure occurs inside the sparse code. In effect, in has not been implemented for sparse matrices.
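A minimal sketch of the failure and of the usual way around it: reduce the boolean array to a single bool before testing it. The sparse variant, which counts stored True entries through .nnz, is one possible stand-in for the missing in test; the variable names are illustrative.

```python
import numpy as np
import scipy.sparse as sparse

xa = np.array([0, 1, 3])

# Using the elementwise comparison directly in `if` raises the ValueError.
try:
    if xa == 1:
        pass
    raised = False
except ValueError:
    raised = True

# Reducing the boolean array to one bool makes the test well defined.
has_one_dense = bool((xa == 1).any())

# For sparse rows, counting the stored True entries plays the same role.
xs = sparse.csr_matrix([[0, 1, 3], [0, 0, 2], [1, 1, 0]])
has_one_sparse = [(xs[i] == 1).nnz > 0 for i in range(xs.shape[0])]

print(raised, has_one_dense, has_one_sparse)
```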

So if you insist on this kind of list comprehension, you have to convert the sparse matrix to a list of lists:

In [120]: [row.index(1) for index, row in enumerate(xs.toarray().tolist()) if 1 in row]
Out[120]: [1, 2, 0]

Here is a variation on unutbu's answer:

Use the matrix/array equality test to find all the matching elements:

In [121]: xs==1
Out[121]: 
<3x3 sparse matrix of type '<class 'numpy.bool_'>'
    with 4 stored elements in Compressed Sparse Row format>
In [122]: (xs==1).A
Out[122]: 
array([[False,  True, False],
       [False, False,  True],
       [ True,  True, False]], dtype=bool)

Then use the built-in nonzero method to get the indices of those True elements:

In [123]: (xs==1).nonzero()
Out[123]: (array([0, 1, 2, 2], dtype=int32), array([1, 2, 0, 1], dtype=int32))

The second element of that tuple is the list you want (the third row has 2 values).
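If only the first match per row is wanted, which is what list index returns, one possible sketch reads it straight from the CSR arrays without building a dense list. The names starts, has_match and first_cols are illustrative, and the approach assumes the comparison result stores only True entries.

```python
import numpy as np
import scipy.sparse as sparse

xs = sparse.csr_matrix([[0, 1, 3], [0, 0, 1], [1, 1, 0]])

mask = (xs == 1).tocsr()
mask.sort_indices()                      # make sure columns are ascending per row

starts = mask.indptr[:-1]                # where each row begins in mask.indices
has_match = np.diff(mask.indptr) > 0     # rows that contain at least one 1

# First stored column of every row that has a match; other rows are skipped.
first_cols = mask.indices[starts[has_match]].tolist()

print(first_cols)  # [1, 2, 0]
```

Rows with no match are left out, mirroring the if 1 in row filter in the original comprehension.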

Or collect the values row by row (remember that each row is a matrix when you iterate):

In [125]: [i.nonzero() for i in (xs==1)]
Out[125]: 
[(array([0], dtype=int32), array([1], dtype=int32)),
 (array([0], dtype=int32), array([2], dtype=int32)),
 (array([0, 0], dtype=int32), array([0, 1], dtype=int32))]

Reducing that list to a simple list of indices takes a bit more fiddling:

In [131]: [i.nonzero()[1].tolist() for i in (xs==1)]
Out[131]: [[1], [2], [0, 1]]

【Comments】:

Wow! Thanks a lot! Now I understand, and of course both solutions work fine in my case.