Create dense matrix from sparse matrix efficiently (numpy/scipy but NO sklearn): answers

【Question title】: Create dense matrix from sparse matrix efficiently (numpy/scipy but NO sklearn)
【Posted】: 2017-11-14 04:59:08
【Question】:

I have a sparse.txt that looks like this:

# first column is label 0 or 1
# rest of the data is sparse data
# maximum value in the data is 4, so the future dense matrix will
# have 1+4 = 5 elements in a row
# file: sparse.txt
1 1:1 2:1 3:1
0 1:1 4:1
1 2:1 3:1 4:1

The required dense.txt looks like this:

# required file: dense.txt
1 1 1 1 0
0 1 0 0 1
1 0 1 1 1

Without using scipy coo_matrix, it can be done in a simple way like this:

def create_dense(fsparse, fdense,fvocab):
    # number of lines in vocab
    lvocab = sum(1 for line in open(fvocab))

    # create dense file
    with open(fsparse) as fi, open(fdense,'w') as fo:
        for i, line in enumerate(fi):
            words = line.strip('\n').split(':')
            words = " ".join(words).split()

            label = int(words[0])
            indices = [int(w) for (i,w) in enumerate(words) if int(i)%2]

            row = [0]* (lvocab+1)
            row[0] = label

            # use listcomps
            row = [ 1 if i in indices else row[i] for i in range(len(row))]

            l = " ".join(map(str,row)) + "\n"
            fo.write(l)

            print('Writing dense matrix line: ', i+1)

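Question: How can we get the labels and the data directly from the sparse data, without first creating the dense matrix, and preferably using NumPy/SciPy?

Question: How can we read the sparse data using numpy.fromregex?

My attempt is: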

def read_file(fsparse):
    regex = r'([0-1]\s)([0-9]):(1\s)*([0-9]:1)' + r'\s*\n'
    data = np.fromregex(fsparse,regex,dtype=str)

    print(data,file=open('dense.txt','w'))

It does not work!

【Comments on the question】:

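How about collecting row in a list? That would be a list of lists (of numbers), right? Can't you make an array directly from that?
@hpaulj, I can make the label array, but I am having trouble making the matrix.
@hpaulj, I can also read the labels and data from the text file with numpy.loadtxt, but I am looking for a way that uses scipy coo_matrix, numpy fromregex, and so on.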

Tags: python numpy scipy sparse-matrix

【Solution 1】:

Adjusting your code to create the dense array directly, rather than going through a file:

import numpy as np
from scipy import sparse

fsparse = 'stack47266965.txt'

def create_dense(fsparse, lvocab):
    # the unused fdense argument is dropped; it was undefined at the call site
    alist = []
    with open(fsparse) as fi:
        for line in fi:
            words = line.strip('\n').split(':')
            words = " ".join(words).split()

            label = int(words[0])
            # column indices sit at the odd positions of words
            indices = [int(w) for (k, w) in enumerate(words) if k % 2]

            row = [0] * (lvocab + 1)
            row[0] = label

            # use listcomps
            row = [1 if k in indices else row[k] for k in range(len(row))]
            alist.append(row)
    return alist

alist = create_dense(fsparse, 4)
print(alist)
arr = np.array(alist)
M = sparse.coo_matrix(arr)
print(M)
print(M.A)

This produces:

0926:~/mypy$ python3 stack47266965.py 
[[1, 1, 1, 1, 0], [0, 1, 0, 0, 1], [1, 0, 1, 1, 1]]
  (0, 0)    1
  (0, 1)    1
  (0, 2)    1
  (0, 3)    1
  (1, 1)    1
  (1, 4)    1
  (2, 0)    1
  (2, 2)    1
  (2, 3)    1
  (2, 4)    1
[[1 1 1 1 0]
 [0 1 0 0 1]
 [1 0 1 1 1]]

If you want to skip the dense arr, you need to generate the equivalents of the M.row, M.col, and M.data attributes (the order does not matter):

[0 0 0 0 1 1 2 2 2 2] 
[0 1 2 3 1 4 0 2 3 4] 
[1 1 1 1 1 1 1 1 1 1]

I don't use regex often, so I won't try to fix that part. I assume you want to convert

 '1 1:1 2:1 3:1'

into

 ['1' '1' '1' '2' '1' '3' '1']

But that only gets you as far as the words/label stage.
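For reference, a hedged sketch of what numpy.fromregex can do with this data. As far as the NumPy documentation describes it, fromregex expects a structured dtype with one field per regex group, so dtype=str is not accepted. The field names col and val and the in-memory sample text are assumptions made for this illustration.

```python
import io
import numpy as np

# Same three lines as the sample sparse file, held in memory for the sketch.
text = "1 1:1 2:1 3:1\n0 1:1 4:1\n1 2:1 3:1 4:1\n"

# One structured field per regex group (field names are illustrative).
pair_dtype = [("col", np.int64), ("val", np.int64)]
pairs = np.fromregex(io.StringIO(text), r"(\d+):(\d+)", pair_dtype)

print(pairs["col"])  # column indices of all lines, flattened
print(pairs["val"])  # the matching values
```

The result is a flat list of index:value pairs. The labels and the line boundaries are gone, so the row index of each pair still has to be recovered some other way.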

Going directly to sparse:

def create_sparse(fsparse, lvocab):

    row, col, data = [],[],[]
    with open(fsparse) as fi:
        for i, line in enumerate(fi):
            words = line.strip('\n').split(':')
            words = " ".join(words).split()

            label = int(words[0])
            row.append(i); col.append(0); data.append(label)

            indices = [int(w) for (k, w) in enumerate(words) if k % 2]  # k avoids shadowing the row index i
            for j in indices:   # quick-n-dirty version
                row.append(i); col.append(j); data.append(1)
    return row, col, data

r,c,d = create_sparse(fsparse, 4)
print(r,c,d)
M = sparse.coo_matrix((d,(r,c)))
print(M)
print(M.A)

This produces:

[0, 0, 0, 0, 1, 1, 1, 2, 2, 2, 2] [0, 1, 2, 3, 0, 1, 4, 0, 2, 3, 4] [1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1]
....

The only difference is the data item whose value is 0 (the explicitly stored label of the second line). sparse will take care of that.
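If the stored zero is unwanted, it can also be dropped explicitly, and the shape can be pinned rather than inferred from the largest index. A minimal sketch, assuming the r, c, d lists printed above and a vocabulary size of 4; the names X and y are only illustrative.

```python
import numpy as np
from scipy import sparse

# Triplets as printed above (row, column, value), including the stored 0 label.
r = [0, 0, 0, 0, 1, 1, 1, 2, 2, 2, 2]
c = [0, 1, 2, 3, 0, 1, 4, 0, 2, 3, 4]
d = [1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1]
lvocab = 4  # assumed vocabulary size, as in the call above

# Pin the shape so that trailing all-zero columns are not lost.
M = sparse.coo_matrix((d, (r, c)), shape=(max(r) + 1, lvocab + 1))
stored_before = M.nnz   # counts stored entries, including the explicit 0
M.eliminate_zeros()     # drop explicitly stored zeros in place
stored_after = M.nnz
print(stored_before, stored_after)

# Column 0 holds the labels; the remaining columns hold the features.
csr = M.tocsr()
y = csr[:, 0].toarray().ravel()
X = csr[:, 1:]
print(y)
print(X.toarray())
```

Splitting off column 0 this way gives a label vector and a feature matrix without ever building the full dense array; alternatively the labels could simply be collected in their own list while parsing.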

【Comments】:

Awesome! This is exactly what I was looking for.

【Solution 2】:

(Answered before sklearn was explicitly ruled out)

This is basically the svmlight / libsvm format.

Just use scikit-learn's load_svmlight_file, or the more efficient svmlight-loader. No need to reinvent the wheel here!

from sklearn.datasets import load_svmlight_file

X, y = load_svmlight_file('C:/TEMP/sparse.txt')
print(X)
print(y)
print(X.todense())

Output:

(0, 0)        1.0
(0, 1)        1.0
(0, 2)        1.0
(1, 0)        1.0
(1, 3)        1.0
(2, 1)        1.0
(2, 2)        1.0
(2, 3)        1.0
[ 1.  0.  1.]
[[ 1.  1.  1.  0.]
 [ 1.  0.  0.  1.]
 [ 0.  1.  1.  1.]]

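【Comments】:

Thanks a lot, but I would really appreciate a way to do this with numpy/scipy and without scikit-learn. I am trying to learn how to do it with numpy/scipy. Solving the problem is not the issue; I already found an inefficient way to produce dense.txt, as shown above.
Read their code and use it. What is wrong with that? The faster external one may also be easier to take apart (oops; it is probably Cython-based...). So you are saying this is a learning experience and not real-world use?
Dr. sascha, I am sorry for the confusion. In this question I am trying to learn a way to do this task with numpy/scipy; sklearn is too advanced and too much of an out-of-the-box black box for that.
Let's see if someone will help you. I will give you just one hint: data formats like this are built/designed to be parsed without regular expressions! Regular expressions are inefficient in terms of both computation and memory.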
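As an illustration of that hint, here is a minimal regex-free sketch that parses the format with plain str.split and builds a CSR matrix from the data/indices/indptr triple. The function name load_sparse and the parameter n_features are hypothetical, and the sketch assumes 1-based feature indices as in the sample file; it is not the scikit-learn implementation.

```python
import numpy as np
from scipy import sparse

def load_sparse(lines, n_features):
    """Parse 'label idx:val idx:val ...' lines into (X, y) without regex.

    load_sparse and n_features are hypothetical names for this sketch.
    """
    labels, indices, data, indptr = [], [], [], [0]
    for line in lines:
        parts = line.split()
        if not parts or parts[0].startswith('#'):
            continue  # skip blank lines and comment lines
        labels.append(int(parts[0]))
        for item in parts[1:]:
            idx, val = item.split(':')
            indices.append(int(idx) - 1)  # assumed 1-based indices in the file
            data.append(float(val))
        indptr.append(len(indices))  # marks where this row ends
    X = sparse.csr_matrix((data, indices, indptr),
                          shape=(len(labels), n_features))
    return X, np.array(labels)

sample = ["1 1:1 2:1 3:1", "0 1:1 4:1", "1 2:1 3:1 4:1"]
X, y = load_sparse(sample, n_features=4)
print(y)
print(X.toarray())
```

Because the function only iterates over lines, an open file object can be passed in place of the list, for example `with open('sparse.txt') as f: X, y = load_sparse(f, 4)`.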