Converting a 1.2GB list of edges into a sparse matrix: answers

【Question title】: Converting a 1.2GB list of edges into a sparse matrix
【Posted】: 2016-12-05 21:32:04
【Question description】:

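I have a 1.2GB list of edges from a graph in a text file. My Ubuntu PC has 8GB of RAM. Each line in the input looks like

287111206 357850135

I would like to convert it to a sparse adjacency matrix and output that to a file.

Some statistics for my data: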

Number of edges: around 62500000
Number of vertices: around 31250000

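I asked the same question before at https://stackoverflow.com/a/38667644/2179021 and got a great answer. The problem is that I can't get it to work.

I first tried np.loadtxt to load the file, but it was very slow and used a huge amount of memory. So I moved to pandas.read_csv instead, which is very fast but caused its own problems. This is my current code: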

import pandas
import numpy as np
from scipy import sparse

data = pandas.read_csv("edges.txt", sep=" ", header= None, dtype=np.uint32)
A = data.as_matrix()
print type(A)
k1,k2,k3=np.unique(A,return_inverse=True,return_index=True)
rows,cols=k3.reshape(A.shape).T
M=sparse.coo_matrix((np.ones(rows.shape,int),(rows,cols)))
print type(M)

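The problem is that the pandas dataframe data is huge and I am effectively making a copy of it in A, which is inefficient. However, things are even worse, as the code crashes with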

<type 'instancemethod'>
Traceback (most recent call last):
  File "make-sparse-matrix.py", line 13, in <module>
    rows,cols=k3.reshape(A.shape).T
AttributeError: 'function' object has no attribute 'shape'
raph@raph-desktop:~/python$ python make-sparse-matrix.py 
<type 'numpy.ndarray'>
Traceback (most recent call last):
  File "make-sparse-matrix.py", line 12, in <module>
    k1,k2,k3=np.unique(A,return_inverse=True,return_index=True)
  File "/usr/local/lib/python2.7/dist-packages/numpy/lib/arraysetops.py", line 209, in unique
    iflag = np.cumsum(flag) - 1
  File "/usr/local/lib/python2.7/dist-packages/numpy/core/fromnumeric.py", line 2115, in cumsum
    return cumsum(axis, dtype, out)
MemoryError

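So my questions are:

Can I avoid having both the 1.2GB pandas dataframe and the 1.2GB numpy array copy in memory at the same time?
Is there any way to get the code to complete in 8GB of RAM?

You can reproduce a test input of the size I am trying to process with: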

import random
#Number of edges, vertices
m = 62500000
n = m/2
for i in xrange(m):
    fromnode = str(random.randint(0, n-1)).zfill(9)
    tonode = str(random.randint(0, n-1)).zfill(9)
    print fromnode, tonode

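Update

I have now tried a number of different approaches, all of which have failed. Here is a summary.

Using igraph with g = Graph.Read_Ncol('edges.txt'). This uses a huge amount of RAM, which crashes my computer.
Using networkit with G= networkit.graphio.readGraph("edges.txt", networkit.Format.EdgeList, separator=" ", continuous=False). This uses a huge amount of RAM, which crashes my computer.
The code above in this question, but using np.loadtxt("edges.txt") instead of pandas. This uses a huge amount of RAM, which crashes my computer.

I then wrote separate code that remapped all the vertex names to numbers from 1..|V|, where |V| is the total number of vertices. This should save the code that imports the edge list from having to build up a table that maps the vertex names. Using this I tried:

Using this new remapped edge list file, I tried igraph again with g = Graph.Read_Edgelist("edges-contig.txt"). This now works, although it takes 4GB of RAM (which is far more than the theoretical amount it should need). However, there is no igraph function to write out a sparse adjacency matrix from a graph. The recommended solution is to convert the graph to a coo_matrix. Unfortunately this uses a huge amount of RAM, which crashes my computer.
Using the remapped edge list file, I used networkit with G = networkit.readGraph("edges-contig.txt", networkit.Format.EdgeListSpaceOne). This also works, using less than the 4GB that igraph needs. networkit also comes with a function to write Matlab files (a form of sparse adjacency matrix that scipy can read). However, networkit.graphio.writeMat(G,"test.mat") uses a huge amount of RAM, which crashes my computer.

Finally, sascha's answer does complete, but it takes about 40 minutes.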

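【Question discussion】:

Are all the numbers in each column unique?
@khredos No, they are not. My example code for generating fake data is unrealistic in that respect.
@eleanora What is the shape of A? Hmm... never mind... A does not seem to be an array/matrix. That's bad! Show us data.head()! And how did you use np.loadtxt()?
Regarding the copy: you could set data=None after .as_matrix. But the problem here occurs before that!
Maybe this solves the problem: stackoverflow.com/questions/1938894/…

Tags: python pandas numpy optimization scipy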

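【Solution 1】:

In my answer I consider the case where the node ids are given as 9-character strings, with each character drawn from [0-9A-Za-z]. The n distinct node ids should be mapped onto the values [0,n-1] (which might not be necessary for your application, but is still of general interest).

I'm sure you are already aware of the following considerations; I list them for the sake of completeness:

Memory is the bottleneck.
There are around 10^8 strings in the file.
A 9-character string + int32 value pair costs around 120 bytes in a dictionary, resulting in 12GB of memory usage for the file.
A string id from the file can be mapped onto an int64: there are 62 different characters -> each can be encoded with 6 bits, and there are 9 characters in the string -> 6*9=54 bits, which fits into 64 bits. See also the toInt64() function further below.
That is int64+int32=12 bytes of "real" information => ca. 1.2 GB would be enough, yet the cost of such a pair in a dictionary is around 60 bytes (around 6 GB of RAM needed).
Creating small objects (on the heap) results in a lot of memory overhead, so bundling these objects in arrays is advantageous. Interesting information about the memory used by Python objects can be found in this tutorial-style article. Interesting experiences with reducing memory usage are described in this blog entry.
A Python list is out of the question as a data structure, as is a dictionary. array.array could be an alternative, but we use np.array (because there are sorting algorithms for np.array but not for array.array).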
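
As a rough sanity check of the figures above, the raw footprint of a flat NumPy array is just the element count times the item size. The helper name array_bytes is made up for illustration; this is only a sketch of the arithmetic, not a measurement.

```python
import numpy as np

def array_bytes(n_items, dtype):
    """Bytes needed for a flat array of n_items elements of the given dtype.

    Hypothetical helper, used here only to illustrate the arithmetic.
    """
    return n_items * np.dtype(dtype).itemsize

n_strings = 10**8  # roughly the number of node ids in the file
print(array_bytes(n_strings, np.int64) / 1e9, "GB as int64")
print(array_bytes(n_strings, np.int32) / 1e9, "GB as int32")
```

This is where the 0.8GB unit used in the steps below comes from: one int64 array holding all node ids.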

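Step 1: Read the file and map the strings to int64. It is a pain to let an np.array grow dynamically, so we assume for now that we know the number of edges in the file (it would be nice to have it in a header, but it can also be deduced from the file size):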

import numpy as np

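def read_nodes(filename, EDGE_CNT):
    """Read all node ids into one preallocated int64 array: [from0, to0, from1, to1, ...]."""
    nodes=np.zeros(EDGE_CNT*2, dtype=np.int64)
    cnt=0
    with open(filename,"r") as f:
        for line in f:
            # use int instead of toInt64 for ids without letters
            nodes[cnt:cnt+2]=[toInt64(tok) for tok in line.split()]
            cnt+=2  # advance to the slot for the next edge
    return nodes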
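
The remark about deducing the edge count from the file size could look roughly like the sketch below. It assumes every line has exactly the fixed-width form produced by the generator in the question (two zero-padded 9-character ids, one space, one "\n", so 20 bytes per line); the function name estimate_edge_count is hypothetical.

```python
import os

# Assumed fixed-width line layout: 9-char id, space, 9-char id, newline.
BYTES_PER_LINE = 9 + 1 + 9 + 1

def estimate_edge_count(filename, bytes_per_line=BYTES_PER_LINE):
    """Estimate the number of edges from the file size alone (hypothetical helper)."""
    size = os.path.getsize(filename)
    # Round up so that a missing final newline does not lose the last edge.
    return (size + bytes_per_line - 1) // bytes_per_line
```

If the ids are not zero-padded or the file uses "\r\n" line endings, this estimate is off and counting lines in a first pass is the safer choice.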

Step 2: Convert the int64 values to values in [0,n-1]:

Possibility A, needs 3*0.8GB:

def maps_to_ids(filename, EDGE_CNT):
    """ return number of different node ids, and the mapped nodes"""
    nodes=read_nodes(filename, EDGE_CNT)
    # return_inverse gives, for every entry of nodes, its index into unique_ids
    unique_ids, nodes = np.unique(nodes, return_inverse=True)
    return (len(unique_ids), nodes)

Possibility B, needs 2*0.8GB, but is somewhat slower:

def maps_to_ids(filename, EDGE_CNT):
    """ return number of different node ids, and the mapped nodes"""
    nodes=read_nodes(filename, EDGE_CNT)
    unique_map = np.unique(nodes)
    for i in xrange(len(nodes)):
        node_id=np.searchsorted(unique_map, nodes[i]) # faster than bisect.bisect
        nodes[i]=node_id  
    return (len(unique_map), nodes)
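
A possible middle ground, which is not part of the original answer: np.searchsorted also accepts whole arrays, so the relabeling can be done chunk by chunk. The extra memory is then bounded by the chunk size, while the per-element Python call disappears. The names relabel_in_place and chunk are illustrative assumptions.

```python
import numpy as np

def relabel_in_place(nodes, chunk=1000000):
    """Replace each id in nodes by its rank among the unique ids, chunk by chunk.

    Hypothetical variant of possibility B; returns the number of unique ids.
    """
    unique_map = np.unique(nodes)  # sorted copy of the distinct ids
    for start in range(0, len(nodes), chunk):
        stop = start + chunk
        # Every id occurs in unique_map, so searchsorted returns its exact index.
        nodes[start:stop] = np.searchsorted(unique_map, nodes[start:stop])
    return len(unique_map)

nodes = np.array([50, 7, 7, 300, 50, 9], dtype=np.int64)
n_unique = relabel_in_place(nodes, chunk=4)
print(n_unique, nodes)
```

Whether this is actually faster on the full data set would need to be measured.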

Step 3: Put it all into a coo_matrix:

from scipy import sparse
def data_as_coo_matrix(filename, EDGE_CNT):
    node_cnt, nodes = maps_to_ids(filename, EDGE_CNT)    
    rows=nodes[::2]#it is only a view, not a copy
    cols=nodes[1::2]#it is only a view, not a copy

    return sparse.coo_matrix((np.ones(len(rows), dtype=bool), (rows, cols)), shape=(node_cnt, node_cnt))
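
To make steps 2 and 3 concrete, here is a toy run on three edges, assuming the ids have already been parsed into integers. It uses np.unique with return_inverse=True, which returns for each input element its position in the sorted array of unique values; the variable names are chosen for illustration only.

```python
import numpy as np
from scipy import sparse

# Three edges, flattened as [from0, to0, from1, to1, from2, to2]
raw = np.array([287111206, 357850135,
                357850135, 512616930,
                287111206, 512616930], dtype=np.int64)

unique_ids, nodes = np.unique(raw, return_inverse=True)
rows = nodes[::2]   # relabeled source vertices (a view, not a copy)
cols = nodes[1::2]  # relabeled target vertices (a view, not a copy)
n = len(unique_ids)

adj = sparse.coo_matrix((np.ones(len(rows), dtype=bool), (rows, cols)), shape=(n, n))
print(adj.shape, adj.nnz)
```

The three distinct ids end up as vertices 0, 1 and 2, so the matrix is 3x3 rather than being sized by the largest raw id.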

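For a call to data_as_coo_matrix("data.txt", 62500000), memory usage peaks at 2.5GB (but with int32 instead of int64 only 1.5GB are needed). It took about 5 minutes on my machine, but my machine is pretty slow...

So what is different from your solution?

I get only the unique values from np.unique (and not all the indices and the inverse), so some memory is saved: I can replace the old ids with the new ones in place.
I have no experience with pandas, so maybe there is some copying involved between pandas and numpy data structures?

What is the difference from sascha's solution?

There is no need to keep the list sorted all the time: it is enough to sort after all items are in the list, which is what np.unique() does. sascha's solution keeps the list sorted the whole time, and you have to pay for this with at least a constant factor, even if the running time stays O(n log(n)). I had assumed that an add operation would be O(n), but as pointed out, it is O(log(n)).

What is the difference from GrantJ's solution?

The resulting sparse matrix has size NxN, with N being the number of distinct nodes, rather than 2^54x2^54 (with very many empty rows and columns).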

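PS:
Here is my idea of how the 9-character string id could be mapped to an int64 value, but I guess this function could become a bottleneck the way it is written and should be optimized.

def toInt64(string):
    """Interpret a string over [0-9A-Za-z] as a base-62 number."""
    res=0
    for ch in string:
        res*=62
        if ch <='9':
            res+=ord(ch)-ord('0')
        elif ch <='Z':
            res+=ord(ch)-ord('A')+10
        else:
            res+=ord(ch)-ord('a')+36
    return res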
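
One way the function above might be tightened, offered only as a sketch: replace the chain of comparisons with a precomputed lookup table. The names DIGIT and to_int64_lookup are made up here, and I have not benchmarked this against the original, so treat any speed-up as an assumption to verify.

```python
import string

# Digit value of each character: 0-9 -> 0..9, A-Z -> 10..35, a-z -> 36..61
DIGIT = {ch: i for i, ch in enumerate(
    string.digits + string.ascii_uppercase + string.ascii_lowercase)}

def to_int64_lookup(node_id):
    """Base-62 value of a string over [0-9A-Za-z] (hypothetical variant of toInt64)."""
    res = 0
    for ch in node_id:
        res = res * 62 + DIGIT[ch]
    return res

largest = to_int64_lookup("zzzzzzzzz")  # largest possible 9-character id
print(largest < 2**54)  # 9 base-62 digits fit in 54 bits, so int64 is safe
```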

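【Discussion】:

I did not check your solution, but the answer looks good (assumptions, description, comparisons). Just wanted to say that adding an element to a SortedList has a complexity of O(log(n)) (not O(n)), at least amortized. But I agree with your criticism! @ead
I wonder whether pandas can do step 1 in an efficient way. Any pandas experts?

【Solution 2】:

Here is my solution: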

import numpy as np
import pandas as pd
import scipy.sparse as ss

def read_data_file_as_coo_matrix(filename='edges.txt'):
    "Read data file and return sparse matrix in coordinate format."
    data = pd.read_csv(filename, sep=' ', header=None, dtype=np.uint32)
    rows = data[0]  # Not a copy, just a reference.
    cols = data[1]
    ones = np.ones(len(rows), np.uint32)
    matrix = ss.coo_matrix((ones, (rows, cols)))
    return matrix

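Pandas does the heavy lifting for parsing using read_csv. Pandas already stores the data in columnar format. data[0] and data[1] just get references, no copies. Then I feed those to coo_matrix. Benchmarked locally: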

In [1]: %timeit -n1 -r5 read_data_file_as_coo_matrix()
1 loop, best of 5: 14.2 s per loop

Then to save a csr-matrix to a file:

def save_csr_matrix(filename, matrix):
    """Save compressed sparse row (csr) matrix to file.

    Based on http://stackoverflow.com/a/8980156/232571

    """
    assert filename.endswith('.npz')
    attributes = {
        'data': matrix.data,
        'indices': matrix.indices,
        'indptr': matrix.indptr,
        'shape': matrix.shape,
    }
    np.savez(filename, **attributes)

Benchmarked locally:

In [3]: %timeit -n1 -r5 save_csr_matrix('edges.npz', matrix.tocsr())
1 loop, best of 5: 13.4 s per loop

Then load it back from a file:

def load_csr_matrix(filename):
    """Load compressed sparse row (csr) matrix from file.

    Based on http://stackoverflow.com/a/8980156/232571

    """
    assert filename.endswith('.npz')
    loader = np.load(filename)
    args = (loader['data'], loader['indices'], loader['indptr'])
    matrix = ss.csr_matrix(args, shape=loader['shape'])
    return matrix

Benchmarked locally:

In [4]: %timeit -n1 -r5 load_csr_matrix('edges.npz')
1 loop, best of 5: 881 ms per loop
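
As far as I know, SciPy 0.19 and later ship scipy.sparse.save_npz and scipy.sparse.load_npz, which do essentially the same job as the two helpers above. A minimal round trip on a toy matrix, with a temporary file name chosen purely for illustration:

```python
import os
import tempfile

import numpy as np
import scipy.sparse as ss

rows = np.array([0, 1, 2], dtype=np.uint32)
cols = np.array([1, 2, 0], dtype=np.uint32)
ones = np.ones(len(rows), np.uint32)
matrix = ss.coo_matrix((ones, (rows, cols))).tocsr()

# Illustrative path in a fresh temporary directory
path = os.path.join(tempfile.mkdtemp(), 'edges_toy.npz')
ss.save_npz(path, matrix)   # stores data, indices, indptr, shape and format
loaded = ss.load_npz(path)  # comes back as a CSR matrix again

print((matrix != loaded).nnz == 0)
```

save_npz compresses by default; passing compressed=False trades file size for speed, which may matter at this scale.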

And finally test it all:

def test():
    "Test data file parsing and matrix serialization."
    coo_matrix = read_data_file_as_coo_matrix()
    csr_matrix = coo_matrix.tocsr()
    save_csr_matrix('edges.npz', csr_matrix)
    loaded_csr_matrix = load_csr_matrix('edges.npz')
    # Comparison based on http://stackoverflow.com/a/30685839/232571
    assert (csr_matrix != loaded_csr_matrix).nnz == 0

if __name__ == '__main__':
    test()

Running test() takes about 30 seconds:

$ time python so_38688062.py 
real    0m30.401s
user    0m27.257s
sys     0m2.759s

The memory high-water mark was about 1.79 GB.

Note that once you have converted "edges.txt" to "edges.npz" in CSR matrix format, loading it takes less than a second.
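
One thing worth checking on a toy input: when no shape is passed, coo_matrix infers the dimensions from the largest row and column index. With ids that are not relabeled, the matrix is therefore sized by the largest id, not by the number of distinct vertices. The values below are arbitrary illustrations.

```python
import numpy as np
import scipy.sparse as ss

rows = np.array([5, 900], dtype=np.uint32)
cols = np.array([900, 42], dtype=np.uint32)
ones = np.ones(len(rows), np.uint32)

m = ss.coo_matrix((ones, (rows, cols)))
print(m.shape)  # inferred as (max row + 1, max col + 1)

csr = m.tocsr()
print(len(csr.indptr))  # CSR keeps one indptr entry per row, plus one
```

For COO this costs nothing, but after conversion to CSR the indptr array grows with the number of rows, so ids close to 10^9 would mean an indptr with about 10^9 entries.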

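【Discussion】:

Nice! I added csrmat = matrix.tocsr() save_sparse_csr("test", csrmat) at the end (using stackoverflow.com/a/8980156/2179021) and it seems to work great!
If I understand your solution correctly, you don't relabel the nodes, so the resulting sparse matrix has dimension 10^9x10^9? I don't know whether this is a problem, but there would be some penalty if the resulting matrix is converted to CSR or CSC format (compared with relabeled nodes, which would result in a 10^8x10^8 matrix).
I think it is polite to let the bounty run for a few days in case someone else is working on it. If there are no other answers in a few days, I will accept. Thanks again.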
Did you see Walter's comment on your answer?
@eleanora Thanks. I updated the code and benchmarks.

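【Solution 3】:

Apart from the methods already used, I have been trying the other approaches available. I found that the following do well.

Method 1 - Read the file into a string and parse the string into a 1-D array using numpy's fromstring.

import numpy as np
import scipy.sparse as sparse

def readEdges():
    with open('edges.txt') as f:
        data = f.read()  
    edges = np.fromstring(data, dtype=np.int32, sep=' ')
    edges = np.reshape(edges, (edges.shape[0]//2, 2))  # // keeps the row count an integer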
    ones = np.ones(len(edges), np.uint32)
    cooMatrix = sparse.coo_matrix((ones, (edges[:,0], edges[:,1])))
%timeit -n5 readEdges()

Output:

5 loops, best of 3: 13.6 s per loop

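Method 2 - Same as method 1, except that instead of loading the file into a string, it uses the memory-mapped interface.

import contextlib
import mmap

def readEdgesMmap():
    with open('edges.txt') as f:
        with contextlib.closing(mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)) as m: 
            edges = np.fromstring(m, dtype=np.int32, sep=' ')
            edges = np.reshape(edges, (edges.shape[0]//2, 2))  # // keeps the row count an integer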
            ones = np.ones(len(edges), np.uint32)
            cooMatrix = sparse.coo_matrix((ones, (edges[:,0], edges[:,1])))
%timeit -n5 readEdgesMmap()

Output:

5 loops, best of 3: 12.7 s per loop

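Monitored with /usr/bin/time, both methods use a maximum of about 2GB of memory.

A few notes:

It seems to do slightly better than pandas read_csv. Using pandas read_csv, the output on the same machine is

5 loops, best of 3: 16.2 s per loop
Conversion from COO to CSR/CSC consumes significant time too. In @GrantJ's answer it takes less time because the COO matrix initialization is incorrect; the arguments need to be given as a tuple. I wanted to leave a comment there, but I don't have commenting rights yet.
My guess as to why this is slightly better than pandas read_csv is the prior assumption of 1D data.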
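
A related option, also raised in the discussion of this answer, is np.fromfile in text mode, which parses straight from the file instead of going through one big string. A small sketch on a two-line toy file (the path is illustrative):

```python
import os
import tempfile

import numpy as np

# Illustrative toy file in a fresh temporary directory
path = os.path.join(tempfile.mkdtemp(), 'edges_toy.txt')
with open(path, 'w') as f:
    f.write('287111206 357850135\n512616930 441657273\n')

# With a non-empty sep the file is parsed as text; a space in sep matches
# any run of whitespace, so the newlines between edges are handled as well.
edges = np.fromfile(path, dtype=np.int32, sep=' ')
edges = edges.reshape(-1, 2)  # one row per edge
print(edges)
```

Note that int32 only works while every id stays below 2^31; larger ids would need a wider dtype.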

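【Discussion】:

You now have enough rep to comment :)
I am a little confused about the shape of the matrix in your answer. It should be "number of nodes" by "number of nodes", shouldn't it? Is that what you have?
Yes, that is how it should be. I tested with the code you shared. The size I get for the cooMatrix above is (31250000, 31250000).
One could also use edges = np.fromfile('data.txt', dtype=np.int32, sep=' '), which needs less memory because the whole text file is not held in memory the entire time. It is a little slower though (not sure why...)
Thanks for spotting the bug. It is an unfortunate function signature. I have updated my code and benchmarks.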

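【Solution 4】:

Updated version

As indicated in the comments, the approach did not fit your use case. Let's make some changes:

use pandas for reading in the data (instead of numpy: I'm quite surprised that np.loadtxt performs that badly!)
use the external library sortedcontainers for a more memory-efficient approach (instead of a dictionary)
the basic approach is the same

This approach takes ~45 minutes (which is slow, but you can pickle/save the result so that you only need to do it once) and ~5 GB of memory to prepare the sparse matrix for your data, generated with: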

import random
N = 62500000
for i in xrange(N):
    print random.randint(10**8,10**9-1), random.randint(10**8,10**9-1)

Code

import numpy as np
from scipy.sparse import coo_matrix
import pandas as pd
from sortedcontainers import SortedList
import time

# Read data
# global memory usage after: one big array
df = pd.read_csv('EDGES.txt', delimiter=' ', header=None, dtype=np.uint32)
data = df.values  # df.as_matrix() in older pandas; that method was removed in pandas 1.0
df = None
n_edges = data.shape[0]

# Learn mapping to range(0, N_VERTICES)  # N_VERTICES unknown
# global memory usage after: one big array + one big searchtree
print('fit mapping')
start = time.time()
observed_vertices = SortedList()
mappings = np.arange(n_edges*2, dtype=np.uint32)  # upper bound on vertices
for column in range(data.shape[1]):
    for row in range(data.shape[0]):
        # double-loop: slow, but easy to understand space-complexity
        val = data[row, column]
        if val not in observed_vertices:
            observed_vertices.add(val)
mappings = mappings[:len(observed_vertices)]
n_vertices = len(observed_vertices)
end = time.time()
print(' secs: ', end-start)

print('transform mapping')
# Map original data (in-place !)
# global memory usage after: one big array + one big searchtree(can be deleted!)
start = time.time()
for column in range(data.shape[1]):
    for row in range(data.shape[0]):
        # double-loop: slow, but easy to understand space-complexity
        val = data[row, column]
        mapper_pos = observed_vertices.index(val)
        data[row, column] = mappings[mapper_pos]
end = time.time()
print(' secs: ', end-start)
observed_vertices = None  # if not needed anymore
mappings = None  # if not needed anymore

# Create sparse matrix (only caring about a single triangular part for now)
# if needed: delete dictionary before as it's not needed anymore!
sp_mat = coo_matrix((np.ones(n_edges, dtype=bool), (data[:, 0], data[:, 1])), shape=(n_vertices, n_vertices))

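First version

Here is very simple and very inefficient (in terms of both time and space) code to build this sparse matrix. I post this code because I believe it is important to understand the core parts if one is going to use them in something bigger.

Let's see whether this code is efficient enough for your use case or whether it needs work. It is hard to tell from a distance, because we don't have your data.

The dictionary part used for the mapping is a candidate for blowing up your memory. But it is pointless to optimize it without knowing whether that is needed at all, especially as this part of the code depends on the number of vertices in your graph (and I have no knowledge of this cardinality).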

""" itertools.count usage here would need changes for py2 """

import numpy as np
from itertools import count
from scipy.sparse import coo_matrix


# Read data
# global memory usage after: one big array
data = np.loadtxt('edges.txt', np.uint32)
n_edges = data.shape[0]
#print(data)
#print(data.shape)

# Learn mapping to range(0, N_VERTICES)  # N_VERTICES unknown
# global memory usage after: one big array + one big dict 
index_gen = count()
mapper = {}
for column in range(data.shape[1]):
    for row in range(data.shape[0]):
        # double-loop: slow, but easy to understand space-complexity
        val = data[row, column]
        if val not in mapper:
            mapper[val] = next(index_gen)
n_vertices = len(mapper)

# Map original data (in-place !)
# global memory usage after: one big array + one big dict (can be deleted!)
for column in range(data.shape[1]):
    for row in range(data.shape[0]):
        # double-loop: slow, but easy to understand space-complexity
        data[row, column] = mapper[data[row, column]]
#print(data)

# Create sparse matrix (only caring about a single triangular part for now)
# if needed: delete dictionary before as it's not needed anymore!
sp_mat = coo_matrix((np.ones(n_edges, dtype=bool), (data[:, 0], data[:, 1])), shape=(n_vertices, n_vertices))
#print(sp_mat)

Output for edges-10.txt:

[[287111206 357850135]
 [512616930 441657273]
 [530905858 562056765]
 [524113870 320749289]
 [149911066 964526673]
 [169873523 631128793]
 [646151040 986572427]
 [105290138 382302570]
 [194873438 968653053]
 [912211115 195436728]]
(10, 2)
[[ 0 10]
 [ 1 11]
 [ 2 12]
 [ 3 13]
 [ 4 14]
 [ 5 15]
 [ 6 16]
 [ 7 17]
 [ 8 18]
 [ 9 19]]
  (0, 10)   True
  (1, 11)   True
  (2, 12)   True
  (3, 13)   True
  (4, 14)   True
  (5, 15)   True
  (6, 16)   True
  (7, 17)   True
  (8, 18)   True
  (9, 19)   True

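【Discussion】:

Unfortunately this doesn't work. To start with, np.loadtxt seems to be very inefficient (which is why I switched to pandas). I tested it on a file with 1/5 of the edges I need, because it crashes when I use the whole file. It takes over 1 minute to load (it should take a few seconds) and it uses about 2GB of RAM just to load the data!
Are you sure your file is not corrupted? Missing values or something like that? This should not be that inefficient, nor should there be the strange pandas behaviour from before.
I don't think anything is wrong with it. You can easily reproduce the problem by running for i in xrange(N): print random.randint(10**8,10**9-1), random.randint(10**8,10**9-1), writing the output to a file and then running your code. There was a bug in my code before, but I have fixed it.
@eleanora I made some changes above. This should work if you have >= 5GB of memory. You may need to modify the code to store the final matrix and the inverse mapper if they are needed later. I would do this conversion only once.
Thanks for the update. I did try your new code, but it still kills my computer. Would you mind testing it on data generated by the new edge list creation code in my question? If you run it with "/usr/bin/time -v python sascha.py" it will tell you how much memory it uses, assuming you are on Linux.

【Solution 5】:

You might want to take a look at the igraph project, a GPL library of C code that is designed for this kind of thing and has a nice Python API. I think in your case your Python code would be something like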

from igraph import Graph
g = Graph.Read_Edgelist('edges.txt')
g.write_adjacency('adjacency_matrix.txt')
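
One caveat before trying this at full scale: as noted in the discussion below, write_adjacency writes a dense matrix. A back-of-the-envelope estimate shows why that is not feasible here. It assumes, optimistically, a single byte per dense entry, and int32 indices plus a 1-byte value per edge for the sparse form; both are assumptions made only for the arithmetic.

```python
n_vertices = 31250000
n_edges = 62500000

# Dense adjacency matrix: at least one byte per entry, n_vertices**2 entries.
dense_bytes = n_vertices ** 2

# COO form: int32 row index, int32 column index and a 1-byte value per edge.
coo_bytes = n_edges * (4 + 4 + 1)

print(dense_bytes / 1e12, "TB dense (lower bound)")
print(coo_bytes / 1e6, "MB sparse")
```

The gap of roughly six orders of magnitude is the reason the other answers build a coo_matrix instead.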

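【Discussion】:

It should be g = Graph.Read_Ncol('edges.txt'), I believe. One problem is that g.write_adjacency writes a dense matrix, which is far too large. Maybe there is a workaround?
Another problem is that g = Graph.Read_Ncol('edges.txt') seems to use a huge amount of RAM. You can test it using the data generation code in my question with N = 12500000.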