How to mmap a 2D array from a text file

【Question title】: How to mmap a 2d array from a text file
【Posted】: 2020-05-17 17:10:27
【Question】:

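I have very large files, each containing a 2D array of positive integers.

Each file contains one matrix.

I would like to process them without reading the files into memory. Luckily, I only need to look at the values from left to right in the input file. I was hoping to mmap each file so that I can process it as if it were in memory, without actually reading the whole file into memory.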

Small example:

[[2, 2, 6, 10, 2, 6, 7, 15, 14, 10, 17, 14, 7, 14, 15, 7, 17], 
 [3, 3, 7, 11, 3, 7, 0, 11, 7, 16, 0, 17, 17, 7, 16, 0, 0], 
 [4, 4, 8, 7, 4, 13, 0, 0, 15, 7, 8, 7, 0, 7, 0, 15, 13], 
 [5, 5, 9, 12, 5, 14, 7, 13, 9, 14, 16, 12, 13, 14, 7, 16, 7]]

Is it possible to mmap such a file so that I can then process the np.int64 values like this?

for i in range(rownumber):
    for j in range(rowlength):
        process(M[i, j])

To be clear, I never want the whole input file in memory, because it will not fit.

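【Comments on the question】:

Numpy's ndarray has a builtin memory-mapped version.
@bnaecker Great. I am given the input as described in the question, so I have no choice about the format of the data I need to process.
numpy.memmap seems to be your best option. I would say convert your large files into a format it is compatible with.
@Pani How can I do that?
@Anush Do you happen to know the size of your input in advance?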

Tags: python numpy mmap

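【Solution 1】:

Updated answer

Based on your comments and clarifications, it appears you actually have a text file full of square brackets, about 4 lines long, with 1,000,000,000 ASCII integers per line, separated by commas. Not a very efficient format! I suggest you simply preprocess the file to remove all square brackets, newlines and spaces and convert the commas to newlines, so that you get one value per line, which is easy to deal with.

Using the tr command to transliterate, that looks like this: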

# Delete all square brackets, newlines and spaces, change commas into newlines
tr -d '[] \n' < YourFile.txt | tr , '\n' > preprocessed.txt

Your file then looks like this, and you can readily process one value at a time in Python.

2
2
6
10
2
6
...
...

If you are on Windows, the tr tool is available in GNUWin32 and in the Windows Subsystem for Linux (and possibly Git Bash).
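If tr is not available at all, the same preprocessing can be sketched in pure Python. This is only an illustration, not part of the original answer: the function name preprocess and the chunk_size parameter are made up here, and the sketch assumes the file contains nothing but non-negative integers, brackets, commas and whitespace. It reads a fixed-size chunk at a time and holds back any digits at the end of a chunk, since they may be the first half of a number that continues in the next chunk.

```python
import re

def preprocess(src_path, dst_path, chunk_size=1 << 20):
    """Stream src_path and write one integer per line to dst_path."""
    carry = ''  # digits at the end of a chunk that may continue in the next one
    with open(src_path, 'r') as src, open(dst_path, 'w') as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            data = carry + chunk
            # Hold back a trailing run of digits; the number may be incomplete
            tail = re.search(r'\d+\Z', data)
            if tail:
                carry = tail.group()
                data = data[:tail.start()]
            else:
                carry = ''
            for number in re.findall(r'\d+', data):
                dst.write(number + '\n')
        # Whatever is still held back at end of file is a complete number
        if carry:
            dst.write(carry + '\n')
```

Calling preprocess('YourFile.txt', 'preprocessed.txt') should then produce the same one-value-per-line file as the tr pipeline, while holding only about one chunk in memory at a time.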

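You can go further still and make a file that you can memmap(), as in the second part of my answer; you can then jump straight to any value in the file. So, taking the preprocessed.txt created above, you can make a binary version like this: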

import struct

# Make binary memmapable version
with open('preprocessed.txt', 'r') as ifile, open('preprocessed.bin', 'wb') as ofile:
    for line in ifile:
        ofile.write(struct.pack('q',int(line)))
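Once preprocessed.bin exists, it can be opened with np.memmap and viewed as a 2D matrix. The following is a minimal sketch under two assumptions: the number of rows is known in advance (4 in the question's example), and the file is read on the same kind of machine that wrote it, so that the native 8-byte layout of struct.pack('q', ...) matches np.int64. The helper name open_matrix is hypothetical.

```python
import os
import numpy as np

def open_matrix(path, rows):
    """Memory-map a flat int64 file as a read-only (rows, cols) matrix."""
    itemsize = np.dtype(np.int64).itemsize
    count = os.path.getsize(path) // itemsize   # total number of values
    if count % rows:
        raise ValueError('value count is not a multiple of the row count')
    # Nothing is read here; pages are loaded only when elements are accessed
    return np.memmap(path, dtype=np.int64, mode='r',
                     shape=(rows, count // rows))
```

With M = open_matrix('preprocessed.bin', rows=4), the nested loop from the question (process(M[i, j]) for each i and j) works as written, and M.shape gives the row count and row length.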

Original answer

You can do it like this. The first part is just setup:

#!/usr/bin/env python3

import numpy as np

# Create 2,4 Numpy array of int64
a = np.arange(8, dtype=np.int64).reshape(2,4)

# Write to file as binary
a.tofile('a.dat')

Now check the file by hex-dumping it in the shell:

xxd a.dat

00000000: 0000 0000 0000 0000 0100 0000 0000 0000  ................
00000010: 0200 0000 0000 0000 0300 0000 0000 0000  ................
00000020: 0400 0000 0000 0000 0500 0000 0000 0000  ................
00000030: 0600 0000 0000 0000 0700 0000 0000 0000  ................

Now that we are all set up, let's memmap() the file:

# Memmap file and access values via 'mm'
mm = np.memmap('a.dat', dtype=np.int64, mode='r', shape=(2,4))

print(mm[1,2])      # prints 6

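【Discussion】:

Can I convert my input into this kind of mmapped file without ever storing it all in memory? That is what I am trying to achieve.
I thought you already had that... your question says you have large files. What do you actually have?
I have the large files described in the question. Can I mmap them directly, without any conversion?
I can't see any file attached to the question or created in it.
It says "Small example".

【Solution 2】:

The main problem is that the file is too big, and it does not seem to be split into lines either. (For reference, array.txt is the example you provided and arr_map.dat is an empty file.)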

import re
import numpy as np 

N = [str(i) for i in range(10)]
arrayfile = 'array.txt'
mmapfile = 'arr_map.dat'
R = 4
C = 17
CHUNK = 20

def read_by_chunk(file, chunk_size=CHUNK):
    return file.read(chunk_size)

fp = np.memmap(mmapfile, dtype=np.uint8, mode='w+', shape=(R,C)) 

with open(arrayfile,'r') as f:
    curr_row = curr_col = 0
    while True:
        data = read_by_chunk(f)
        if not data:
            break

        # Make sure that chunk reading does not break a number
        # (stop at end of file, where read() returns an empty string)
        while data[-1] in N:
            extra = read_by_chunk(f, 1)
            if not extra:
                break
            data += extra

        # Convert chunk into numpy array
        nums = np.array(re.findall(r'[0-9]+', data)).astype(np.uint8)
        num_len = len(nums)

        if num_len == 0:
            break

        # CASE 1: Number chunk can fit into current row
        if curr_col + num_len <= C: 
            fp[curr_row, curr_col : curr_col + num_len] = nums

            curr_col = curr_col + num_len

        # CASE 2: Number chunk has to be split into current and next row
        else: 
            col_remaining = C-curr_col
            fp[curr_row, curr_col : C] = nums[:col_remaining] # Fill in row i

            curr_row, curr_col = curr_row+1, 0                # Move to row i+1 and fill the rest
            fp[curr_row, :num_len-col_remaining] = nums[col_remaining:]

            curr_col = num_len-col_remaining

        if curr_col>=C:
            curr_col = curr_col%C
            curr_row += 1

        #print('\n--debug--\n',fp,'\n--debug--\n')

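Basically, read the array file a small chunk at a time (making sure not to split a number), use a regex to pick the numbers out from junk characters such as commas and brackets, and then insert the numbers into the memory map.

【Discussion】:

You are right, it is not split into lines. I will try your answer as soon as I can.
Adjust the CHUNK size as needed. CHUNK=20 basically means reading 20 characters from the file at a time. When reading large files, you will probably want to read more at a time to speed things up.

【Solution 3】:

The situation you describe seems better suited to a generator that fetches the next integer, or the next row, from the file and lets you process it.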

def sanify(s):
    while s.startswith('['):
        s = s[1:]
    while s.endswith(']'):
        s = s[:-1]
    return int(s)


def get_numbers(file_obj):
    file_obj.seek(0)
    i = j = 0
    for line in file_obj:
        for item in line.split(', '):
            if item and not item.isspace():
                yield sanify(item), i, j
                j += 1
        i += 1
        j = 0

This ensures that only one line at a time resides in memory.

It can be used like this:

import io


s = '''[[2, 2, 6, 10, 2, 6, 7, 15, 14, 10, 17, 14, 7, 14, 15, 7, 17], 
[3, 3, 7, 11, 3, 7, 0, 11, 7, 16, 0, 17, 17, 7, 16, 0, 0], 
[4, 4, 8, 7, 4, 13, 0, 0, 15, 7, 8, 7, 0, 7, 0, 15, 13], 
[5, 5, 9, 12, 5, 14, 7, 13, 9, 14, 16, 12, 13, 14, 7, 16, 7]]'''


items = get_numbers(io.StringIO(s))
for item, i, j in items:
    print(item, i, j)

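If you really want to be able to access arbitrary elements of the matrix, you can adapt the logic above into a class that implements __getitem__; you only need to keep track of the position where each line starts. In code, this looks like: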

class MatrixData(object):
    def __init__(self, file_obj):
        self._file_obj = file_obj
        self._line_offsets = list(self._get_line_offsets(file_obj))[:-1]
        file_obj.seek(0)
        row = list(self._read_row(file_obj.readline()))
        self.shape = len(self._line_offsets), len(row)
        self.length = self.shape[0] * self.shape[1]


    def __len__(self):
        return self.length


    def __iter__(self):
        self._file_obj.seek(0)
        i = j = 0
        for line in self._file_obj:
            for item in self._read_row(line):
                yield item, i, j
                j += 1
            i += 1
            j = 0


    def __getitem__(self, indices):
        i, j = indices
        self._file_obj.seek(self._line_offsets[i])
        line = self._file_obj.readline()
        row = list(self._read_row(line))
        return row[j]


    @staticmethod
    def _get_line_offsets(file_obj):
        file_obj.seek(0)
        yield file_obj.tell()
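        # Use readline() instead of iterating: on a real text file, tell()
        # raises OSError while the file is being iterated over
        while file_obj.readline():
            yield file_obj.tell()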


    @staticmethod
    def _read_row(line):
        for item in line.split(', '):
            if item and not item.isspace():
                yield MatrixData._sanify(item)


    @staticmethod
    def _sanify(item, dtype=int):
        while item.startswith('['):
            item = item[1:]
        while item.endswith(']'):
            item = item[:-1]
        return dtype(item)

To be used as:

m = MatrixData(io.StringIO(s))

# get total number of elements
len(m)

# get number of row and col
m.shape

# access a specific element
m[3, 12]

# iterate through
for x, i, j in m:
    ...

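【Discussion】:

@Anush The edited version now has a class that lets you use it with the syntax you originally wanted.

【Solution 4】:

This seems to be exactly what the mmap module in Python does. See: https://docs.python.org/3/library/mmap.html

Example from the documentation: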

import mmap

# write a simple example file
with open("hello.txt", "wb") as f:
    f.write(b"Hello Python!\n")

with open("hello.txt", "r+b") as f:
    # memory-map the file, size 0 means whole file
    mm = mmap.mmap(f.fileno(), 0)
    # read content via standard file methods
    print(mm.readline())  # prints b"Hello Python!\n"
    # read content via slice notation
    print(mm[:5])  # prints b"Hello"
    # update content using slice notation;
    # note that new content must have same size
    mm[6:] = b" world!\n"
    # ... and read again using standard file methods
    mm.seek(0)
    print(mm.readline())  # prints b"Hello  world!\n"
    # close the map
    mm.close()

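【Discussion】:

This seems to treat the input as a string, I think. Using your method on my example, print(mm[:5]) gives b'[[2, ', but I would like M[1,3] to be 11, for example.
It does treat it as a string. You can parse it.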
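One way such parsing might look is sketched below. It is an illustration only, not the answerer's code: the name iter_matrix is made up, and the number of columns is assumed to be known in advance (17 in the question's example). The re module can scan an mmap object directly, so the integers are pulled out lazily, left to right, without the text ever being read into a Python string.

```python
import mmap
import re

def iter_matrix(path, cols):
    """Yield (value, row, col) for each integer in a memory-mapped text file."""
    with open(path, 'rb') as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # finditer scans the mapped bytes lazily, one match at a time
            for n, match in enumerate(re.finditer(rb'\d+', mm)):
                # Position n in reading order gives the row and column
                yield int(match.group()), n // cols, n % cols
```

This fits the left-to-right access pattern from the question. It does not give constant-time access to M[i, j], because the numbers in the text have varying widths; for that, a fixed-width binary file as in Solution 1 is needed.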

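【Solution 5】:

It depends on the operations you want to perform on the input matrix. If they are matrix operations, you can work with partial matrices: most of the time you can process the input file in small batches, each treated as a partial matrix, which lets you process the file very efficiently. You only need to develop an algorithm that reads and processes the input piece by piece and caches the results. For some operations you may only need to decide on the best representation of the input matrix (i.e. row major or column major).

The main advantage of the partial-matrix approach is that you can use parallel processing techniques to process n partial matrices per iteration, for example on a CUDA GPU. If you are familiar with C or C++, using the Python C API may reduce the running time of partial-matrix operations considerably, but even plain Python would not be much worse, because you only need to process your partial matrices with Numpy.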
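As a rough illustration of the partial-matrix idea, the sketch below walks a memory-mapped matrix one block of columns at a time and accumulates a per-row result. Several things here are assumptions rather than part of the answer: the input is a flat binary int64 file (such as the one produced in Solution 1), its shape is known, and row sums merely stand in for whatever operation is actually needed. The names row_sums and block_cols are hypothetical.

```python
import numpy as np

def row_sums(path, shape, block_cols=1_000_000):
    """Sum each row of an int64 matrix file, one block of columns at a time."""
    mm = np.memmap(path, dtype=np.int64, mode='r', shape=shape)
    totals = np.zeros(shape[0], dtype=np.int64)   # cached partial results
    for start in range(0, shape[1], block_cols):
        # A partial matrix: all rows, a limited range of columns
        block = mm[:, start:start + block_cols]
        totals += block.sum(axis=1)
    return totals
```

Only one block needs to be resident at a time, and since the blocks are independent, they could in principle be handed to separate workers and the partial totals combined afterwards.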

【Discussion】: