Python: how can I get a mutable slice pointing to a byte array? Answers

【Question title】: Python: how can I get a mutable slice pointing to a byte array?
【Posted】: 2016-02-15 07:44:41
【Question】:

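I want a version of buffer that points into a bytearray and is mutable. I want to pass it to I/O functions such as io.BufferedIOBase.readinto() without paying for a memory allocation on every loop iteration.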

import sys, struct

ba = bytearray(2000)
lenbuf = bytearray(8)

with open(sys.argv[1], "rb") as fp:  # binary mode, since we read raw bytes
  while True:
    fp.readinto(lenbuf)  # efficient version of fp.read(8)
    dat_len, = struct.unpack("Q", lenbuf)  # unpack() returns a 1-tuple
    buf = buffer(ba, 0, dat_len)
    fp.readinto(buf)  # efficient version of fp.read(dat_len), but
                      # yields TypeError: must be read-write buffer, not buffer
    my_parse(buf)

I also tried buf = memoryview(buffer(ba, 0, length)), but got (essentially) the same error.

I don't think that using Python should mean ignoring run-time performance.

I'm using Python 2.6, the default install on CentOS 6, but I can switch to 2.7 or 3.x if it is really necessary.

Thanks!

Update

~~I was confused about how bytearray behaves with slices. The transcript below suggests that I can simply take a slice out of a bytearray:~~

>>> x = bytearray(4*10**9)
>>> cProfile.run('x[10:13]="abc"')
         2 function calls in 0.000 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.000    0.000 <string>:1(<module>)
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

>>> x.count(b'\x00')
3999999997
>>> len(x)
4000000000

>>> cProfile.run('x[10:13]="abcd"')  # intentionally try an inefficient case
         2 function calls in 0.750 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.750    0.750    0.750    0.750 <string>:1(<module>)
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

>>> len(x)
4000000001

However, the "mutable slice" does not work as expected when assigning a single byte:

>>> x = bytearray(4*10**9)
>>> x = bytearray(10)
>>> x[2] = 0xff
>>> x.count(b'\x00')
9
>>> x[3:5][0] = 0xff
>>> x.count(b'\x00')
9  # WHAT

I won't actually use single-byte assignment in my application, but I'm worried that I have some fundamental misunderstanding.
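The surprise in the last transcript comes from the fact that reading a slice of a bytearray (x[3:5]) builds a new bytearray holding a copy, so the single-byte write lands in a temporary object, whereas slice assignment (x[10:13] = ...) operates on the original. A minimal sketch, assuming Python 2.7 or 3.x where memoryview is available, of the difference between the copying slice and a memoryview slice that shares memory with the bytearray:

```python
x = bytearray(10)

# Reading a slice of a bytearray copies: the write below lands in a
# temporary object and x is left untouched.
x[3:5][0] = 0xFF
assert x.count(b"\x00") == 10

# A memoryview slice is a window onto the same memory, so the write
# is visible through x itself.
view = memoryview(x)[3:5]
view[0:1] = b"\xff"
assert x[3] == 0xFF
assert x.count(b"\x00") == 9
```

Because the memoryview is writable, it can also be handed to functions that require a read-write buffer.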

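【Comments】:

Why do you need a buffer when all the functions you mention actually accept a bytearray?
Because those I/O functions try to fill len(buf) bytes, whereas I want to keep reusing a single "long enough" buffer (bytearray(2000)).
I'd be curious to know whether there is any performance difference between your code and @ALGOholic's. Frankly, with garbage collection in play, trying to fix a presumed memory-allocation overhead is rather bold.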
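The reuse pattern discussed in these comments (one long-enough buffer, filled only up to the record length) can be sketched with the standard library alone. This is only a sketch, assuming Python 3; read_records and the in-memory sample data are hypothetical names introduced for illustration, and memoryview is used here in place of the buffer object from the question.

```python
import io
import struct


def read_records(fp, max_len=2000):
    """Yield each length-prefixed record as a view into one reused buffer."""
    ba = bytearray(max_len)          # allocated once, reused for every record
    whole = memoryview(ba)           # writable view over the whole buffer
    lenbuf = bytearray(8)
    while fp.readinto(lenbuf) == 8:  # stop at EOF or on a truncated header
        (dat_len,) = struct.unpack("Q", lenbuf)
        view = whole[:dat_len]       # no copy: a window onto ba
        # A short read also covers records longer than max_len, because
        # the slice above is clamped to the buffer size.
        if fp.readinto(view) != dat_len:
            break
        yield view


# In-memory stand-in for the input file (hypothetical sample data).
data = b"".join(struct.pack("Q", len(p)) + p for p in (b"abc", b"hello"))
records = [v.tobytes() for v in read_records(io.BytesIO(data))]
```

Each yielded view is only valid until the next record overwrites the buffer, so the caller should parse or copy it before advancing. Whether this is measurably faster than plain fp.read() is not something the sketch establishes.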

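Tags: python python-2.7 io

【Solution 1】:

You could let it read excess data, and then simply consume all of the excess data in the bytearray before reading more from the file.

Otherwise, you can use numpy: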

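import sys, struct
import numpy as np

buf = np.zeros(2000, dtype=np.uint8)  # reusable read-write buffer
lenbuf = bytearray(8)

with open(sys.argv[1], "rb") as fp:  # binary mode, since we read raw bytes
    while fp.readinto(lenbuf) == len(lenbuf):  # stop at end of file
        dat_len, = struct.unpack("Q", lenbuf)  # unpack() returns a 1-tuple
        fp.readinto(buf[:dat_len])  # the slice is a view, so buf itself is filled
        my_parse(buf[:dat_len])  # my_parse is the asker's own function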

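numpy creates the read-write buffer you need, and indexing with [:dat_len] returns a "view" of that subset of the data rather than a copy. Since numpy arrays conform to the buffer protocol, you can also pass them to struct.unpack() as if they were bytearrays or buffers.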
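To check the view-not-copy claim without a real file, here is a small sketch that uses io.BytesIO as a stand-in for the file object; the payload and the 16-byte buffer size are made up for illustration, and np.shares_memory assumes a reasonably recent numpy.

```python
import io
import struct

import numpy as np

buf = np.zeros(16, dtype=np.uint8)  # reusable read-write buffer
payload = b"hello"
# In-memory stand-in for the file: 8-byte length header, then the data.
fp = io.BytesIO(struct.pack("Q", len(payload)) + payload)

lenbuf = bytearray(8)
fp.readinto(lenbuf)
(dat_len,) = struct.unpack("Q", lenbuf)

view = buf[:dat_len]   # basic slicing returns a view, not a copy
n = fp.readinto(view)  # fills at most len(view) bytes

assert np.shares_memory(view, buf)
assert n == dat_len
assert buf[:dat_len].tobytes() == payload
```

The bytes written through the slice show up in buf itself, which is what makes the single preallocated array reusable across records.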

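【Comments】:

"You could let it read excess data, and then simply consume all of the excess data in the bytearray before reading more from the file." Sorry, but you are essentially saying that I should implement buffered I/O myself here. Thanks for introducing me to the NumPy array type, though.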