Why is reading one byte 20x slower than reading 2, 3, 4, ... bytes from a file? Answers

【Question title】: Why is reading one byte 20x slower than reading 2, 3, 4, ... bytes from a file?
【Posted】: 2017-05-28 06:56:46
【Question】:

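I have been trying to understand the tradeoff between read and seek. For small "jumps", reading the unneeded data is faster than skipping it with seek.

While timing different read/seek chunk sizes to find the tipping point, I came across an odd phenomenon: read(1) is about 20 times slower than read(2), read(3), etc. The effect is the same for different read methods, e.g. read() and readinto().

Why is this the case?

Search the timing results for the following line, about 2/3 of the way through: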

2 x buffered 1 byte readinto bytearray

Environment:

Python 3.5.2 |Continuum Analytics, Inc.| (default, Jul  5 2016, 11:45:57) [MSC v.1900 32 bit (Intel)]

Timing results:

Non-cachable binary data ingestion (file object blk_size = 8192):
- 2 x buffered 0 byte readinto bytearray:
      robust mean: 6.01 µs +/- 377 ns
      min: 3.59 µs
- Buffered 0 byte seek followed by 0 byte readinto:
      robust mean: 9.31 µs +/- 506 ns
      min: 6.16 µs
- 2 x buffered 4 byte readinto bytearray:
      robust mean: 14.4 µs +/- 6.82 µs
      min: 2.57 µs
- 2 x buffered 7 byte readinto bytearray:
      robust mean: 14.5 µs +/- 6.76 µs
      min: 3.08 µs
- 2 x buffered 2 byte readinto bytearray:
      robust mean: 14.5 µs +/- 6.77 µs
      min: 3.08 µs
- 2 x buffered 5 byte readinto bytearray:
      robust mean: 14.5 µs +/- 6.76 µs
      min: 3.08 µs
- 2 x buffered 3 byte readinto bytearray:
      robust mean: 14.5 µs +/- 6.73 µs
      min: 2.57 µs
- 2 x buffered 49 byte readinto bytearray:
      robust mean: 14.5 µs +/- 6.72 µs
      min: 2.57 µs
- 2 x buffered 6 byte readinto bytearray:
      robust mean: 14.6 µs +/- 6.76 µs
      min: 3.08 µs
- 2 x buffered 343 byte readinto bytearray:
      robust mean: 15.3 µs +/- 6.43 µs
      min: 3.08 µs
- 2 x buffered 2401 byte readinto bytearray:
      robust mean: 138 µs +/- 247 µs
      min: 4.11 µs
- Buffered 7 byte seek followed by 7 byte readinto:
      robust mean: 278 µs +/- 333 µs
      min: 15.4 µs
- Buffered 3 byte seek followed by 3 byte readinto:
      robust mean: 279 µs +/- 333 µs
      min: 14.9 µs
- Buffered 1 byte seek followed by 1 byte readinto:
      robust mean: 279 µs +/- 334 µs
      min: 15.4 µs
- Buffered 2 byte seek followed by 2 byte readinto:
      robust mean: 279 µs +/- 334 µs
      min: 15.4 µs
- Buffered 4 byte seek followed by 4 byte readinto:
      robust mean: 279 µs +/- 334 µs
      min: 15.4 µs
- Buffered 49 byte seek followed by 49 byte readinto:
      robust mean: 281 µs +/- 336 µs
      min: 14.9 µs
- Buffered 6 byte seek followed by 6 byte readinto:
      robust mean: 281 µs +/- 337 µs
      min: 15.4 µs
- 2 x buffered 1 byte readinto bytearray:
      robust mean: 282 µs +/- 334 µs
      min: 17.5 µs
- Buffered 5 byte seek followed by 5 byte readinto:
      robust mean: 282 µs +/- 338 µs
      min: 15.4 µs
- Buffered 343 byte seek followed by 343 byte readinto:
      robust mean: 283 µs +/- 340 µs
      min: 15.4 µs
- Buffered 2401 byte seek followed by 2401 byte readinto:
      robust mean: 309 µs +/- 373 µs
      min: 15.4 µs
- Buffered 16807 byte seek followed by 16807 byte readinto:
      robust mean: 325 µs +/- 423 µs
      min: 15.4 µs
- 2 x buffered 16807 byte readinto bytearray:
      robust mean: 457 µs +/- 558 µs
      min: 16.9 µs
- Buffered 117649 byte seek followed by 117649 byte readinto:
      robust mean: 851 µs +/- 1.08 ms
      min: 15.9 µs
- 2 x buffered 117649 byte readinto bytearray:
      robust mean: 1.29 ms +/- 1.63 ms
      min: 18 µs

Benchmark code:

from _utils import BenchmarkResults

from timeit import timeit, repeat
import gc
import os
from contextlib import suppress
from math import floor
from random import randint

### Configuration

FILE_NAME = 'test.bin'
r = 5000
n = 100

reps = 1

chunk_sizes = list(range(7)) + [7**x for x in range(1,7)]

results = BenchmarkResults(description = 'Non-cachable binary data ingestion')


### Setup

FILE_SIZE = int(100e6)

# remove left over test file
with suppress(FileNotFoundError):
    os.unlink(FILE_NAME)

# determine how large a file needs to be to not fit in memory
gc.collect()
try:
    while True:
        data = bytearray(FILE_SIZE)
        del data
        FILE_SIZE *= 2
        gc.collect()
except MemoryError:
    FILE_SIZE *= 2
    print('Using file with {} GB'.format(FILE_SIZE / 1024**3))

# check enough data in file
required_size = sum(chunk_sizes)*2*2*reps*r
print('File size used: {} GB'.format(required_size / 1024**3))
assert required_size <= FILE_SIZE


# create test file
with open(FILE_NAME, 'wb') as file:
    buffer_size = int(10e6)
    data = bytearray(buffer_size)
    for i in range(int(FILE_SIZE / buffer_size)):
        file.write(data)

# read file once to try to force it into system cache as much as possible
from io import DEFAULT_BUFFER_SIZE
buffer_size = 10*DEFAULT_BUFFER_SIZE
buffer = bytearray(buffer_size)
with open(FILE_NAME, 'rb') as file:
    bytes_read = True
    while bytes_read:
        bytes_read = file.readinto(buffer)
    blk_size = file.raw._blksize

results.description += ' (file object blk_size = {})'.format(blk_size)

file = open(FILE_NAME, 'rb')

### Benchmarks

setup = \
"""
# random seek to avoid advantageous starting position biasing results
file.seek(randint(0, file.raw._blksize), 1)
"""

read_read = \
"""
file.read(chunk_size)
file.read(chunk_size)
"""

seek_seek = \
"""
file.seek(buffer_size, 1)
file.seek(buffer_size, 1)
"""

seek_read = \
"""
file.seek(buffer_size, 1)
file.read(chunk_size)
"""

read_read_timings = {}
seek_seek_timings = {}
seek_read_timings = {}
for chunk_size in chunk_sizes:
    read_read_timings[chunk_size] = []
    seek_seek_timings[chunk_size] = []
    seek_read_timings[chunk_size] = []

for j in range(r):
    #file.seek(0)
    for chunk_size in chunk_sizes:
        buffer = bytearray(chunk_size)
        read_read_timings[chunk_size].append(timeit(read_read, setup, number=reps, globals=globals()))
        #seek_seek_timings[chunk_size].append(timeit(seek_seek, setup, number=reps, globals=globals()))
        seek_read_timings[chunk_size].append(timeit(seek_read, setup, number=reps, globals=globals()))

for chunk_size in chunk_sizes:
    results['2 x buffered {} byte readinto bytearray'.format(chunk_size)] = read_read_timings[chunk_size]
    #results['2 x buffered {} byte seek'.format(chunk_size)] = seek_seek_timings[chunk_size]
    results['Buffered {} byte seek followed by {} byte readinto'.format(chunk_size, chunk_size)] = seek_read_timings[chunk_size]


### Cleanup
file.close()
os.unlink(FILE_NAME)

results.show()
results.save()

Edit 2020-02-24:

@finefoot asked for the _utils package so that the code above can be run.

from collections import OrderedDict
from math import ceil
from statistics import mean, stdev
from contextlib import suppress
import os
import inspect

class BenchmarkResults(OrderedDict):
    def __init__(self, *args, description='Benchmark Description', **kwArgs):
        self.description = description
        return super(BenchmarkResults, self).__init__(*args, **kwArgs)

    def __repr__(self):
        """Shows the results for the benchmarks in order of ascending duration"""
        characteristic_durations = []
        for name, timings in self.items():
            try:
                characteristic_durations.append(_robust_stats(timings)[0])
            except ValueError:
                if len(timings) > 1:
                    characteristic_durations.append(mean(timings))
                else:
                    characteristic_durations.append(timings[0])
        indx = _argsort(characteristic_durations)
        repr = '{}:\n'.format(self.description)
        items = list(self.items())
        for i in indx:
            name, timings = items[i]
            repr += '- {}:\n'.format(name)
            try:
                stats = _robust_stats(timings)
                repr += '      robust mean: {} +/- {}\n'.format(_units(stats[0]), _units(stats[1]))
            except ValueError:
                repr += '      timings: {}\n'.format(', '.join(map(_units, timings)))
            if len(timings) > 1:
                repr += '      min: {}\n'.format(_units(min(timings)))
        return repr

    def show(self):
        print(self)

    def save(self):
        caller = inspect.stack()[1]
        filename = os.path.splitext(caller.filename)[0] + '.log'
        with open(filename, 'w') as logfile:
            logfile.write(repr(self))


def _units(seconds, significant_figures=3):
    """Format a duration in seconds with a human-readable unit."""
    fmt = '{:.%sg} {}' % significant_figures
    # Check the largest units first so that every branch is reachable.
    if seconds >= 3600:
        return fmt.format(seconds/3600, 'hrs')
    elif seconds >= 60:
        return fmt.format(seconds/60, 'min')
    elif seconds >= 1:
        return fmt.format(seconds, 's')
    elif seconds >= 1e-3:
        return fmt.format(seconds*1e3, 'ms')
    elif seconds >= 1e-6:
        return fmt.format(seconds*1e6, 'µs')
    else:
        return fmt.format(seconds*1e9, 'ns')

def _robust_stats(timings, fraction_to_use=0.8):
    if len(timings) < 5:
        raise ValueError('To calculate a robust mean, you need at least 5 timing results')
    elts_to_prune = int(len(timings) * (1 - fraction_to_use))
    # prune at least the highest and the lowest result
    elts_to_prune = elts_to_prune if elts_to_prune > 2 else 2
    # round to even number --> symmetric pruning
    offset = ceil(elts_to_prune / 2)

    # sort the timings
    timings.sort()
    # prune the required fraction of the elements
    timings = timings[offset:-offset]
    return mean(timings), stdev(timings)

def _argsort(seq):
    # http://stackoverflow.com/questions/3071415/efficient-method-to-calculate-the-rank-vector-of-a-list-in-python
    return sorted(range(len(seq)), key=seq.__getitem__)

if __name__ == '__main__':
    pass

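【Comments】:

For small strings, the effect depends on the pointer size of the system and on the size of the Py_UNICODE/wchar_t type. python.org/dev/peps/pep-0393/#performance
@veganaiZe Can you elaborate on why? I did not fully understand your point. This question seems to get quite a bit of traffic, which suggests I am not the only one interested in your explanation.
That is quite a lot of code for a minimal example. Is the first part with the caching necessary to reproduce the problem? (I did not look closely. Also, _utils.BenchmarkResults is not part of the standard library. Which package is that?) Do you get the same results if you randomly shuffle the order of the tests?
@finefoot Sorry, no. Is the problem still present in current Python? For my original application I had to read the whole file anyway, so I just read larger chunks and indexed into them.
@finefoot Sorry, I overlooked that. I did manage to find it on my hard drive and appended it to the OP. If you run the benchmark, please let me know whether you can reproduce the effect. I am just curious whether this is still an issue.

Tags: python file io benchmarking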

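【Answer 1】:

I was able to reproduce the issue with your code. However, I noticed the following: can you verify that the issue disappears if you replace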

file.seek(randint(0, file.raw._blksize), 1)

with

file.seek(randint(0, file.raw._blksize), 0)

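in setup? I think you might run out of data at some point during the 1-byte reads. The 2-byte, 3-byte, etc. reads then have no data left to read, which is why they are so much faster.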
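To illustrate the difference between the two whence values, here is a minimal sketch, not the benchmark itself. It assumes a hypothetical 100-byte scratch file and a fixed step of 30 bytes; read_after_seeks is a made-up helper name. With whence=1 the position keeps drifting forward until it passes the end of the file, after which read() returns empty bytes; with whence=0 every read starts from the same absolute offset.

```python
import os
import tempfile

def read_after_seeks(path, step, whence, rounds, chunk_size):
    """Seek by `step` using `whence`, then read; repeat `rounds` times.

    Returns the length of each chunk that read() handed back.
    """
    lengths = []
    with open(path, 'rb') as f:
        for _ in range(rounds):
            f.seek(step, whence)
            lengths.append(len(f.read(chunk_size)))
    return lengths

# Hypothetical 100-byte scratch file, used only for this illustration.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, 'wb') as f:
    f.write(bytes(100))

# whence=1 (os.SEEK_CUR): offsets accumulate and eventually pass the end.
relative = read_after_seeks(path, 30, os.SEEK_CUR, rounds=5, chunk_size=2)
# whence=0 (os.SEEK_SET): every seek lands on the same absolute offset.
absolute = read_after_seeks(path, 30, os.SEEK_SET, rounds=5, chunk_size=2)
os.unlink(path)

print(relative)  # [2, 2, 2, 0, 0]
print(absolute)  # [2, 2, 2, 2, 2]
```

Reads that return nothing have no data to copy, which would be consistent with the suspicion above that the later chunk sizes were timed after the file had already been exhausted.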

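【Comments】:

I will have a look when I find the time. But I am somewhat skeptical of your analysis. If you were right, reading 1 byte at a time would be slowest, followed by 2 bytes, 3 bytes, 4 bytes, etc. But that is not what I see. On my machine, 4 bytes at a time is slowest, followed by 7 bytes, 2 bytes, 4 bytes, etc. How does that fit with your analysis?
If I change the whence argument to 0, all the small sizes 1, 2, 3, ... take roughly the same time. For 2401 and above, there is a significant increase. Regarding "If you were right, reading 1 byte at a time would be slowest, followed by 2 bytes, 3 bytes, 4 bytes": what I meant is that 1 byte is slow (because there is still data to read), and by around 2 bytes the seek has reached the end of the file, so 2, 3, 4, etc. only read zero bytes and therefore take the same time. (Which is what the log in your text above shows, right?)
Looking at it again, I believe you are right: my test program is flawed because it can run out of file to read prematurely. However, using whence=0 is not great either: 1. you may go back and "read" data that is already buffered, 2. you may still run out of file to read, because it sometimes seeks to the end of the file. Better might be file.seek(randint(0, file.raw._blksize/r/n/some_margin), 1). What do you think?
I really don't have much experience with buffered versus unbuffered IO behavior. I just stumbled upon your post, played with it a bit, and noticed the difference that changing whence makes. ;) As a first step, I would probably try to clean up and reduce the code further. You don't need more than the chunk sizes 0, 1, and 2 to show the problem. That should simplify things radically. Also, I noticed: if you swap 1 and 2 in chunk_sizes, so that it is now [0, 2, 1, ...], then 2 is the one that takes longer.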
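One way to make the idea of a bounded random seek concrete is sketched below. This is only an assumption about what such a bound could look like, not the formula from the comment above: max_random_skip is a hypothetical helper that takes the file size, the planned number of seeks, and the bytes read after each seek, and returns the largest random skip that cannot run past the end of the file even in the worst case.

```python
from random import randint

def max_random_skip(file_size, n_seeks, bytes_read_per_seek):
    """Largest per-seek random skip that cannot run past the end of the file.

    Worst case: every seek skips the maximum and every read consumes
    `bytes_read_per_seek`, so n_seeks * (skip + bytes_read_per_seek)
    must not exceed file_size.
    """
    budget = file_size - n_seeks * bytes_read_per_seek
    if budget < 0:
        raise ValueError('file too small for the planned reads')
    return budget // n_seeks

# Hypothetical numbers: 1000-byte file, 10 seeks, 20 bytes read after each.
limit = max_random_skip(1000, 10, 20)
print(limit)  # 80

# Inside a benchmark, the setup step could then use an integer bound:
skip = randint(0, limit)
```

Because the bound is an integer, it can be passed to randint directly, and with whence=1 the position never goes backwards into data that is already buffered.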

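【Answer 2】:

Reading a file handle byte by byte is generally slower than reading it in chunks.

Typically, each read() call in Python corresponds to a C read() call. The net result involves a system call requesting the next character. For a 2 KB file, this means about 2000 calls to the kernel, each requiring a function call, a request to the kernel, and then awaiting the response, which is passed back through the return.

The most notable thing here is awaiting the response: the system call blocks until your call is acknowledged in the queue, so you have to wait.

The fewer calls the better, so more bytes per call is faster; this is why buffered IO is in fairly common use.

In Python, buffering can be provided by io.BufferedReader or through the buffering keyword argument on open.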
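As a rough illustration of what the buffered layer does, the sketch below counts how often io.BufferedReader asks the underlying raw file for data. CountingFileIO and count_raw_reads are hypothetical helpers written for this example, and the 4096-byte scratch file and 8192-byte buffer are arbitrary assumptions.

```python
import io
import os
import tempfile

class CountingFileIO(io.FileIO):
    """Raw file that counts how often the buffered layer asks it for data.

    Hypothetical helper for this illustration, not part of the standard library.
    """
    raw_reads = 0

    def readinto(self, b):
        self.raw_reads += 1
        return super().readinto(b)

def count_raw_reads(path, n_reads, chunk_size, buffer_size=8192):
    """Do `n_reads` buffered reads of `chunk_size` bytes; return raw read count."""
    raw = CountingFileIO(path, 'r')
    with io.BufferedReader(raw, buffer_size=buffer_size) as f:
        for _ in range(n_reads):
            f.read(chunk_size)
    return raw.raw_reads

# Hypothetical 4096-byte scratch file, smaller than the buffer.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, 'wb') as f:
    f.write(bytes(4096))

reads_1 = count_raw_reads(path, 1000, 1)  # 1000 x read(1)
reads_2 = count_raw_reads(path, 1000, 2)  # 1000 x read(2)
os.unlink(path)

print(reads_1, reads_2)
```

With a buffered file object, the small reads are served from the in-memory buffer, so the number of raw reads should be the same for read(1) and read(2) and far smaller than the number of read() calls.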

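【Comments】:

Thanks, but I believe you are missing the point of my question. x calls of read(1) are (were?) much slower than x calls of e.g. read(2). The number of system calls is the same in both cases!
In the test, was the number of available bytes adjusted to fit that metric? Because if the bytes run out, there is a good chance the implementation returns 0 bytes immediately.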

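【Answer 3】:

I have seen something similar when working with an Arduino interfacing with an EEPROM. Basically, in order to write to or read from a chip or data structure, you have to send a write/read enable command, send the starting location, and then grab the first character. If you grab multiple bytes, however, most chips auto-increment their target address register. Starting a read/write operation therefore carries some overhead. The difference is between: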

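Start communication
Send read enable
Send read command
Send address 1
Get data from target 1
End communication
Start communication
Send read enable
Send read command
Send address 2
Get data from target 2
End communication

and

Start communication
Send read enable
Send read command
Send address 1
Get data from target 1
Get data from target 2
End communication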

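In terms of machine instructions, reading multiple bits/bytes at once eliminates a lot of overhead. It gets even worse when some chips require you to idle for a few clock cycles after sending the read/write enable, to let a mechanical process physically move the transistors into place to enable the read or write.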
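The overhead in the two listings above can be put into numbers with a small cost model. This is only a sketch that counts one step per listed line; the function names are made up for this example, and real chips will have different per-step costs.

```python
# Steps per transaction, counted from the two listings above.
SETUP_STEPS = 4     # start communication, read enable, read command, address
TEARDOWN_STEPS = 1  # end communication

def steps_one_byte_per_transaction(n_bytes):
    """Every byte pays the full setup and teardown cost."""
    return n_bytes * (SETUP_STEPS + 1 + TEARDOWN_STEPS)

def steps_sequential_read(n_bytes):
    """One setup and one teardown; the chip auto-increments the address."""
    if n_bytes == 0:
        return 0
    return SETUP_STEPS + n_bytes + TEARDOWN_STEPS

for n in (1, 2, 16):
    print(n, steps_one_byte_per_transaction(n), steps_sequential_read(n))
# 1 6 6
# 2 12 7
# 16 96 21
```

For two bytes this gives 12 steps versus 7, matching the listings, and the gap grows linearly with the number of bytes.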

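【Comments】:

The OP is on an Intel x86 CPU, which definitely has efficient byte load/store instructions (at least for cacheable memory regions). The "non-cachable" in the question is about the file cache, not about uncacheable memory; memory reads/writes will go to write-back cacheable memory regions. (Are there any modern/ancient CPUs / microcontrollers where a cached byte store is actually slower than a word store?: yes, some non-x86 CPUs have slightly slower byte stores, but x86 has no penalty.) But this question is about doing it in Python, so there is too much software involved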