How to read a single character at a time from a file in Python? Answers

【Question title】: How to read a single character at a time from a file in Python?
【Posted】: 2011-02-28 14:44:44
【Question】:

Can anyone tell me how to do this?

【Comments】:

Tags: python file-io character

【Answer 1】:

with open(filename) as f:
    while True:
        c = f.read(1)  # in text mode this returns one character, or '' at EOF
        if not c:
            print("End of file")
            break
        print("Read a character:", c)

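【Comments】:

Since this reads one byte at a time, won't it fail for non-ASCII encodings?
The question and the answer conflate characters and bytes. If the file uses an encoding with a single byte per character (ASCII and many others), then yes, reading one-byte chunks reads one character at a time. If the encoding needs more than one byte per character, you are reading a single byte, not a single character.
Exactly. That is why I often do result = open(filename).read() and then go through result character by character.
In reply to David Chouinard's question: this snippet works correctly in Python 3 with a UTF-8 encoded file. If you have, say, a Windows-1250 encoded file, just change the first line to with open(filename, encoding='Windows-1250') as f:
Adding to SergO: open(filename, "r") and open(filename, "rb") can result in a different number of iterations (at least in Python 3). In "r" mode a single read(1) may consume several bytes to produce c when it encounters a multi-byte character.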
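
To make the last two comments concrete, here is a minimal sketch. The temporary file, the sample string and the helper name count_reads are made up for this illustration; it only assumes Python 3 and a UTF-8 encoded file. It counts how many read(1) calls return data in text mode versus binary mode:

```python
import os
import tempfile

# Hypothetical sample text: 'h', 'é' (2 bytes in UTF-8), 'j' -> 3 characters, 4 bytes.
SAMPLE = "h\u00e9j"

def count_reads(path, mode, **kwargs):
    """Count how many read(1) calls return data before EOF."""
    n = 0
    with open(path, mode, **kwargs) as f:
        while f.read(1):
            n += 1
    return n

# Write the sample to a throwaway file so the example is self-contained.
with tempfile.NamedTemporaryFile("w", encoding="utf-8", suffix=".txt", delete=False) as tmp:
    tmp.write(SAMPLE)
    path = tmp.name

try:
    text_reads = count_reads(path, "r", encoding="utf-8")  # one character per call
    byte_reads = count_reads(path, "rb")                   # one byte per call
finally:
    os.remove(path)

print(text_reads, byte_reads)  # 3 4
```

So in text mode with the right encoding, read(1) returns whole characters; in binary mode it returns single bytes, and a multi-byte character is split across several reads.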

【Answer 2】:

First, open a file:

with open("filename") as fileobj:
    for line in fileobj:
        for ch in line:
            print(ch)

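This iterates over every line in the file, and then over every character in that line.

【Comments】:

Agreed, this seems like the more Pythonic way. Wouldn't this also handle non-ASCII encodings?
One reason you might read a file one character at a time is that the file is too large to fit in memory. But the answer above assumes that each line fits in memory.
Edited to match Python 3.
The OP never said the whole file is to be read one character at a time, so this approach is suboptimal: the entire file could consist of a single line, in which case the whole line has to be read before any character can be processed. In those cases it is better to do partial reads with f.read(1).
-1. Seconding @CS's comment. The OP asked how to read "one character at a time", so this does not answer the question. It is no simpler than the accepted answer, and it is better to have a function that does not occasionally crash your script/application for no good reason. What if the file is an SQL INSERT for an entire table? Or uses non-native newlines? At best the buffering is inefficient; at worst you run out of memory.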

【Answer 3】:

I like the accepted answer: it is straightforward and gets the job done. I would also like to offer an alternative implementation:

def chunks(filename, buffer_size=4096):
    """Reads `filename` in chunks of `buffer_size` bytes and yields each chunk
    until no more characters can be read; the last chunk will most likely have
    less than `buffer_size` bytes.

    :param str filename: Path to the file
    :param int buffer_size: Buffer size, in bytes (default is 4096)
    :return: Yields chunks of `buffer_size` size until exhausting the file
    :rtype: str

    """
    with open(filename, "rb") as fp:
        chunk = fp.read(buffer_size)
        while chunk:
            yield chunk
            chunk = fp.read(buffer_size)

def chars(filename, buffersize=4096):
    """Yields the contents of file `filename` character-by-character. Warning:
    will only work for encodings where one character is encoded as one byte.

    :param str filename: Path to the file
    :param int buffersize: Buffer size for the underlying chunks,
    in bytes (default is 4096)
    :return: Yields the contents of `filename` character-by-character.
    :rtype: bytes

    """
    for chunk in chunks(filename, buffersize):
        # Slice instead of iterating: iterating over `bytes` in Python 3
        # yields ints, which `fp.write` in `main` would reject.
        for i in range(len(chunk)):
            yield chunk[i:i + 1]

def main(buffersize, filenames):
    """Reads several files character by character and redirects their contents
    to `/dev/null`.

    """
    for filename in filenames:
        with open("/dev/null", "wb") as fp:
            for char in chars(filename, buffersize):
                fp.write(char)

if __name__ == "__main__":
    # Try reading several files varying the buffer size
    import sys
    buffersize = int(sys.argv[1])
    filenames  = sys.argv[2:]
    sys.exit(main(buffersize, filenames))

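The code I suggest is essentially the same idea as the accepted answer: read a given number of bytes from the file. The difference is that it first reads a good-sized chunk of data (4096 is a good default for x86, but you may want to try 1024 or 8192; any multiple of your page size), and then yields the characters in that chunk one by one.

For larger files, the code I provide may be faster. Take the entire text of War and Peace, by Tolstoy, as an example. These are my timing results (MacBook Pro running OS X 10.7.4; so.py is the name I gave to the code I pasted):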

$ time python so.py 1 2600.txt.utf-8
python so.py 1 2600.txt.utf-8  3.79s user 0.01s system 99% cpu 3.808 total
$ time python so.py 4096 2600.txt.utf-8
python so.py 4096 2600.txt.utf-8  1.31s user 0.01s system 99% cpu 1.318 total

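Now: do not take the buffer size of 4096 as a universal truth; look at the results I get for different sizes (buffer size (bytes) vs wall time (sec)):

As you can see, you can start seeing gains earlier (and my timings are likely very inaccurate); the buffer size is a trade-off between performance and memory. The default of 4096 is just a reasonable choice but, as always, measure first.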
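
If you want to measure this yourself, something along these lines might do as a starting point. It is only a rough sketch, not the benchmark used for the numbers above: the 1 MiB temporary file, the list of buffer sizes and the helper read_in_chunks (which just counts bytes instead of yielding characters) are all assumptions made for the example.

```python
import os
import tempfile
import timeit

def read_in_chunks(path, buffer_size):
    """Read the file in `buffer_size`-byte chunks and return the total bytes seen."""
    total = 0
    with open(path, "rb") as fp:
        chunk = fp.read(buffer_size)
        while chunk:
            total += len(chunk)
            chunk = fp.read(buffer_size)
    return total

# Hypothetical test data: 1 MiB of filler bytes in a temporary file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"x" * (1 << 20))
    path = tmp.name

try:
    timings = {}
    for size in (1, 64, 1024, 4096, 65536):
        # One run per size; wall time in seconds.
        timings[size] = timeit.timeit(lambda: read_in_chunks(path, size), number=1)
        print(f"{size:>6} bytes: {timings[size]:.4f} s")
finally:
    os.remove(path)
```

The absolute numbers depend heavily on the machine, the file system cache and the Python version, so compare sizes against each other on your own data rather than trusting any single run.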

【Comments】:

【Answer 4】:

Simply:

myfile = open(filename)
onecharacter = myfile.read(1)

【Comments】:

【Answer 5】:

Python itself can help you here, in interactive mode:

>>> help(file.read)
Help on method_descriptor:

read(...)
    read([size]) -> read at most size bytes, returned as a string.

    If the size argument is negative or omitted, read until EOF is reached.
    Notice that when in non-blocking mode, less data than what was requested
    may be returned, even if no size parameter was given.

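【Comments】:

I agree with the sentiment, but maybe this would be better as a comment to the OP?
That may be so, but I think all that text would look messy in a comment.

【Answer 6】:

Today, while watching Raymond Hettinger's Transforming Code into Beautiful, Idiomatic Python, I learned a new idiom: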

import functools

with open(filename) as f:
    f_read_ch = functools.partial(f.read, 1)
    for ch in iter(f_read_ch, ''):
        print('Read a character:', repr(ch))
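
For anyone unfamiliar with the two-argument form of iter(): it calls the given callable repeatedly and stops as soon as the callable returns the sentinel. A small self-contained sketch of the same idiom, where io.StringIO is my stand-in for an open text file (not part of the original answer):

```python
import functools
import io

# io.StringIO stands in for an open text file in this sketch.
f = io.StringIO("abc")

# iter(callable, sentinel) calls the callable repeatedly and stops
# as soon as it returns the sentinel ('' signals EOF in text mode).
chars = list(iter(functools.partial(f.read, 1), ""))
print(chars)  # ['a', 'b', 'c']
```

For a file opened in binary mode the sentinel would have to be b'' instead of '', since that is what read(1) returns at EOF there.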

【Comments】:

【Answer 7】:

Read just one character:

f.read(1)

【Comments】:

【Answer 8】:

This works too:

with open("filename") as fileObj:
    for line in fileObj:  
        for ch in line:
            print(ch)

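It iterates over every line in the file and every character in each line.

(Note that this post now looks very similar to a highly upvoted answer, but that was not the case when it was written.)

【Comments】:

-1. This is a bad general-purpose approach because it loads potentially huge lines into memory. Also, it is no simpler than the accepted answer. What if the file is a 100-billion-long nucleotide sequence (ATGC)? Or an SQL INSERT for an entire table? Or uses non-native newlines? At best the buffering is inefficient; at worst you run out of memory.
Very true; this is not efficient. But for Python beginners it is a simple for-loop approach that usually makes sense right away.

【Answer 9】:

Best answer for Python 3.8+: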

with open(path, encoding="utf-8") as f:
    while c := f.read(1):
        do_my_thing(c)

You may want to specify utf-8 and avoid the platform encoding. I have chosen to do that here.

Function - Python 3.8+:

def stream_file_chars(path: str):
    with open(path, encoding="utf-8") as f:
        while c := f.read(1):
            yield c

Function - Python < 3.8:

def stream_file_chars(path: str):
    with open(path, encoding="utf-8") as f:
        while True:
            c = f.read(1)
            if c == "":
                break
            yield c

Function - pathlib + documentation:

from pathlib import Path
from typing import Union, Generator

def stream_file_chars(path: Union[str, Path]) -> Generator[str, None, None]:
    """Streams characters from a file."""
    with Path(path).open(encoding="utf-8") as f:
        while (c := f.read(1)) != "":
            yield c
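
A possible way to use such a generator, as a sketch only: the temporary file and its contents are invented for the example, and I have annotated the return type as Iterator[str], which is a shorter equivalent for a generator that is only iterated over.

```python
import os
import tempfile
from collections import Counter
from pathlib import Path
from typing import Iterator, Union

def stream_file_chars(path: Union[str, Path]) -> Iterator[str]:
    """Stream characters from a UTF-8 text file, one at a time."""
    with Path(path).open(encoding="utf-8") as f:
        while (c := f.read(1)) != "":
            yield c

# Hypothetical sample file, created only for this example.
with tempfile.NamedTemporaryFile("w", encoding="utf-8", delete=False) as tmp:
    tmp.write("banana")
    sample_path = tmp.name

try:
    # Any consumer of an iterable works; here we count character frequencies.
    counts = Counter(stream_file_chars(sample_path))
finally:
    os.remove(sample_path)

print(counts.most_common(1))  # [('a', 3)]
```

Because the function is a generator, the file stays open only while characters are being consumed, and nothing larger than one character is held by your own code at a time.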

【Comments】:

【Answer 10】:

You should try f.read(1); it is definitely correct and the right thing to do.

【Comments】:

【Answer 11】:

f = open('hi.txt', 'w')
f.write('0123456789abcdef')
f.close()
f = open('hi.txt', 'r')  # reopen the file that was just written
f.seek(12)
print(f.read(1))  # This will read just "c"
f.close()

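【Comments】:

Welcome to Stack Overflow! You should elaborate: why is this an answer?

【Answer 12】:

As a supplement: if you are reading a file that contains very long lines, which might exhaust your memory, you can consider reading it into a buffer and then yielding each character.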

def read_char(inputfile, buffersize=10240):
    with open(inputfile, 'r') as f:
        while True:
            buf = f.read(buffersize)
            if not buf:
                break
            for char in buf:
                yield char
        yield '' #handle the scene that the file is empty

if __name__ == "__main__":
    for char in read_char('./very_large_file.txt'):
        process(char)  # process() stands in for your own per-character handler

【Comments】:

【Answer 13】:

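import os
import sys

# Unix only: put the terminal in non-canonical mode with echo off,
# so each keypress is delivered immediately instead of line by line.
os.system("stty -icanon -echo")
while True:
    raw_c = sys.stdin.buffer.peek()
    c = sys.stdin.read(1)
    print(f"Char: {c}")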

【Comments】:

【Answer 14】:

#reading out the file at once in a list and then printing one-by-one
f=open('file.txt')
for i in list(f.read()):
    print(i)

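【Comments】:

While this may answer the author's question, it lacks explanatory text and links to documentation. Raw code snippets are not very helpful without some words around them. You may also find how to write a good answer very helpful. Please edit your answer.
You don't need the cast to list.
-1. Casting to a list needlessly loads the entire content into memory, which can lead to OOM and/or inefficient buffering. The OP asked how to read "one character at a time", so this does not answer the question.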