Python IOError cannot allocate memory although there is plenty

【Question title】: Python IOError cannot allocate memory although there is plenty
【Posted】: 2013-07-04 13:39:31
【Question】:

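I wrote a basic program that walks a directory tree containing many JPEG files (500,000+), verifies that they are not corrupted (about 3-5% of the files seem to be corrupted in some way), then takes the sha1sum of each file (even the corrupted ones) and saves the information to a database.

The JPEG files in question are on a Windows system and are mounted on the Linux box via CIFS. They are mostly about 4 MB in size, although some may be slightly larger or smaller.

When I run the program it seems to work fairly well for a while, and then it crashes with the error below. This is after it has processed about 1,100 files (the error indicates a problem while trying to open a 4.5 MB file).

Now, I understand that I can catch this error and continue or retry, but I am curious why it happens in the first place, and whether catching and retrying will actually solve the problem, or whether it will just get stuck retrying (unless of course I limit the retries, but then a file gets skipped).

I am running this with "Python 2.7.5+" on a Debian system. The system has at least 4 GB (possibly 8 GB) of RAM, and top reports that the script uses less than 1% of the RAM and less than 3% of the CPU at any time while running. Likewise, jpeginfo, which this script runs, uses equally small amounts of memory and CPU.

To avoid using too much memory when reading the files, I took the approach given in this answer to another question: https://stackoverflow.com/a/1131255/289545

You may also notice that the "jpeginfo" command sits in a while loop waiting for an "[OK]" response. This is because if "jpeginfo" thinks it cannot find the file, it returns 0, so the subprocess.check_output call does not treat it as an error state.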
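
The while loop described above never gives up, so a file that never reports "[OK]" would stall the scan. A minimal sketch of a bounded variant follows. The function name check_jpeg and the parameters attempts, delay and run are illustrative assumptions, not part of the original script; run exists only so the command can be swapped out for testing.

```python
import subprocess
import time


def check_jpeg(path, attempts=5, delay=0.5, run=subprocess.check_output):
    """Run `jpeginfo -c` on path, retrying a limited number of times.

    Returns the last output seen. The caller decides what to do if it
    still lacks "[OK]" after all attempts. A non-zero exit status still
    raises subprocess.CalledProcessError, as in the original script.
    """
    output = ""
    for attempt in range(attempts):
        if attempt:
            # Wait between attempts, but not before the first one.
            time.sleep(delay)
        # universal_newlines=True makes check_output return text, not bytes.
        output = run(["jpeginfo", "-c", path], universal_newlines=True).strip()
        if "[OK]" in output:
            break
    return output
```

The caller can then test `"[OK]" in check_jpeg(filepath)` and record the file as a problem instead of looping forever.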

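I did wonder whether the fact that jpeginfo seems unable to find some files on the first attempt might be related (I suspect it is), but the error returned is "Cannot allocate memory" rather than "file not found".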

Error:

Traceback (most recent call last):
  File "/home/m3z/jpeg_tester", line 95, in <module>
    main()
  File "/home/m3z/jpeg_tester", line 32, in __init__
    self.recurse(self.args.dir, self.scan)
  File "/home/m3z/jpeg_tester", line 87, in recurse
    cmd(os.path.join(root, name))
  File "/home/m3z/jpeg_tester", line 69, in scan
    with open(filepath) as f:
IOError: [Errno 12] Cannot allocate memory: '/path/to/file name.jpg'

Full program code:

#!/usr/bin/env python

import os
import time
import subprocess
import argparse
import hashlib
import oursql as sql

# Python 2 script (print statements)

class main:
    def __init__(self):
        parser = argparse.ArgumentParser(description='Check jpeg files in a given directory for errors')
        parser.add_argument('dir', action='store', help="absolute path to the directory to check")
        parser.add_argument('-r', '--recurse', dest="recurse", action='store_true', help="should we check subdirectories")
        parser.add_argument('-s', '--scan', dest="scan", action='store_true', help="initiate scan?")
        parser.add_argument('-i', '--index', dest="index", action='store_true', help="should we index the files?")

        self.args = parser.parse_args()
        self.results = []

        if not self.args.dir.startswith("/"):
            print "dir must be absolute"
            quit()

        if self.args.index:
            self.db = sql.connect(host="localhost",user="...",passwd="...",db="fileindex")
            self.cursor = self.db.cursor()

        if self.args.recurse:
            self.recurse(self.args.dir, self.scan)
        else:
            self.scan(self.args.dir)
        # self.db only exists when --index was given
        if self.args.index:
            self.db.close()

        for line in self.results:
            print line



    def scan(self, dirpath):
        print "Scanning %s" % (dirpath)
        filelist = os.listdir(dirpath)
        filelist.sort()
        total = len(filelist)
        index = 0
        for filen in filelist:
            if filen.lower().endswith((".jpg", ".jpeg")):
                filepath = os.path.join(dirpath, filen)
                index = index+1
                if self.args.scan:
                    try:
                        procresult = subprocess.check_output(['jpeginfo','-c',filepath]).strip()
                        while "[OK]" not in procresult:
                            time.sleep(0.5)
                            print "\tRetrying %s" % (filepath)
                            procresult = subprocess.check_output(['jpeginfo','-c',filepath]).strip()
                        print "%s/%s: %s" % ('{:>5}'.format(str(index)),total,procresult)
                    except subprocess.CalledProcessError as e:
                        os.renames(filepath, os.path.join(dirpath, "dodgy",filen))
                        filepath = os.path.join(dirpath, "dodgy", filen)
                        self.results.append("Trouble with: %s" % (filepath))
                        print "%s/%s: %s" % ('{:>5}'.format(str(index)),total,e.output.strip())
                if self.args.index:
                    sha1 = hashlib.sha1()
                    with open(filepath) as f:
                        while True:
                            data = f.read(8192)
                            if not data:
                                break
                            sha1.update(data)
                    sqlcmd = ("INSERT INTO `index` (`sha1`,`path`,`filename`) VALUES (?, ?, ?);", (buffer(sha1.digest()), dirpath, filen))
                    self.cursor.execute(*sqlcmd)


    def recurse(self, dirpath, cmd, on_files=False):
        for root, dirs, files in os.walk(dirpath):
            if on_files:
                for name in files:
                    cmd(os.path.join(root, name))
            else:
                cmd(root)
                for name in dirs:  # os.walk yields these as root later too, so they are scanned twice
                    cmd(os.path.join(root, name))






if __name__ == "__main__":
    main()

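【Comments on the question】:

Your program still has plenty of memory, but it may have run out of some other resource. File descriptors, perhaps? If you comment out the subprocess call, do you still get the exception?
Don't you need to close the file opened with with open(filepath) as f by calling f.close()? Forgive me, I am new to Python too.
@shahkalpesh: No, with takes care of that as soon as you leave the block.
@RickyA: Thanks. So it is like a using block in C#. :)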
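
One way to check the file-descriptor theory from the first comment is to log how many descriptors the process holds while it runs. The sketch below is Linux-only and relies on /proc/self/fd; the function name fd_usage is an illustrative assumption. Note that running out of descriptors would normally surface as EMFILE ("Too many open files") rather than ENOMEM, so this mainly helps rule the theory out.

```python
import os
import resource


def fd_usage():
    """Return (open_fds, soft_limit) for the current process on Linux."""
    # /proc/self/fd holds one entry per open descriptor. listdir briefly
    # opens a descriptor of its own, so the count may be one too high.
    open_fds = len(os.listdir("/proc/self/fd"))
    soft_limit, _hard_limit = resource.getrlimit(resource.RLIMIT_NOFILE)
    return open_fds, soft_limit
```

Calling fd_usage() every few hundred files and printing the result would show whether the count climbs steadily towards the limit.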

Tags: python ioerror

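【Solution 1】:

It looks to me as though Python is just passing on the error from the underlying open() call, and the real culprit here is the Linux CIFS support. I doubt Python would synthesize ENOMEM unless system memory were truly exhausted (and probably even then I would expect the Linux OOM killer to be invoked rather than getting ENOMEM).

Unfortunately, it probably needs a Linux filesystem expert to work out what is going on there, but looking at the sources for CIFS in the Linux kernel, I can see various places where ENOMEM is returned when various kernel-specific resources are exhausted, as opposed to total system memory, but I am not familiar enough with the code to say how likely any of them are.

To rule out anything Python-specific, you can run the process under strace so that you can see the exact return code Python gets from Linux. To do this, run a command like this: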

strace -eopen -f python myscript.py myarg1 myarg2 2>strace.log

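-f follows child processes (i.e. the jpeginfo commands you run), and -eopen shows only open() calls rather than all system calls (which is the strace default). This can generate a fair amount of output, which is why I redirected it to a file in the example above, but you can have it displayed on the terminal if you prefer.

I would expect you to see something like this just before you get the exception: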

open("/path/to/file name.jpg", O_RDONLY) = -1 ENOMEM (Cannot allocate memory)

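If so, this error is coming directly from the filesystem open() call, and there is very little you can do about it in your Python script. You could catch the exception and retry (perhaps after a short delay), as you already do if jpeginfo fails, but without knowing what causes the error in the first place it is hard to say how successful that strategy would be.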
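
A minimal sketch of the catch-and-retry idea, with the number of attempts capped so the scan cannot get stuck on one file. The function name sha1_with_retry and its parameters are illustrative assumptions. Only ENOMEM is retried; any other error, or ENOMEM on the last attempt, is re-raised for the caller to handle. Whether a retry actually succeeds depends on what the CIFS layer is doing, which this code cannot know.

```python
import errno
import hashlib
import time


def sha1_with_retry(path, attempts=5, delay=0.5, chunk_size=8192):
    """Return the SHA-1 hex digest of path, retrying when open/read hits ENOMEM."""
    for attempt in range(1, attempts + 1):
        try:
            # Start a fresh hash on every attempt so a failure partway
            # through a read cannot leave a partial digest behind.
            sha1 = hashlib.sha1()
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(chunk_size), b""):
                    sha1.update(chunk)
            return sha1.hexdigest()
        except EnvironmentError as e:  # covers IOError on 2.7, OSError on 3
            if e.errno != errno.ENOMEM or attempt == attempts:
                raise
            time.sleep(delay)
```

Catching the re-raised exception in the caller and logging the path gives a list of files to revisit, rather than silently skipping them.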

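Of course, you could copy the files locally, but with so many files that sounds like it would be a pain.

Edit: By the way, you will see lots of open() calls that are unrelated to your script, because strace traces every call made by Python, which includes it opening its own .py and .pyc files. Just ignore the ones that do not refer to the files you are interested in.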
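
Since the log can get large, a short script can pick out only the failed opens of the image files. This is a sketch that assumes the log lines look like the example above. The regular expression also accepts openat(), on the assumption that more recent glibc versions issue that call instead (in which case -eopen alone would not trace it). The names FAILED_OPEN and failed_opens are illustrative.

```python
import re

# Matches failed open()/openat() calls as strace prints them, e.g.
#   open("/path/to/file name.jpg", O_RDONLY) = -1 ENOMEM (Cannot allocate memory)
FAILED_OPEN = re.compile(
    r'\bopen(?:at)?\(.*?"(?P<path>[^"]*)".*\)\s+=\s+-1\s+(?P<errno>E[A-Z0-9]+)'
)


def failed_opens(lines, suffixes=(".jpg", ".jpeg")):
    """Yield (path, errno_name) for failed opens of files with the given suffixes."""
    for line in lines:
        match = FAILED_OPEN.search(line)
        if match and match.group("path").lower().endswith(suffixes):
            yield match.group("path"), match.group("errno")
```

Passing the open strace.log file object as lines and printing each pair yields one line per failure, which makes it easy to see whether every failure is ENOMEM.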

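【Comments】:

Wow, thanks for the detailed answer. I will give it a try.
I have done as you suggested, but the strace log file is huge and I have not had a chance to go through it yet. I have also rewritten my program to retry a few times with a delay. The problems seem to resolve themselves after a half-second delay. Thanks
Having looked through the strace output, the filesystem is returning ENOMEM as you expected. Thanks
You're welcome. Sorry the problem didn't turn out to have a simple solution! If you cannot resolve the issue with CIFS, then perhaps you could copy the files across to the Linux box one at a time to check them, or perhaps in batches, but that would make things considerably slower.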
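
The batch idea from the last comment could look roughly like the sketch below. The name process_in_batches, the batch_size default and the process callback signature are illustrative assumptions. The copy itself still has to open each file over CIFS, so it can hit the same ENOMEM and may need the same retry handling.

```python
import os
import shutil
import tempfile


def process_in_batches(paths, process, batch_size=100):
    """Copy files to local scratch space in batches, then process the copies.

    `process` is called as process(local_copy, original_path). Each batch
    gets its own scratch directory, removed before the next batch starts.
    """
    paths = list(paths)
    for start in range(0, len(paths), batch_size):
        scratch = tempfile.mkdtemp(prefix="jpeg_batch_")
        try:
            copies = []
            for number, original in enumerate(paths[start:start + batch_size]):
                # Prefix with a counter so equal file names from
                # different folders do not overwrite each other.
                name = "%d_%s" % (number, os.path.basename(original))
                local = os.path.join(scratch, name)
                shutil.copy2(original, local)
                copies.append((local, original))
            for local, original in copies:
                process(local, original)
        finally:
            shutil.rmtree(scratch, ignore_errors=True)
```

With this shape, jpeginfo and the hashing both run against local disk, and only the copy step touches the network mount.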