Writing huge strings in Python: answers

【Question title】: Writing huge strings in python
【Posted】: 2015-04-09 19:52:47
【Question】:

I have a very long string, almost a megabyte long, that I need to write to a text file. The regular

file = open("file.txt","w")
file.write(string)
file.close()

works, but it is too slow. Is there any way I can write it faster?

I am trying to write a number with millions of digits, on the order of math.factorial(67867957), to a text file.

This is what the profiler shows:

    203 function calls (198 primitive calls) in 0.001 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.000    0.000 <string>:1(<module>)
        1    0.000    0.000    0.000    0.000 re.py:217(compile)
        1    0.000    0.000    0.000    0.000 re.py:273(_compile)
        1    0.000    0.000    0.000    0.000 sre_compile.py:172(_compile_charset)
        1    0.000    0.000    0.000    0.000 sre_compile.py:201(_optimize_charset)
        4    0.000    0.000    0.000    0.000 sre_compile.py:25(_identityfunction)
      3/1    0.000    0.000    0.000    0.000 sre_compile.py:33(_compile)
        1    0.000    0.000    0.000    0.000 sre_compile.py:341(_compile_info)
        2    0.000    0.000    0.000    0.000 sre_compile.py:442(isstring)
        1    0.000    0.000    0.000    0.000 sre_compile.py:445(_code)
        1    0.000    0.000    0.000    0.000 sre_compile.py:460(compile)
        5    0.000    0.000    0.000    0.000 sre_parse.py:126(__len__)
       12    0.000    0.000    0.000    0.000 sre_parse.py:130(__getitem__)
        7    0.000    0.000    0.000    0.000 sre_parse.py:138(append)
      3/1    0.000    0.000    0.000    0.000 sre_parse.py:140(getwidth)
        1    0.000    0.000    0.000    0.000 sre_parse.py:178(__init__)
       10    0.000    0.000    0.000    0.000 sre_parse.py:183(__next)
        2    0.000    0.000    0.000    0.000 sre_parse.py:202(match)
        8    0.000    0.000    0.000    0.000 sre_parse.py:208(get)
        1    0.000    0.000    0.000    0.000 sre_parse.py:351(_parse_sub)
        2    0.000    0.000    0.000    0.000 sre_parse.py:429(_parse)
        1    0.000    0.000    0.000    0.000 sre_parse.py:67(__init__)
        1    0.000    0.000    0.000    0.000 sre_parse.py:726(fix_flags)
        1    0.000    0.000    0.000    0.000 sre_parse.py:738(parse)
        3    0.000    0.000    0.000    0.000 sre_parse.py:90(__init__)
        1    0.000    0.000    0.000    0.000 {built-in method compile}
        1    0.001    0.001    0.001    0.001 {built-in method exec}
       17    0.000    0.000    0.000    0.000 {built-in method isinstance}
    39/38    0.000    0.000    0.000    0.000 {built-in method len}
        2    0.000    0.000    0.000    0.000 {built-in method max}
        8    0.000    0.000    0.000    0.000 {built-in method min}
        6    0.000    0.000    0.000    0.000 {built-in method ord}
       48    0.000    0.000    0.000    0.000 {method 'append' of 'list' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        5    0.000    0.000    0.000    0.000 {method 'find' of 'bytearray' objects}
        1    0.000    0.000    0.000    0.000 {method 'items' of 'dict' objects}

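【Comments】:

A megabyte is not "huge". Are you sure your disk can write faster than Python does? Can you provide a self-contained benchmark, e.g. how long does python3 -c 'open("file", "w").write("a"*1000000)' take on your machine? What time do you expect?
LOL, writing a 1 MB file cannot possibly take hours... it should take a few seconds at most (and that is generous)... as @JFSebastian mentioned, please use some simple...
What does your profiling show? docs.python.org/2/library/profile.html
/usr/bin/time python -c'import gmpy2; open("/tmp/file", "w").write(str(gmpy2.fac(67867957)))' takes less than 10 minutes on my machine. /tmp/file contains 500M digits.
As @J.F.Sebastian answered, the underlying problem is that str(long) has quadratic running time. I am biased because I maintain gmpy2, but if you plan to work with numbers this large, you really should use gmpy2. BTW, the current development version (2.1.x) includes the primorial function. gmpy2.primorial(67867957) takes about 3.5 seconds.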

Tags: python performance python-3.x file-io

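【Solution 1】:

Your problem is that str(long) is very slow for large integers (millions of digits) in Python. It is a quadratic operation (in the number of digits), i.e., for ~1e8 digits it may take ~1e16 operations to convert the integer to a decimal string.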
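
A rough way to see this growth for yourself is to time str() on integers whose digit counts double. This is only an illustrative sketch and not part of the original answer: time_str_conversion is a hypothetical helper, the sizes are arbitrary, and the timings depend on the machine and interpreter. Newer CPython releases may switch to a faster algorithm for very large integers, so the ratio between consecutive rows need not be exactly 4.

```python
import sys
import time

# Python 3.11+ refuses int -> str conversions above 4300 digits by default;
# passing 0 removes the limit. Older versions do not have this function.
if hasattr(sys, "set_int_max_str_digits"):
    sys.set_int_max_str_digits(0)


def time_str_conversion(digit_counts):
    """Return (digits, seconds) pairs for str() on integers of each size."""
    results = []
    for digits in digit_counts:
        n = 10 ** digits - 1  # an integer with exactly `digits` decimal digits
        start = time.perf_counter()
        s = str(n)
        elapsed = time.perf_counter() - start
        assert len(s) == digits
        results.append((digits, elapsed))
    return results


if __name__ == "__main__":
    # Each row doubles the digit count of the previous one.
    for digits, seconds in time_str_conversion([20000, 40000, 80000, 160000]):
        print(digits, "digits:", round(seconds, 4), "s")
```

If the conversion is quadratic, doubling the digit count should roughly quadruple the time, which is why a 500M-digit number is out of reach for plain str().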

Writing a 500 MB file should not take hours, e.g.:

$ python3 -c 'open("file", "w").write("a"*500*1000000)'

returns almost immediately. ls -l file confirms that the file has been created and has the expected size.

Computing math.factorial(67867957) (the result has about 500M digits) may take several hours, but saving it with pickle is instantaneous:

import math
import pickle

n = math.factorial(67867957) # takes a long time
with open("file.pickle", "wb") as file:
    pickle.dump(n, file) # very fast (comparatively)

Loading it back with n = pickle.load(open('file.pickle', 'rb')) takes less than a second.
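
The dump/load round trip can be tried on a much smaller number first. The sketch below is an illustration rather than the answer's own code: pickle_roundtrip is a hypothetical helper, the temporary file stands in for file.pickle, and math.factorial(20000) is a small stand-in for the real value.

```python
import math
import os
import pickle
import tempfile


def pickle_roundtrip(n):
    """Dump integer n to a temporary pickle file, load it back, return the copy."""
    fd, path = tempfile.mkstemp(suffix=".pickle")
    os.close(fd)  # reopen by name below so the code mirrors the answer's open() calls
    try:
        with open(path, "wb") as f:
            pickle.dump(n, f)
        with open(path, "rb") as f:
            return pickle.load(f)
    finally:
        os.remove(path)


if __name__ == "__main__":
    n = math.factorial(20000)  # small stand-in for math.factorial(67867957)
    assert pickle_roundtrip(n) == n
    print("round trip OK,", n.bit_length(), "bits")
```

Pickle stores the integer in binary form, so no decimal conversion happens on either side, which is what makes it fast compared with str(n).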

str(n) is still running on my machine (after 50 hours).

To get the decimal representation quickly, you can use gmpy2:

$ python -c'import gmpy2;open("file.gmpy2", "w").write(str(gmpy2.fac(67867957)))'

It takes less than 10 minutes on my machine.
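
As a sanity check on the roughly 500M-digit figure, the digit count of n! can be estimated without computing the factorial at all, using log10(n!) = lgamma(n + 1) / ln(10). This sketch is not from the original answer; factorial_digit_count is a hypothetical helper name, and floating-point rounding could in principle make the result off by one.

```python
import math


def factorial_digit_count(n):
    """Estimate the number of decimal digits of n! without computing n!.

    Uses log10(n!) = lgamma(n + 1) / ln(10). Floating-point rounding could
    make the result off by one when log10(n!) is extremely close to an integer.
    """
    return math.floor(math.lgamma(n + 1) / math.log(10)) + 1


if __name__ == "__main__":
    print(factorial_digit_count(200000))    # the number used in the timing test
    print(factorial_digit_count(67867957))  # the number from the question
```

A decimal digit takes one byte in a plain text file, so the digit count is also a direct estimate of the output file size.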

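【Comments】:

【Solution 2】:

OK, this is not really an answer, more a demonstration that your reasoning about the delay is wrong.

First, test the write speed for a large string: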

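import timeit

def write_big_str(n_bytes=1000000):
    # Binary mode needs bytes on Python 3, hence b"a" rather than "a".
    with open("test_file.txt", "wb") as f:
        f.write(b"a" * n_bytes)

# Total time for 100 writes of a 1 MB file.
print(timeit.timeit("write_big_str()", "from __main__ import write_big_str", number=100))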

You should see quite respectable speed (and that is for 100 repetitions).

Next, let's see how long it takes to convert a very large number to a str:

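import math, sys, timeit
if hasattr(sys, "set_int_max_str_digits"):  # Python 3.11+ limits int -> str to 4300 digits by default
    sys.set_int_max_str_digits(0)           # 0 disables the limit
n = math.factorial(200000)
print(timeit.timeit("str(n)", "from __main__ import n", number=1))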

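It will probably take around 10 seconds (that is about a million digits), which is slow... but not hours slow (OK, converting to a string is slow :P ... but it still should not take hours) (OK, on my box it took about 243 seconds :P)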

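【Comments】:

Here's the thing: 200000! looks tiny compared to the number I am writing. My number is a little smaller than 67867957! Python had no problem writing 49979687# (# is the primorial symbol), which is around 49979687!
Ah, now we are starting to get useful data... indeed, the string conversion can take a very long time... and that is a lot more than a million digits...
@JoãoAreias 67867957! is about 500 million decimal digits long, i.e., we are talking about ~500 MB, not 1 MB.
Oh, sorry, my mistake. I misread it when I wrote this; it is not 1 MB, more like 100, and the primorial is still smaller than the factorial (though still large). Is there any way to speed up the process, or do I just have to wait and deal with it?
Oh yes, 100 million digits will take a while to convert to a string...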