Reduce writing time of data (answers)

【Question title】: Reduce writing time of data
【Posted】: 2015-05-20 18:06:03
【Question description】:

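I am post-processing CFD data (applying a rotation to the coordinates). To do that, I perform the following steps:

- Read the file

- Store the data in a structured array

- Process the data (do the calculations)

- Write a new file

It works, but it takes 7 seconds per file, and I have (15000 * 4) files to process...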

for i in range(0,len(file_count)):
    #Source folder with original files
    os.chdir(path+'\\'+folder_source_location)
    #Generate file names
    file_name = file_source_begin+("%0"+str(ndigit)+"d") % file_count[i]+"_tec.dat"

    #Read the file
    Data = read_tecUNS(file_name)

    #New data set modified
    Data_new = Data

    #Translation
    Data["node"]["X"]+=translator_plane2RotCenter[0]    #The += is important or the Data won't be affected by the translation
    Data["node"]["Y"]+=translator_plane2RotCenter[1]
    Data["node"]["Z"]+=translator_plane2RotCenter[2]

    #Rotation
    Y_temp = Data["node"]["Y"]*cos(theta_rot_rad)-Data["node"]["Z"]*sin(theta_rot_rad)
    Z_temp = Data["node"]["Y"]*sin(theta_rot_rad)+Data["node"]["Z"]*cos(theta_rot_rad)

    Data_new["node"]["Y"]=Y_temp
    Data_new["node"]["Z"]=np.mean(Z_temp)   #Due to rounding, the Z values are not exactly the same. The mean avoid that.

    #Write the new file
    os.chdir(path+'\\'+folder_source_location+'\\'+"Output")
    write_tecplot(file_name,Data_new)
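
The translation and rotation steps in the loop above can be isolated into a small function. This is a minimal sketch, not the asker's code: the name `translate_and_rotate` and the two-node test array are hypothetical, and it assumes the rotation is about the X axis, as the Y/Z formulas suggest. Note that `Data_new = Data` only binds a second name to the same dictionary, whereas the sketch works on an explicit copy. It leaves out the `np.mean` step applied to Z.

```python
import numpy as np

def translate_and_rotate(nodes, offset, theta):
    """Return a copy of `nodes` translated by `offset`, then rotated by
    `theta` (radians) about the X axis. Hypothetical helper for illustration."""
    out = nodes.copy()  # explicit copy, so the input array is left untouched
    y = nodes["Y"] + offset[1]
    z = nodes["Z"] + offset[2]
    out["X"] = nodes["X"] + offset[0]
    out["Y"] = y * np.cos(theta) - z * np.sin(theta)
    out["Z"] = y * np.sin(theta) + z * np.cos(theta)
    return out

# Toy structured array with two nodes (assumed field names X, Y, Z)
nodes = np.zeros(2, dtype=[("X", "f8"), ("Y", "f8"), ("Z", "f8")])
nodes["Y"] = [1.0, 0.0]
nodes["Z"] = [0.0, 1.0]

rotated = translate_and_rotate(nodes, (0.0, 0.0, 0.0), np.pi / 2)
print(rotated["Y"], rotated["Z"])
```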

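Do you have any ideas for improving this? I thought about threading the writes, but I am not sure that would improve anything.

Here is a sample of the reading/computing/writing times: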

The output folder already exists. The data in it will be erased
StartReading B--0.000018_tec.dat in progress. - 0.001s elapsed
EndReading B--0.000018_tec.dat in progress. - 0.433s elapsed
StartWriting B--0.000018_tec.dat in progress. - 0.435s elapsed
EndWriting B--0.000018_tec.dat in progress. - 7.585s elapsed

StartReading B--0.000036_tec.dat in progress. - 7.586s elapsed
EndReading B--0.000036_tec.dat in progress. - 7.697s elapsed
StartWriting B--0.000036_tec.dat in progress. - 7.697s elapsed
EndWriting B--0.000036_tec.dat in progress. - 13.472s elapsed

A script and a sample are also available if you want to try it yourself:

http://s000.tinyupload.com/index.php?file_id=80589646527340633700

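【Question comments】:

You haven't shown us write_tecplot! Surely that is the most important part?
^ Not only that, we can't see the read_tecUNS() method either...
Since the reading time is between 0.1 and 0.4 s, I don't think it is the most critical part. Anyway, the function is quite long and ugly (too long to post), but it can be found in the Sample package! :)

Tags: python optimization time writing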

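【Solution 1】:

The problem is not the writing itself, but how the data is prepared and formatted for writing.

If you profile your script with something like python -m cProfile -s cumtime Plane_modifier_rev4-multiple_files.py > out.txt, you will see that most of the time is spent formatting arrays: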

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.003    0.003   22.297   22.297 Plane_modifier_rev4-multiple_files.py:6(<module>)
        2    0.282    0.141   21.881   10.941 ASCII_TEC.py:101(write_tecplot)
77424/48512    0.091    0.000   21.527    0.000 numeric.py:1681(array_str)
77424/48512    0.424    0.000   21.477    0.000 arrayprint.py:343(array2string)
    48512    0.928    0.000   21.149    0.000 arrayprint.py:233(_array2string)
   145536    0.360    0.000   12.532    0.000 arrayprint.py:533(__init__)
   145536    5.891    0.000   12.172    0.000 arrayprint.py:547(fillFormat)
    48512    0.219    0.000    7.922    0.000 arrayprint.py:700(__init__)
    48512    0.620    0.000    5.623    0.000 arrayprint.py:465(_formatArray)
   170236    2.416    0.000    4.413    0.000 arrayprint.py:598(__call__)
   631546    1.300    0.000    2.933    0.000 numeric.py:2428(seterr)
   434430    2.310    0.000    2.310    0.000 {method 'reduce' of 'numpy.ufunc' objects}
   315773    0.337    0.000    1.941    0.000 numeric.py:2813(__enter__)
   143356    0.234    0.000    1.814    0.000 fromnumeric.py:1772(any)
   315773    0.359    0.000    1.689    0.000 numeric.py:2818(__exit__)
    48512    0.473    0.000    1.268    0.000 arrayprint.py:639(__init__)
   143356    0.157    0.000    1.163    0.000 {method 'any' of 'numpy.ndarray' objects}
   631546    0.967    0.000    1.034    0.000 numeric.py:2524(geterr)
   143356    0.092    0.000    1.006    0.000 _methods.py:37(_any)
   443944    0.763    0.000    0.944    0.000 arrayprint.py:632(_digits)
   143358    0.166    0.000    0.418    0.000 numeric.py:464(asanyarray)
   145536    0.410    0.000    0.410    0.000 {method 'compress' of 'numpy.ndarray' objects}
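
The same kind of report can be collected from inside Python with the standard `cProfile` and `pstats` modules. The sketch below is only an illustration: `format_rows` is a hypothetical stand-in for the formatting work, not a function from the asker's script.

```python
import cProfile
import io
import pstats

def format_rows(values, width=5):
    """Toy stand-in for the formatting done while writing a file."""
    return [",".join(str(v) for v in values[i:i + width])
            for i in range(0, len(values), width)]

profiler = cProfile.Profile()
profiler.enable()
rows = format_rows(list(range(1000)))
profiler.disable()

# Sort by cumulative time, as "-s cumtime" does on the command line
report = io.StringIO()
stats = pstats.Stats(profiler, stream=report)
stats.sort_stats("cumtime").print_stats(10)
print(report.getvalue())
```

Reading the `cumtime` column from the top shows which call chain dominates the run, which is how the array formatting functions stand out in the listing above.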

For example, this:

    for name in names:
        for col_index in range(0,N,5):  #The tecplot data for each variable are saved within 5 columns
            f.write(str(Data["node"][name][col_index:col_index+5])[1:-1]+"\n")
        f.write("\n"+"\n")

can be rewritten like this (and should be faster):

    for name in names:
        n = Data["node"][name]
        for col_index in range(0,N,5):  #The tecplot data for each variable are saved within 5 columns
            vs = n[col_index:col_index+5]
            f.write(",".join([str(v) for v in vs])+"\n")
        f.write("\n"+"\n")
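
To see the difference between the two approaches in isolation, they can be timed side by side. This is a rough sketch on synthetic data: the function names and the `values` array are made up for the comparison, and the timings will vary by machine and NumPy version. Two differences in the output are worth knowing: the `str(array)` version separates values with spaces and rounds them to NumPy's print precision (8 digits by default), while the join version separates them with commas and prints each scalar in full.

```python
import timeit
import numpy as np

values = np.linspace(0.0, 1.0, 5000)

def rows_via_array_str(a):
    # str() on each 5-element slice goes through NumPy's array printing code
    return [str(a[i:i + 5])[1:-1] for i in range(0, len(a), 5)]

def rows_via_join(a):
    # str() on each scalar, joined by hand
    return [",".join([str(v) for v in a[i:i + 5]]) for i in range(0, len(a), 5)]

t_array = timeit.timeit(lambda: rows_via_array_str(values), number=3)
t_join = timeit.timeit(lambda: rows_via_join(values), number=3)
print("array str: %.3fs, join: %.3fs" % (t_array, t_join))
```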

Edit

Some changes to write_tecplot:

def write_tecplot(outfile, Data):
    """
    The expected Data is a dictionary with one structured array: node and one simple array: face
    """
    N = Data["node"].shape[0]   # N is the number of nodes
    E = Data["face"].shape[0]   # E is the number of faces

    # Create the file and write the header
    with open(outfile + '.dat', 'w') as f:
        # Write HEADER
        f.write('TITLE = "title"\n')
        f.write('VARIABLES  = ')
        names = Data["node"].dtype.names

        # Write variable names
        f.write('"' + '","'.join(names) + '"\n')
        f.write('ZONE T="tecdata", N=%s, E=%s, ET=QUADRILATERAL, F=FEBLOCK\n\n' % (N, E))

        # WRITE DATA
        # Write node data: each variable is written five values per line
        for name in names:
            n = Data["node"][name]  # look the column up once, outside the inner loop
            for col_index in range(0, N, 5):
                f.write(",".join([str(v) for v in n[col_index:col_index+5]]) + "\n")
            f.write("\n\n")

        # Write face data: one face per line
        face = Data["face"]
        for col_index in range(0, E):
            f.write(",".join([str(v) for v in face[col_index]]) + "\n")
        f.write("\n\n")
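
A possible further step, not part of the answer above, is to convert each column to Python scalars once and issue a single write per variable. The helper name `write_column` is hypothetical, and whether this gains anything on real files is an assumption that should be checked with the profiler.

```python
import io
import numpy as np

def write_column(f, column, width=5):
    """Write one variable, `width` values per line, with a single f.write call."""
    values = [str(v) for v in column.tolist()]  # convert to Python scalars once
    lines = [",".join(values[i:i + width]) for i in range(0, len(values), width)]
    # Same layout as the loop version: one newline per row, then two blank lines
    f.write("".join(line + "\n" for line in lines) + "\n\n")

buf = io.StringIO()
write_column(buf, np.arange(7, dtype=float))
print(buf.getvalue())
```

Because `tolist()` yields plain Python floats, the text of individual values may differ slightly from `str()` on NumPy scalars in some NumPy versions, so the output is worth comparing against a reference file before switching.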

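【Comments】:

Just... wow! The time went from 7 s down to 2 s! I really don't understand the main difference between the two scripts. Why was mine so slow? I assumed fewer lines of code would be more likely to run faster than more lines, but I was wrong! :O Can you explain why? Or is it just experience? Thanks a lot!
Can you check it? (It should produce the same output.)
As for the speed difference, look at the profiling. You were (implicitly) calling a lot of array and numpy formatting methods, and you were resolving values inside a loop (chains like Data["node"][name]) when you could keep the value outside it, plus a few other small things. The main problem was the excessive array formatting/stringification.
Wow! 7 -> 0.35 s to write a file! So the main idea is to use temporary variables as much as possible when working with numpy arrays?! Maybe because when I do Data["node"], the whole array gets called? Anyway, thank you very much! :D
The basic idea is to profile :) But the big change comes from formatting the array ourselves instead of letting numpy format it. Check how much code runs when you execute str(numpy_array): github.com/numpy/numpy/blob/master/numpy/core/…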