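How to speed up del: answers

【Question title】: How to speed up del
【Posted】: 2021-01-17 13:49:07
【Question description】:

Our code has a huge pandas dataframe, of shape (102730344, 50). To free up memory, we del this dataframe once it is no longer needed. That del statement currently takes 4 hours to run on powerful hardware. Is there any way to speed it up?

The code flow is as follows: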

big_data_df, small_df, medium_data, smaller_df = get_data(params)
#commented out code
del big_data_df # this takes 4 hours

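So we call a function that returns 4 dataframes, one of which is the big dataframe we later want to delete. For testing, we have commented out the code between getting the dataframe and deleting it once it is no longer needed. The del then runs, and the log statement after it shows that it took 4 hours.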

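【Question discussion】:

What are you doing, exactly? Note that del by itself does not free memory. It removes a name, in the simplest case del some_name. It is also part of del some_container[item], which is roughly equivalent to some_container.__delitem__(item).
Some related reading: pandas.pydata.org/pandas-docs/stable/user_guide/scale.html
What data types are in the dataframe? If they are object, every individual object has to be dereferenced and deleted.
Please answer @tdelaney's question. It matters. If the dominant type is object, then also try a Python from (at least) the 3.8 series, for reasons partly explained here: stackoverflow.com/questions/63348685/…
Yes, we are reading the data from SQL using read_sql, which returns a dataframe in which most of the columns are object.

Tags: python pandas

【Solution 1】: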
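
The two points raised in the comments (del only unbinds a name, and an object column has to be released one object at a time) can be illustrated with a small standard-library sketch. Cell and make_column are hypothetical stand-ins for the values in an object-dtype column, not pandas internals, and the immediate release shown here relies on CPython's reference counting.

```python
import weakref


class Cell:
    """Hypothetical stand-in for one Python object held in an object column."""


def make_column(n, freed):
    """Build a list of n Cell objects and record each one's destruction."""
    column = [Cell() for _ in range(n)]
    for index, cell in enumerate(column):
        # The callback runs when this particular Cell is destroyed.
        weakref.finalize(cell, freed.append, index)
    return column


freed = []
column = make_column(5, freed)
alias = column              # a second name bound to the same list

del column                  # removes one name; the list is still reachable
print(len(freed))           # 0

del alias                   # last reference gone: every Cell is released in turn
print(len(freed))           # 5 on CPython
```

With hundreds of millions of rows and mostly object columns, that per-object release is plausibly where the time goes, which is why the dtype question matters.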

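You could create the large dataframe in a subprocess, send only what you want back to the parent process, and then use os._exit() to skip the cleanup of individual objects. Whether this works for you depends on the relative size of the data returned. In your case, the SQL query and the dataframe creation/processing would presumably be done in the subprocess. In this example I send the result over stdout, but saving it to a temporary file would also be reasonable. I'm using pickle, but other serializers (such as pyarrow) may be faster.

... then again, it might not work at all in your case.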

dfuser.py

import sys
import subprocess as subp
import pandas as pd

try:
    # launch the child script shown below (dfcreator.py)
    proc = subp.Popen([sys.executable, 'dfcreator.py'], stdin=subp.PIPE, stdout=subp.PIPE, stderr=None)
    df = pd.read_pickle(proc.stdout, compression=None)
    print("got df")
    proc.stdin.write(b"thanks\n")
    proc.stdin.close()
    proc.wait()
    print(df)
finally:
    print('parent done')

dfcreator.py

import pandas as pd
import sys
import os

try:
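    # add your df creation and processing here
    # (placeholder data; pd.util.testing.makeDataFrame() is gone from pandas 2.x)
    df = pd.DataFrame({"a": range(10), "b": list("abcdefghij")})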
    small_df = df # your processing makes it smaller
    # send
    small_df.to_pickle(sys.stdout.buffer, compression=None)
    sys.stdout.close()
    # make sure received
    sys.stdin.read(1)
finally:
    # exit without deleting df to save time
    sys.stderr.write("out of here\n")
    os._exit(0)
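
The answer mentions a temporary file as an alternative to stdout. A minimal sketch of that variant follows, using only the standard library so it runs without pandas. The list `big`, the inline CHILD_SOURCE script, and run_child are hypothetical stand-ins for the real dataframe code. Note that the child closes the file before calling os._exit(), because os._exit() does not flush open buffers.

```python
import os
import pickle
import subprocess
import sys
import tempfile
import textwrap

# Child program: builds a large structure, saves only the small result,
# then leaves through os._exit() so the interpreter skips object teardown.
CHILD_SOURCE = textwrap.dedent("""
    import os, pickle, sys

    big = [str(i) for i in range(200_000)]   # stand-in for the large dataframe
    small = big[:3]                          # stand-in for the reduced result
    with open(sys.argv[1], "wb") as out:     # closed (and flushed) at block exit
        pickle.dump(small, out)
    os._exit(0)                              # no per-object cleanup on the way out
""")


def run_child():
    """Run the child, read its result from a temp file, then delete the file."""
    fd, path = tempfile.mkstemp(suffix=".pkl")
    os.close(fd)  # the child reopens the path itself
    try:
        subprocess.run([sys.executable, "-c", CHILD_SOURCE, path], check=True)
        with open(path, "rb") as src:
            return pickle.load(src)
    finally:
        os.remove(path)


result = run_child()
print(result)  # ['0', '1', '2']
```

With a file there is no need for the stdin handshake used above: the parent simply waits for the child to exit before reading.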

【Discussion】: