Memory error when concatenating multiple Pandas DataFrames: answers

【Question title】: Memory Error while concatenating multiple Pandas Dataframes
【Posted】: 2021-05-07 14:09:14
【Question description】:

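We are trying to load the IDS-2018 dataset, which consists of 10 CSV files with a total size of 6.4 GB. When we try to concatenate all the CSV files on a server with 32 GB of RAM, it crashes (the process is killed).

We even tried to reduce the storage used by the pandas DataFrames with the following function: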


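import numpy as np


def reduce_mem_usage(df):
    # Downcast each numeric column to the smallest dtype whose range holds its values.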
    numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
    start_mem = df.memory_usage().sum() / 1024**2
    for col in df.columns:
        col_type = df[col].dtypes
        if col_type in numerics:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
    end_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))
    return df

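But it did not help. The server still crashes while concatenating the CSV files. We concatenated the files using pd.concat. The whole code is here. How can we do this so that we can carry on with further processing?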

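【Question discussion】:

Is there any limit on your server that restricts the memory usage of a particular process? I wonder whether the machine is not actually out of memory and the operating system is killing the process because it will not let it have any more.
I don't think there is any such limit. By the way, the Kaggle kernel crashed as well.
reduce_mem_usage doesn't work because you only convert the values of the columns and assign the new values to the same memory. The main difference between float64 and float32 is not the numeric range but the precision.
@Daniel If possible, could you suggest a workaround for this problem? Isn't memory released by Python's garbage collector once it is no longer in use?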
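The remark about precision can be illustrated with a minimal sketch. The sample value below is made up for illustration and is not taken from the dataset. It passes the range checks used in reduce_mem_usage for both float16 and float32, yet the stored number changes after the cast:

```python
import numpy as np

# Hypothetical sample value, chosen only for illustration.
value = 1234.567

# The value fits inside the float16 range, so a range-only check accepts it.
fits_float16_range = np.finfo(np.float16).min < value < np.finfo(np.float16).max

as_f64 = np.float64(value)
as_f32 = np.float32(value)
as_f16 = np.float16(value)

# Absolute error introduced by each narrower type.
err32 = abs(float(as_f32) - value)
err16 = abs(float(as_f16) - value)

print("fits float16 range:", fits_float16_range)
print("float64:", as_f64)
print("float32:", as_f32, "error:", err32)
print("float16:", as_f16, "error:", err16)
```

Whether that loss is acceptable depends on the column, so downcasting by range alone may silently alter the data.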

Tags: python pandas machine-learning out-of-memory

【Solution 1】:

I would try the following:

Specify the column types in read_csv via the dtype parameter.
Do not create 10 DataFrames and rely on del.

import numpy as np
import pandas as pd

data_files = [
    './data/CSVs/02-14-2018.csv',
    './data/CSVs/02-15-2018.csv',
    # ... a few more
]

# define dtypes
data_types = {
  "col_a": np.float64,
  # ... other types
}

# reduce_mem_usage is the function defined in the question
df = reduce_mem_usage(
    pd.read_csv(data_files[0], dtype=data_types, index_col=False)
)
for filename in data_files[1:]:
    df = pd.concat(
        [
            df,
            reduce_mem_usage(
                pd.read_csv(
                    filename,
                    dtype=data_types,
                    index_col=False,
                )
            ),
        ],
        ignore_index=True,
    )

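This way you make sure the column types are exactly what you need instead of whatever gets inferred, and you reduce the memory footprint. Also, if your data contains categorical columns, which are usually encoded as strings in CSV files, you can greatly reduce the memory footprint by using the categorical dtype.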
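As a rough sketch of the effect, the snippet below reads the same small CSV twice, once with default inference and once with explicit dtypes. The CSV content and the column names label and value are invented for this example and have nothing to do with the real dataset:

```python
import io

import pandas as pd

# Hypothetical CSV content; the column names and values are made up.
csv_text = "label,value\n" + "\n".join(
    f"{'benign' if i % 2 else 'attack'},{i}" for i in range(1000)
)

# Default inference: the text column is stored as strings, the number as int64.
df_default = pd.read_csv(io.StringIO(csv_text))

# Explicit dtypes: a categorical label and a narrower integer type.
df_typed = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"label": "category", "value": "int32"},
)

# deep=True also counts the string data itself, not just the pointers to it.
mem_default = df_default.memory_usage(deep=True).sum()
mem_typed = df_typed.memory_usage(deep=True).sum()

print("default dtypes:", mem_default, "bytes")
print("explicit dtypes:", mem_typed, "bytes")
```

Note that memory_usage() without deep=True can understate the size of string columns considerably, which may hide how much the category conversion saves.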

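【Discussion】:

Since there are almost 80 columns, I simply used dtypes_of_0 = d0.dtypes.to_dict(). The approach you posted just concatenates without creating new DataFrames, right? Does that make a big difference? I will test it soon and let you know whether it works. Thanks!
I just checked. The process still gets killed. I'm out of ideas :(
Have you checked the memory consumption while the script is running? Are you sure it is an out-of-memory problem? If so, another layer of memory optimization is to convert categorical columns to the categorical dtype. See stackoverflow.com/questions/39092067/… You will have to hunt for those columns manually, but it can make a huge difference.
Yes, I checked the memory consumption with free -h while the script was running; the process used up all of the virtual memory and the terminal became unresponsive after a while. When I load the data with pandas, the original dtype of every column is inferred as "object". So how can I tell manually whether a column contains mixed data types, and write out the exact dtype for each column?
If everything is inferred as object, reduce_mem_usage won't do anything for you. If you really have to load all of it, I would start small and load only a few columns at a time, say 5, using the usecols parameter of read_csv. Then dump those 5 columns to a parquet file. Sort of faking a columnar store. Now you can at least look at your data to decide on dtype conversions. There is a good summary here: stackoverflow.com/questions/15891038/…. Also, why don't you have index_col=False on d0? I modified the answer.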
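One possible way to act on that suggestion is sketched below: read only a couple of columns as raw strings and list the entries that refuse to parse as numbers, which is often what forces a column to object. The CSV content, the column names and the stray "bad" entry are all hypothetical, and non_numeric_values is a helper written for this example, not a pandas function:

```python
import io

import pandas as pd

# Hypothetical CSV; the column names and the stray "bad" entry are made up.
csv_text = "col_a,col_b,col_c\n1,0.5,x\n2,bad,y\n3,1.5,z\n"

# Load only a subset of the columns, as strings, so nothing is guessed.
subset = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["col_a", "col_b"],
    dtype=str,
)


def non_numeric_values(series):
    """Return the distinct non-missing entries that cannot be parsed as numbers."""
    parsed = pd.to_numeric(series, errors="coerce")
    offenders = series[parsed.isna() & series.notna()]
    return sorted(str(v) for v in offenders.unique())


# Map each loaded column to the entries that block a numeric dtype.
report = {col: non_numeric_values(subset[col]) for col in subset.columns}
print(report)
```

A column with an empty list can be converted with pd.to_numeric and then downcast; the others need their offending entries handled first. Once a batch of columns is clean, it could be written out with DataFrame.to_parquet, which requires pyarrow or fastparquet to be installed.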