How to transform multiple pandas dataframes to an array under memory constraints? Answers

【Question Title】: How to transform multiple pandas dataframes to an array under memory constraints?
【Posted】: 2020-02-25 14:39:40
【Question Description】:

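The problem: I have folders named folder 1 through folder 999. Each folder contains parquet files named 1.parquet through 999.parquet. Each parquet file holds a pandas dataframe with the following structure: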

id   |title   |a
1    |abc     |1
1    |abc     |3
1    |abc     |2
2    |abc     |1
...  |def     | ...

Column a can take values in the range a1 to a3.

The partial step is to obtain this structure:

id | title | a1 | a2 | a3
1  | abc   | 1  | 1  | 1
2  | abc   | 1  | 0  | 0
...

This is needed to obtain the final form:

    title
id | abc | def | ...
1  | 3   | ... |
2  | 1   | ... |

The value in column abc is the sum of columns a1, a2 and a3.
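To make the two shapes concrete, here is a minimal sketch that builds the partial step and the final form from the four sample rows shown above. Using pd.crosstab is an assumption made for illustration; the question does not say how the partial step is actually produced.

```python
import pandas as pd

# Sample rows taken from the example table in the question.
df = pd.DataFrame({
    "id":    [1, 1, 1, 2],
    "title": ["abc", "abc", "abc", "abc"],
    "a":     [1, 3, 2, 1],
})

# Partial step: one count column per distinct value of a (a1, a2, a3).
partial = pd.crosstab([df["id"], df["title"]], df["a"]).add_prefix("a")

# Final form: sum the a-columns per (id, title), then spread title across columns.
final = partial.sum(axis=1).unstack("title", fill_value=0)

print(partial)
print(final)
```

For id 1 and title abc this gives a1 = a2 = a3 = 1 and therefore 3 in the final table, matching the tables above.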

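The goal is to obtain the final form computed over all parquet files in all folders.

As for where I stand now: I do know how to get from the partial step to the final form, e.g. by using sparse.coo_matrix() as explained in How to make full matrix from dense pandas dataframe.

The problem is that, due to memory limits, I cannot simply read all the parquet files at once.

I have three questions:

If I have a lot of data (say each parquet file is about 500MB), how do I get there efficiently?
Can I transform each parquet file to the final form separately and then merge them somehow? If yes, how could I do that?
Is there any way to skip the partial step?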

【Question Discussion】:

Tags: python pandas bigdata

【Solution 1】:

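For each dataframe in the files, you seem to

Group the data by the id and title columns
Then sum the data in column a for each group

There is no need to build the full matrix for this task, and the same goes for the partial step.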
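As a minimal sketch of going from the raw rows to the final layout with no partial step, a single groupby is enough. The sample rows below extend the question's table with one hypothetical def row. Two aggregations are shown because they differ: the number of rows per (id, title) reproduces a1 + a2 + a3 as defined in the question, while summing a gives the total of the values themselves.

```python
import pandas as pd

# Rows from the question's example, plus one hypothetical "def" row.
df = pd.DataFrame({
    "id":    [1, 1, 1, 2, 2],
    "title": ["abc", "abc", "abc", "abc", "def"],
    "a":     [1, 3, 2, 1, 2],
})

grouped = df.groupby(["id", "title"])

# Rows per (id, title): equals a1 + a2 + a3 from the question's partial step.
row_counts = grouped.size().unstack("title", fill_value=0)

# Sum of the a values per (id, title).
value_sums = grouped["a"].sum().unstack("title", fill_value=0)

print(row_counts)
print(value_sums)
```

Which of the two is wanted depends on what the final table is meant to hold, so it is worth checking against a small hand-computed case first.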

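I am not sure how many unique combinations of id and title exist in a single file, or whether all of them occur. A safe approach is to process the files in batches, save the results, and later combine all the results.

Which looks like,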

import pandas as pd
import numpy as np
import string

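def gen_random_data(N, M):
    # Stand-in for reading one file: N rows, ids in 1..M, values of a in 0..M-1,
    # titles drawn from M random three-letter strings (e.g. N = 100, M = 10)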

    titles = np.apply_along_axis(lambda x: ''.join(x), 1, np.random.choice(list(string.ascii_lowercase), 3*M).reshape(-1, 3))
    titles = np.random.choice(titles, N)
    _id = np.random.choice(np.arange(M) + 1, N)
    val = np.random.randint(M, size=(N,))

    df = pd.DataFrame(np.vstack((_id, titles, val)).T, columns=['id', 'title', 'a'])
    df = df.astype({'id': np.int64, 'title': str, 'a': np.int64})

    return df

def combine_results(grplist):
    # Stitch the grouped results side by side, aligned on (id, title)
    comb_df = pd.concat(grplist, axis=1)

    # Sum across the columns for each (id, title); NaN (key missing from a result) counts as 0
    comb_df = comb_df.sum(axis=1)

    # Return a dataframe holding the sum of a's
    return comb_df.to_frame('sum_of_a')

totalfiles = 10
batch      = 2
filelist   = []
for counter, start in enumerate(range(0, totalfiles, batch)):
    # Read one batch of files (start is the index of its first file).
    # Here, random data stands in for each file.
    dflist = [gen_random_data(100, 2) for _ in range(batch)]

    # Aggregate each dataframe in memory: sum a per (id, title)
    dflist = [df.groupby(['id', 'title']).agg(['sum']) for df in dflist]

    collection = combine_results(dflist)

    # write intermediate results to file and repeat the process for the rest of the files
    intermediate_result_file_name = f'resfile_{counter}.parquet'
    collection.to_parquet(intermediate_result_file_name, index=True)
    filelist.append(intermediate_result_file_name)

# Combining result files.
collection = [pd.read_parquet(file) for file in filelist]
totalresult = combine_results(collection)
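If the per-file results are small, the intermediate files may not be needed at all: a running total can be updated one dataframe at a time. The following is a minimal sketch of that variant, not part of the original answer. The helper name accumulate_counts is hypothetical, and it counts rows per (id, title), which is how the question defines the final values. Swap in ["a"].sum() if the sum of a is wanted.

```python
import pandas as pd

def accumulate_counts(frames):
    """Fold an iterable of DataFrames into one (id, title) -> row-count Series."""
    total = None
    for df in frames:
        part = df.groupby(["id", "title"]).size()
        # add() aligns on the (id, title) index; fill_value=0 covers keys
        # that are present on only one side.
        total = part if total is None else total.add(part, fill_value=0)
    if total is None:
        return pd.Series(dtype="int64")
    return total.astype("int64")

# Two small in-memory frames stand in for two parquet files.
frames = (
    pd.DataFrame({"id": [1, 1, 2], "title": ["abc", "abc", "abc"], "a": [1, 3, 1]}),
    pd.DataFrame({"id": [1, 2], "title": ["abc", "def"], "a": [2, 2]}),
)
counts = accumulate_counts(frames)

# Pivot only once, at the very end, to get the id x title table.
final = counts.unstack("title", fill_value=0)
print(final)
```

With real files, passing a generator such as (pd.read_parquet(p, columns=["id", "title"]) for p in paths) keeps only one file in memory at a time, where paths is a hypothetical list of the parquet file paths. Whether the running total itself fits in memory depends on the number of distinct (id, title) pairs, which the question does not state.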

【Discussion】: