Nested for loops for large datasets using Pandas: answers

【Question title】: Nested for loops for Large Datasets using Pandas
【Posted】: 2018-04-23 09:56:52
【Question description】:

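I am doing data analysis and have to generate histograms. My code has more than 7 nested for loops. Each nested loop filters the data frame by the unique values of a category to form a new data frame for that subcategory, which is then split further in the same way. There are about 400,000 records per day, and I have to process the records from the past 30 days. The end result is a histogram of the values (a single numeric column) for the last category that cannot be split any further. How can I reduce the complexity? Are there any alternative approaches?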

for customer in data_frame['MasterCustomerID'].unique():
    df_customer = data_frame.loc[data_frame['MasterCustomerID'] == customer]
    for service in df_customer['Service'].unique():
        df_service = df_customer.loc[df_customer['Service'] == service]
        for source in df_service['Source'].unique():
            df_source = df_service.loc[df_service['Source'] == source]
            for subcomponent in df_source['SubComponentType'].unique():
                df_subcomponenttypes = df_source.loc[df_source['SubComponentType'] == subcomponent]
                for kpi in df_subcomponenttypes['KPI'].unique():
                    df_kpi = df_subcomponenttypes.loc[df_subcomponenttypes['KPI'] == kpi]
                    for device in df_kpi['Device_Type'].unique():
                        df_device_type = df_kpi.loc[df_kpi['Device_Type'] == device]
                        for access in df_device_type['Access_type'].unique():
                            df_access_type = df_device_type.loc[df_device_type['Access_type'] == access]
                            df_access_type['Day'] = ifweekday(df_access_type['PerformanceTimeStamp'])

【Comments on the question】:

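Can you provide some data?
I won't assume you are not a pandas expert, but one of the purposes of the pandas package is specifically to avoid for loops.
Sorry! This is private company data. Sharing it would be a security breach.
OK, I have added the code I need help with.
You can use Pandas' own groupby and filter routines to do this work in a vectorized way.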
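
As a minimal sketch of what the last comment suggests, the example below uses `groupby` with `filter` to drop small groups and `agg` to summarise the remaining ones, without any explicit Python loop. The toy data, the `Value` column, and the threshold of 2 rows are assumptions for illustration and are not part of the question.

```python
import pandas as pd

# Toy data: 'Service' and 'KPI' follow the question's column names,
# 'Value' is a hypothetical numeric column.
toy = pd.DataFrame({
    "Service": ["a", "a", "a", "b", "b", "c"],
    "KPI":     ["x", "x", "y", "x", "x", "y"],
    "Value":   [1.0, 3.0, 5.0, 2.0, 4.0, 9.0],
})

keys = ["Service", "KPI"]

# Keep only the rows whose (Service, KPI) group has at least 2 rows.
kept = toy.groupby(keys).filter(lambda g: len(g) >= 2)

# One summary row per remaining group.
summary = kept.groupby(keys)["Value"].agg(["count", "mean"]).reset_index()
print(summary)
```

Here `filter` returns the original rows of the groups that pass the test, so the second `groupby` only sees groups large enough to be worth plotting.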

Tags: python pandas dataset nested-loops code-complexity

【Solution 1】:

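You can use pandas `groupby` to find the unique combinations of column levels, and then iterate over the data frame grouped by each combination. The example data has about 4,000 combinations (up to 4^6 = 4096), so be careful when uncommenting the histogram code below.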

import string
import numpy as np, pandas as pd
from matplotlib import pyplot as plt

np.random.seed(100)

# Generate 400,000 records (400 obs for 1000 individuals in 6 columns)
NIDS = 1000; NOBS = 400; NCOLS = 6

df = pd.DataFrame(np.random.randint(0, 4, size = (NIDS*NOBS, NCOLS)))
mapper = dict(zip(range(26), list(string.ascii_lowercase)))
df.replace(mapper, inplace = True)

cols = ['Service', 'Source', 'SubComponentType',
        'KPI', 'Device_Type', 'Access_type']
df.columns = cols

# Generate IDs for individuals
df['MasterCustomerID'] = np.repeat(range(NIDS), NOBS)

# Generate values of interest (to be plotted)
df['value2plot'] = np.random.rand(NIDS*NOBS)

# View the counts for each unique combination of column levels
df.groupby(cols).size()

# Do something with the different subsets (such as make histograms)
for levels, group in df.groupby(cols):
    print(levels)

    # fig, ax = plt.subplots()
    # ax.hist(group['value2plot'])
    # ax.set_title(", ".join(levels))
    # plt.savefig("hist_" + "_".join(levels) + ".png")
    # plt.close()
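
To connect this back to the question, the seven nested loops can be collapsed into a single `groupby` over the seven category columns, with the `Day` column computed once for the whole frame before grouping. This is only a sketch: it assumes `PerformanceTimeStamp` is a datetime column and uses `dt.dayofweek < 5` as a stand-in for the asker's `ifweekday` helper, whose definition is not shown. The sample rows are made up.

```python
import pandas as pd

# The seven category columns from the question's nested loops.
group_cols = ["MasterCustomerID", "Service", "Source", "SubComponentType",
              "KPI", "Device_Type", "Access_type"]

# Made-up sample rows standing in for the private data.
data_frame = pd.DataFrame({
    "MasterCustomerID": [1, 1, 1, 2],
    "Service": ["s1", "s1", "s2", "s1"],
    "Source": ["a", "a", "a", "b"],
    "SubComponentType": ["t", "t", "t", "t"],
    "KPI": ["k", "k", "k", "k"],
    "Device_Type": ["d", "d", "d", "d"],
    "Access_type": ["x", "x", "x", "x"],
    "PerformanceTimeStamp": pd.to_datetime(
        ["2018-04-20", "2018-04-21", "2018-04-22", "2018-04-23"]),
})

# Assumed stand-in for ifweekday(): True for Monday to Friday.
# Computed once on the full frame instead of inside the innermost loop.
data_frame["Day"] = data_frame["PerformanceTimeStamp"].dt.dayofweek < 5

# A single pass over all existing combinations replaces the seven loops.
group_sizes = {}
for levels, group in data_frame.groupby(group_cols):
    group_sizes[levels] = len(group)  # plot a histogram of `group` here instead

print(group_sizes)
```

Computing `Day` on the full frame also avoids assigning a new column to a filtered slice, which pandas warns about, and `groupby` visits only the combinations that actually occur in the data.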

【Discussion】: