Pandas Dataframe: melt alternating groups of columns for wide to long conversion

【Question title】: Pandas Dataframe: melt alternating groups of columns for wide to long conversion
【Posted】: 2017-07-31 04:41:45
【Question】:

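I'm hoping someone can help me convert my current dataframe from wide to long format. I'm using Pandas 0.18.0, and I can't seem to find any other solution on Stack Overflow that fits my needs.

Any help would be greatly appreciated!

I have 50 steps, each with two categories (status/time) that I need to melt, and these categories alternate in my dataframe. Below is an example with only 3 groups, but the same pattern continues up to 50.

Status can be: yes/no/NaN

Time can be: timestamp/NaN

Current dataframe: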

   cl_id  cl_template_id  status-1  time-1                     status-2  time-2                     status-3  time-3
0  18434  107             NaN       NaN                        NaN       NaN                        NaN       NaN
1  18280  117             yes       2016-12-28T18:21:58+00:00  yes       2016-12-28T20:47:31+00:00  yes       2016-12-28T20:47:32+00:00
2  18356  413             yes       2017-01-11T19:23:10+00:00  yes       2017-01-11T19:23:11+00:00  yes       2017-01-11T19:23:11+00:00
3  18358  430             NaN       NaN                        NaN       NaN                        NaN       NaN
4  18359  430             yes       2017-01-11T19:20:32+00:00  yes       2017-01-11T19:20:34+00:00  NaN       NaN
.
.
.

Target dataframe:

cl_id cl_template_id   step   status   time
18434 107               1      NaN      NaN
18434 107               2      NaN      NaN
18434 107               3      NaN      NaN
18280 117               1      yes      2016-12-28T18:21:58+00:00
18280 117               2      yes      2016-12-28T20:47:31+00:00
18280 117               3      yes      2016-12-28T20:47:32+00:00
18356 413               1      yes      2017-01-11T19:23:10+00:00
18356 413               2      yes      2017-01-11T19:23:11+00:00
18356 413               3      yes      2017-01-11T19:23:11+00:00
.
.
.

Tags: python pandas melt

【Solution 1】:

Old thread, but I ran into the same problem, and I think this answer by Ted Petrou would help you perfectly here: Pandas Melt several groups of columns into multiple target columns by name

pd.wide_to_long(df, stubnames, i, j, sep, suffix)

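In short: the pd.wide_to_long() function lets you specify the common component (the stub name) shared by the columns you want to unpivot.

For example, my dataframe was similar to yours.

pd.melt and DataFrame.unstack get you close, but they don't let you target these incrementing groups of columns by what they have in common.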
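To make the pd.wide_to_long() call concrete, here is a minimal sketch applied to a three-row subset of the question's data. It assumes a pandas version in which `sep` and `suffix` are available (to my knowledge they were added after 0.18.0, so the asker would likely need to upgrade), and it assumes `cl_id` uniquely identifies each row.

```python
import numpy as np
import pandas as pd

# Small wide-format sample using the question's hyphenated column names
df = pd.DataFrame({
    'cl_id': [18434, 18280, 18359],
    'cl_template_id': [107, 117, 430],
    'status-1': [np.nan, 'yes', 'yes'],
    'time-1': [np.nan, '2016-12-28T18:21:58+00:00', '2017-01-11T19:20:32+00:00'],
    'status-2': [np.nan, 'yes', 'yes'],
    'time-2': [np.nan, '2016-12-28T20:47:31+00:00', '2017-01-11T19:20:34+00:00'],
    'status-3': [np.nan, 'yes', np.nan],
    'time-3': [np.nan, '2016-12-28T20:47:32+00:00', np.nan],
})

long_df = (
    pd.wide_to_long(
        df,
        stubnames=['status', 'time'],    # shared prefixes of the column groups
        i=['cl_id', 'cl_template_id'],   # id columns; together they must be unique per row
        j='step',                        # name for the column holding the suffix
        sep='-',                         # separator between stub and suffix
        suffix=r'\d+',                   # suffixes are the step numbers
    )
    .reset_index()                       # turn the (i, j) index back into columns
    .sort_values(['cl_id', 'step'])
    .reset_index(drop=True)
)

print(long_df)
```

The same call should cover all 50 steps without listing any column by hand, because the columns are matched by stub name and separator rather than enumerated.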

【Solution 2】:

I hope this answer provides some insight into the problem.

First, I'll recreate an example from your dataframe:

import numpy as np
import pandas as pd

# Make example dataframe (np.nan rather than np.NaN, which NumPy 2.0 removed)
df = pd.DataFrame({'cl_id' : [18434, 18280, 18356, 18358, 18359],
                   'cl_template_id' : [107, 117, 413, 430, 430],
                   'status_1' : [np.nan, 'yes', 'yes', np.nan, 'yes'],
                   'time_1' : [np.nan, '2016-12-28T18:21:58+00:00', '2017-01-11T19:23:10+00:00', np.nan, '2017-01-11T19:20:32+00:00'],
                   'status_2' : [np.nan, 'yes', 'yes', np.nan, 'yes'],
                   'time_2' : [np.nan, '2016-12-28T20:47:31+00:00', '2017-01-11T19:23:11+00:00', np.nan, '2017-01-11T19:20:34+00:00'],
                   'status_3' : [np.nan, 'yes', 'yes', np.nan, np.nan],
                   'time_3' : [np.nan, '2016-12-28T20:47:32+00:00', '2017-01-11T19:23:11+00:00', np.nan, np.nan]})

Second, convert time_1, time_2 and time_3 to datetime:

# Convert time_1,2,3 to datetime
# (assign the whole column; in recent pandas, df.loc[:, col] = ... writes in place and can keep the object dtype)
df['time_1'] = pd.to_datetime(df['time_1'])
df['time_2'] = pd.to_datetime(df['time_2'])
df['time_3'] = pd.to_datetime(df['time_3'])
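With 50 steps, writing one conversion line per column gets tedious. The following is a minimal sketch of doing the same conversion in a loop. The `demo` frame and the `time_` prefix filter are assumptions made for illustration, and `utc=True` is added so the result is explicitly timezone-aware.

```python
import numpy as np
import pandas as pd

# Two time columns with a missing value each (illustrative data only)
demo = pd.DataFrame({
    'time_1': [np.nan, '2016-12-28T18:21:58+00:00'],
    'time_2': [np.nan, '2016-12-28T20:47:31+00:00'],
})

# Pick every column whose name starts with 'time_' and convert it
time_cols = [c for c in demo.columns if c.startswith('time_')]
for col in time_cols:
    # NaN becomes NaT; ISO 8601 strings become UTC timestamps
    demo[col] = pd.to_datetime(demo[col], utc=True)

print(demo.dtypes)
```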

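Third, split the dataframe in two, one for status and one for time:

# Split df into a status, time dataframe
# (select columns by name; label slices such as :'status_3' depend on the column order)
id_cols = ['cl_id', 'cl_template_id']
df_status = df[id_cols + ['status_1', 'status_2', 'status_3']]
df_time = df[id_cols + ['time_1', 'time_2', 'time_3']]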

Fourth, melt the status and time dataframes:

# Melt status
df_status = df_status.melt(id_vars = ['cl_id',
                                      'cl_template_id'],
                           value_vars = ['status_1',
                                         'status_2',
                                         'status_3'],
                           var_name = 'step',
                           value_name = 'status')

# Melt time
df_time = df_time.melt(id_vars = ['cl_id',
                                  'cl_template_id'],
                       value_vars = ['time_1',
                                     'time_2',
                                     'time_3'],
                       var_name = 'step',
                       value_name = 'time')

Fifth, clean up the 'step' column in the status and time dataframes, keeping only the number:

# Clean step in status, time
df_status.loc[:, 'step'] = df_status.loc[:, 'step'].str.partition('_')[2]
df_time.loc[:, 'step'] = df_time.loc[:, 'step'].str.partition('_')[2]
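As a small illustration of what `.str.partition('_')` returns, here is a sketch on a hypothetical series of column names. The cast to `int` is an optional extra that is not part of the answer above; it makes `step` numeric so it sorts as 1, 2, ..., 10 rather than as text.

```python
import pandas as pd

# Hypothetical 'step' values as they look right after melting
steps = pd.Series(['status_1', 'status_2', 'status_10'])

# partition splits at the first '_' into three columns: before, separator, after
parts = steps.str.partition('_')

# Column 2 holds the text after the separator; cast it to an integer
numbers = parts[2].astype(int)

print(parts)
print(numbers.tolist())
```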

Sixth, merge the status and time dataframes back together into the final dataframe:

# Merge status, time back together on cl_id, cl_template_id, step
final = df_status.merge(df_time,
                        how = 'inner',
                        on = ['cl_id',
                              'cl_template_id',
                              'step']).sort_values(by = ['cl_template_id',
                                                         'cl_id']).reset_index(drop = True)

Voilà! `final` now holds the long-format dataframe you were looking for.
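Because the question mentions 50 steps, listing every column by hand does not scale well. Below is a rough sketch of the same melt-then-merge approach with the column lists built programmatically. The helper name `melt_steps`, its parameters, and the sample frame are all hypothetical, and the sketch assumes the id columns together identify each row uniquely.

```python
import numpy as np
import pandas as pd

def melt_steps(df, id_cols, stubs=('status', 'time'), sep='_'):
    """Hypothetical helper: melt each stub group separately, then merge on ids + step."""
    result = None
    for stub in stubs:
        prefix = stub + sep
        # All columns belonging to this group, e.g. status_1 ... status_50
        value_cols = [c for c in df.columns if c.startswith(prefix)]
        part = df.melt(id_vars=id_cols, value_vars=value_cols,
                       var_name='step', value_name=stub)
        # Strip the prefix and keep the step number as an integer
        part['step'] = part['step'].str[len(prefix):].astype(int)
        result = part if result is None else result.merge(part, on=id_cols + ['step'])
    return result.sort_values(id_cols + ['step']).reset_index(drop=True)

# Illustrative wide frame with three steps
wide = pd.DataFrame({
    'cl_id': [18434, 18280],
    'cl_template_id': [107, 117],
    'status_1': [np.nan, 'yes'],
    'time_1': [np.nan, '2016-12-28T18:21:58+00:00'],
    'status_2': [np.nan, 'yes'],
    'time_2': [np.nan, '2016-12-28T20:47:31+00:00'],
    'status_3': [np.nan, 'yes'],
    'time_3': [np.nan, '2016-12-28T20:47:32+00:00'],
})

long = melt_steps(wide, ['cl_id', 'cl_template_id'])
print(long)
```

Melting each group on its own and merging afterwards keeps status and time as separate columns, which a single melt over all value columns would not do.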
