Pandas: Merge hierarchical data (answers)

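【Question title】: Pandas: Merge hierarchical data
【Posted】: 2014-09-05 10:18:57
【Question】:

I am looking for a way to merge data with a complex hierarchy into a pandas DataFrame. The hierarchy arises from different interdependencies within the data. For example, there are parameters that define how the data was produced, then there are time-dependent observables, spatially dependent observables, and observables that depend on both time and space.

To be more explicit: suppose I have the following data.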

import numpy as np
import pandas as pd

#  Parameters
t_max = 2
t_step = 15
sites = 4

# Purely time-dependent
t = np.linspace(0, t_max, t_step)
f_t = t**2 - t

# Purely site-dependent
position = np.array([[0, 0], [1, 0], [0, 1], [1, 1]])  # (x, y)
site_weight = np.arange(sites)

# Time-, and site-dependent.
occupation = np.arange(t_step*sites).reshape((t_step, sites))

# Time-, and site-, site-dependent
correlation = np.arange(t_step*sites*sites).reshape((t_step, sites, sites))

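(Of course, in the end I will have many such data sets, one for each set of parameters.)

Now I would like to store all of this in a pandas DataFrame. I imagine the end result looking something like this: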

| ----- parameters ----- | -------------------------------- observables --------------------------------- |
|                        |                                        | ---------- time-dependent ----------- |
|                        | ----------- site-dependent --- )       ( ------------------------ |            |
|                        |                                | - site2-dependent - |                         |
| sites | t_max | t_step | site | r_x | r_y | site weight | site2 | correlation | occupation | f_t | time |

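I suppose the partially overlapping hierarchies may be impossible to achieve. It is fine if they are only implicit, in the sense that I can get, e.g., all the site-dependent data by indexing the DataFrame in a specific way.

Also, if you think there is a better way to arrange this data in pandas, feel free to tell me.

Question

How can I construct a DataFrame that contains all of the above data and somehow reflects the interdependencies (e.g. f_t depends on time but not on site)? And all of that in a way that is generic enough that it is easy to add or remove certain observables, possibly with new interdependencies (e.g. a quantity that depends on a second time axis, like a time-time correlation).

What I have so far

In the following I will show how far I have got on my own. However, I don't think this is the ideal way of achieving the above, especially because it lacks generality with respect to adding or removing observables.

Indices

Given the above data, I started by defining all the (multi-)indices that I need.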

ind_time = pd.Index(t, name='time')
ind_site = pd.Index(np.arange(sites), name='site')
ind_site_site = pd.MultiIndex.from_product([ind_site, ind_site], names=['site', 'site2'])
ind_time_site = pd.MultiIndex.from_product([ind_time, ind_site], names=['time', 'site'])
ind_time_site_site = pd.MultiIndex.from_product([ind_time, ind_site, ind_site], names=['time', 'site', 'site2'])

Individual `DataFrame`s

Next, I created data frames for the individual chunks of data.

df_parms = pd.DataFrame({'t_max': t_max, 't_step': t_step, 'sites': sites}, index=[0])
df_time = pd.DataFrame({'f_t': f_t}, index=ind_time)
df_position = pd.DataFrame(position, columns=['r_x', 'r_y'], index=ind_site)
df_weight = pd.DataFrame(site_weight, columns=['site weight'], index=ind_site)
df_occupation = pd.DataFrame(occupation.flatten(), index=ind_time_site, columns=['occupation'])
df_correlation = pd.DataFrame(correlation.flatten(), index=ind_time_site_site, columns=['correlation'])

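The index=[0] in df_parms seems to be necessary, because otherwise pandas complains about scalar values only. In practice I would probably replace it with a timestamp of when this particular simulation was run. That would at least convey some useful information.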
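The per-observable frames above all follow the same pattern: flatten the array and pair it with the Cartesian product of its axes. A minimal sketch of a generic helper for that step is shown below. The name `observable_frame` is hypothetical (it is not a pandas function), and the small arrays are illustrative stand-ins for the data in the question.

```python
import numpy as np
import pandas as pd

def observable_frame(values, axes, name):
    """Flatten an n-dimensional array into a one-column, long-format DataFrame.

    `axes` is a list of named pd.Index objects, one per array dimension.
    Hypothetical helper for illustration; not part of pandas.
    """
    values = np.asarray(values)
    if values.shape != tuple(len(ax) for ax in axes):
        raise ValueError("array shape does not match the supplied axes")
    if len(axes) == 1:
        index = axes[0]
    else:
        # from_product varies the last axis fastest, matching C-order ravel()
        index = pd.MultiIndex.from_product(axes, names=[ax.name for ax in axes])
    return pd.DataFrame({name: values.ravel()}, index=index)

# Small stand-in data: 3 time steps, 2 sites
ind_time = pd.Index(np.linspace(0, 2, 3), name="time")
ind_site = pd.Index(np.arange(2), name="site")
ind_site2 = ind_site.rename("site2")

occupation = np.arange(3 * 2).reshape(3, 2)
correlation = np.arange(3 * 2 * 2).reshape(3, 2, 2)

df_occupation = observable_frame(occupation, [ind_time, ind_site], "occupation")
df_correlation = observable_frame(
    correlation, [ind_time, ind_site, ind_site2], "correlation"
)
```

With a helper like this, adding an observable with a new dependency (say, a second time axis) only requires passing one more index in `axes`.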

Merging the observables

With the data frames available, I merge all the observables into one big DataFrame.

df_all_but_parms = pd.merge(
  pd.merge(
    pd.merge(
      df_time.reset_index(),
      df_occupation.reset_index(),
      how='outer'
    ),
    df_correlation.reset_index(),
    how='outer'
  ),
  pd.merge(
    df_position.reset_index(),
    df_weight.reset_index(),
    how='outer'
  ),
  how='outer'
)

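This is the part I like least about my current approach. The merge function only works on pairs of data frames, and it requires them to have at least one common column. So I have to be careful about the order in which I join my data frames, and if I were to add an orthogonal observable, I could not merge it with the other data because they would not share a common column. Is there a function that achieves the same result with a single call on a list of data frames? I tried concat, but it does not merge common columns, so I end up with lots of duplicate time and site columns.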
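One possible way around the ordering problem is to fold a pairwise merge over a list of frames and fall back to a Cartesian product when two frames share no column. The sketch below assumes pandas 1.2 or later (for `how="cross"`). The helper name `merge_all` is hypothetical and the tiny frames are illustrative only.

```python
from functools import reduce

import pandas as pd

def merge_all(frames):
    """Outer-merge flat frames on their shared columns.

    If two frames share no column, combine them as a Cartesian product.
    Hypothetical helper for illustration; not part of pandas.
    """
    def merge_pair(left, right):
        shared = left.columns.intersection(right.columns).tolist()
        if shared:
            return pd.merge(left, right, on=shared, how="outer")
        return pd.merge(left, right, how="cross")  # requires pandas >= 1.2
    return reduce(merge_pair, frames)

# df_time and df_weight are "orthogonal": they have no column in common
df_time = pd.DataFrame({"time": [0.0, 1.0], "f_t": [0.0, 0.5]})
df_weight = pd.DataFrame({"site": [0, 1, 2], "site weight": [10, 20, 30]})
df_occupation = pd.DataFrame({
    "time": [0.0, 0.0, 0.0, 1.0, 1.0, 1.0],
    "site": [0, 1, 2, 0, 1, 2],
    "occupation": [0, 1, 2, 3, 4, 5],
})

merged = merge_all([df_time, df_weight, df_occupation])
```

The cross product grows multiplicatively with every orthogonal axis, so this is only practical while the individual axes stay small.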

Merging all the data

Finally, I merge my data with the parameters.

pd.concat([df_parms, df_all_but_parms], axis=1, keys=['parameters', 'observables'])

The end result, so far, looks like this:

         parameters                 observables                                                                       
              sites  t_max  t_step         time       f_t  site  occupation  site2  correlation  r_x  r_y  site weight
    0             4      2      15     0.000000  0.000000     0           0      0            0    0    0            0
    1           NaN    NaN     NaN     0.000000  0.000000     0           0      1            1    0    0            0
    2           NaN    NaN     NaN     0.000000  0.000000     0           0      2            2    0    0            0
    3           NaN    NaN     NaN     0.000000  0.000000     0           0      3            3    0    0            0
    4           NaN    NaN     NaN     0.142857 -0.122449     0           4      0           16    0    0            0
    ..          ...    ...     ...          ...       ...   ...         ...    ...          ...  ...  ...          ...
    235         NaN    NaN     NaN     1.857143  1.591837     3          55      3          223    1    1            3
    236         NaN    NaN     NaN     2.000000  2.000000     3          59      0          236    1    1            3
    237         NaN    NaN     NaN     2.000000  2.000000     3          59      1          237    1    1            3
    238         NaN    NaN     NaN     2.000000  2.000000     3          59      2          238    1    1            3
    239         NaN    NaN     NaN     2.000000  2.000000     3          59      3          239    1    1            3

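As you can see, this does not work very well, because only the first row is actually assigned the parameters. All other rows just have NaNs in place of the parameters. But since these are the parameters for all of the data, they should also be contained in all the other rows of this data frame.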
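The NaNs appear because concat with axis=1 aligns rows by index label, and df_parms only has the label 0. A minimal sketch of the difference, using small illustrative frames, contrasts that alignment with broadcasting each parameter as a scalar via `DataFrame.assign`:

```python
import pandas as pd

df_parms = pd.DataFrame({"t_max": 2, "t_step": 15, "sites": 4}, index=[0])
df_obs = pd.DataFrame({
    "time": [0.0, 0.0, 1.0],
    "site": [0, 1, 0],
    "occupation": [0, 1, 4],
})

# Index alignment: parameters land only on the row labelled 0
aligned = pd.concat([df_parms, df_obs], axis=1)

# Scalar broadcast: every row receives the parameter values
with_parms = df_obs.assign(**df_parms.iloc[0].to_dict())
```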

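As a small side question: how smart would pandas be if I were to store the above data frame in HDF5? Would I end up with lots of duplicated data, or would it avoid duplicate storage?

Update

Thanks to Jeff's answer, I was able to push all my data into one data frame with a generic merge. The basic idea is that all my observables already have some columns in common, namely the parameters.

First, I add the parameters to the data frames of all the observables.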

all_observables = [ df_time, df_position, df_weight, df_occupation, df_correlation ]
# Build a list rather than a lazy map object, so it can be iterated more than once in Python 3
flat = [df.reset_index() for df in all_observables]
for df in flat:
    for c in df_parms:
        df[c] = df_parms.loc[0,c]

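Then I can merge them all together with a reduction.

from functools import reduce  # reduce is no longer a builtin in Python 3

df_all = reduce(lambda a, b: pd.merge(a, b, how='outer'), flat)

The result has the desired form: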

         time       f_t  sites  t_max  t_step  site  r_x  r_y  site weight  occupation  site2  correlation
0    0.000000  0.000000      4      2      15     0    0    0            0           0      0            0
1    0.000000  0.000000      4      2      15     0    0    0            0           0      1            1
2    0.000000  0.000000      4      2      15     0    0    0            0           0      2            2
3    0.000000  0.000000      4      2      15     0    0    0            0           0      3            3
4    0.142857 -0.122449      4      2      15     0    0    0            0           4      0           16
5    0.142857 -0.122449      4      2      15     0    0    0            0           4      1           17
6    0.142857 -0.122449      4      2      15     0    0    0            0           4      2           18
..        ...       ...    ...    ...     ...   ...  ...  ...          ...         ...    ...          ...
233  1.857143  1.591837      4      2      15     3    1    1            3          55      1          221
234  1.857143  1.591837      4      2      15     3    1    1            3          55      2          222
235  1.857143  1.591837      4      2      15     3    1    1            3          55      3          223
236  2.000000  2.000000      4      2      15     3    1    1            3          59      0          236
237  2.000000  2.000000      4      2      15     3    1    1            3          59      1          237
238  2.000000  2.000000      4      2      15     3    1    1            3          59      2          238
239  2.000000  2.000000      4      2      15     3    1    1            3          59      3          239

By re-indexing the data, the hierarchy becomes a bit more apparent:

df_all.set_index(['t_max', 't_step', 'sites', 'time', 'site', 'site2'], inplace=True)

which results in

                                             f_t  r_x  r_y  site weight  occupation  correlation
t_max t_step sites time     site site2                                                          
2     15     4     0.000000 0    0      0.000000    0    0            0           0            0
                                 1      0.000000    0    0            0           0            1
                                 2      0.000000    0    0            0           0            2
                                 3      0.000000    0    0            0           0            3
                   0.142857 0    0     -0.122449    0    0            0           4           16
                                 1     -0.122449    0    0            0           4           17
                                 2     -0.122449    0    0            0           4           18
...                                          ...  ...  ...          ...         ...          ...
                   1.857143 3    1      1.591837    1    1            3          55          221
                                 2      1.591837    1    1            3          55          222
                                 3      1.591837    1    1            3          55          223
                   2.000000 3    0      2.000000    1    1            3          59          236
                                 1      2.000000    1    1            3          59          237
                                 2      2.000000    1    1            3          59          238
                                 3      2.000000    1    1            3          59          239
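With the axes in the index, the implicit hierarchy can be recovered by grouping on or slicing along index levels. The following is a sketch on a reduced stand-in for df_all (2 time steps, 2 sites); the numbers are illustrative and not taken from the output above.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product(
    [[0.0, 1.0], [0, 1], [0, 1]], names=["time", "site", "site2"]
)
df_all = pd.DataFrame(
    {
        "f_t": np.repeat([0.0, 0.5], 4),                     # depends on time only
        "site weight": np.tile(np.repeat([10, 20], 2), 2),   # depends on site only
        "correlation": np.arange(8),                         # time, site and site2
    },
    index=index,
)

# Purely time-dependent data: one value per time step
f_t = df_all["f_t"].groupby(level="time").first()

# Purely site-dependent data: one value per site
site_weight = df_all["site weight"].groupby(level="site").first()

# Cross-section: the site x site2 correlation matrix at time 1.0
corr_t1 = df_all["correlation"].xs(1.0, level="time").unstack("site2")
```

Taking `first()` per group relies on the value being constant within each group, which holds here because the flat frame simply repeats the lower-dimensional observables.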

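【Comments】:

You are trying to push a lot of things into one frame. Consider a multi-index (say, on the index) that enumerates the Cartesian product of the levels. Is your data n-dimensional like this? A multi-index on the columns is really just a labeling convention.
@Jeff Thanks for your comment. I'm not sure I understand what you are saying. Are you suggesting that I store these things in separate data frames? Or are you suggesting that I add a multi-index to the final table to structure the data better?
Why don't you show some of the operations you want to do (what kinds of selections, numerical operations, and so on)? It would be best if you could give an example output.
I wouldn't include df_parms in your frame; df_all_but_parms looks OK.
@Jeff Since I haven't got the data frame working yet, I don't have any code that uses it. Conceptually, though, this is what I want to do with it. First, I want to plot the observables as functions of time, with some of the parameters or the site index in the plot legend. I also want to do some basic arithmetic with them, e.g. kinetic_energy + interaction_energy. Or I want to add up the site occupations, weighted by the site weight, and then plot that as a function of time.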
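The last operation mentioned in the comments (site occupations weighted by site weight, summed per time step) can be done on the separate frames without building the full merged table. This is a sketch with small illustrative data, not code from the thread:

```python
import numpy as np
import pandas as pd

ind_time = pd.Index([0.0, 1.0], name="time")
ind_site = pd.Index([0, 1], name="site")
ind_time_site = pd.MultiIndex.from_product(
    [ind_time, ind_site], names=["time", "site"]
)

df_weight = pd.DataFrame({"site weight": [0, 1]}, index=ind_site)
df_occupation = pd.DataFrame({"occupation": np.arange(4)}, index=ind_time_site)

# Attach each site's weight to its occupation rows
occ = df_occupation.reset_index().merge(df_weight.reset_index(), on="site")

# Weighted sum over sites, one value per time step
weighted = (occ["occupation"] * occ["site weight"]).groupby(occ["time"]).sum()
```

The resulting Series is indexed by time, so calling its `plot()` method (with matplotlib installed) would draw it as a function of time.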

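Tags: python pandas merge dataframe

【Solution 1】:

I think you should do something like this, putting df_parms in as your index. That way you can easily concat more frames with different parameters.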

In [67]: pd.set_option('display.max_rows',10)

In [68]: dfx = df_all_but_parms.copy()

You need to assign the columns to the frame (you could also construct a multi-index directly, but this starts from your data).

In [69]: for c in df_parms.columns:
             dfx[c] = df_parms.loc[0,c]

In [70]: dfx
Out[70]: 
         time       f_t  site  occupation  site2  correlation  r_x  r_y  site weight  sites  t_max  t_step
0    0.000000  0.000000     0           0      0            0    0    0            0      4      2      15
1    0.000000  0.000000     0           0      1            1    0    0            0      4      2      15
2    0.000000  0.000000     0           0      2            2    0    0            0      4      2      15
3    0.000000  0.000000     0           0      3            3    0    0            0      4      2      15
4    0.142857 -0.122449     0           4      0           16    0    0            0      4      2      15
..        ...       ...   ...         ...    ...          ...  ...  ...          ...    ...    ...     ...
235  1.857143  1.591837     3          55      3          223    1    1            3      4      2      15
236  2.000000  2.000000     3          59      0          236    1    1            3      4      2      15
237  2.000000  2.000000     3          59      1          237    1    1            3      4      2      15
238  2.000000  2.000000     3          59      2          238    1    1            3      4      2      15
239  2.000000  2.000000     3          59      3          239    1    1            3      4      2      15

[240 rows x 12 columns]

Set the index (this returns a new object)

In [71]: dfx.set_index(['sites','t_max','t_step'])
Out[71]: 
                        time       f_t  site  occupation  site2  correlation  r_x  r_y  site weight
sites t_max t_step                                                                                 
4     2     15      0.000000  0.000000     0           0      0            0    0    0            0
            15      0.000000  0.000000     0           0      1            1    0    0            0
            15      0.000000  0.000000     0           0      2            2    0    0            0
            15      0.000000  0.000000     0           0      3            3    0    0            0
            15      0.142857 -0.122449     0           4      0           16    0    0            0
...                      ...       ...   ...         ...    ...          ...  ...  ...          ...
            15      1.857143  1.591837     3          55      3          223    1    1            3
            15      2.000000  2.000000     3          59      0          236    1    1            3
            15      2.000000  2.000000     3          59      1          237    1    1            3
            15      2.000000  2.000000     3          59      2          238    1    1            3
            15      2.000000  2.000000     3          59      3          239    1    1            3

[240 rows x 9 columns]
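To illustrate why having the parameters in the index helps, here is a sketch that stacks two runs with different parameters and then selects one of them by parameter value. The helper `run_frame` and the two small runs are hypothetical, made up for illustration.

```python
import pandas as pd

def run_frame(parms, observables):
    """Attach one run's parameters as index levels.

    Hypothetical helper for illustration; `parms` is a dict of scalars.
    """
    df = observables.assign(**parms)
    return df.set_index(list(parms))

# Two runs that differ in t_max (f_t = t**2 - t, as in the question)
obs_a = pd.DataFrame({"time": [0.0, 1.0], "f_t": [0.0, 0.0]})
obs_b = pd.DataFrame({"time": [0.0, 2.0], "f_t": [0.0, 2.0]})

runs = pd.concat([
    run_frame({"sites": 4, "t_max": 1, "t_step": 2}, obs_a),
    run_frame({"sites": 4, "t_max": 2, "t_step": 2}, obs_b),
])

# All rows that belong to the run with t_max == 2
run_b = runs.xs(2, level="t_max")
```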

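【Comments】:

Thanks, putting the parameters into the index is probably a good idea. Do you know whether there is a better way to construct df_all_but_parms? Ideally, a single call on a sequence of data frames that automatically merges them on their common columns.
It looks messy, but depending on your data there may not be an easier way. You could get away with df_position.join(df_weight), though (it is the same as your merge, but "cleaner").
By using your answer and adding the parameters as common columns to all data frames, I was able to merge them all with a generic merge operation. You can see the details in my edited question.
gr8! Keep in mind that you can partially reset_index as needed (to bring a level of the index into the column space, use the level argument).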
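The two suggestions from these comments can be sketched as follows, on small illustrative frames: `join` for frames that share an index, and a partial `reset_index` that moves a single level into the columns.

```python
import pandas as pd

ind_site = pd.Index([0, 1], name="site")
df_position = pd.DataFrame({"r_x": [0, 1], "r_y": [0, 0]}, index=ind_site)
df_weight = pd.DataFrame({"site weight": [0, 1]}, index=ind_site)

# join() aligns on the shared 'site' index, so no reset_index() is needed
df_site = df_position.join(df_weight)

df = pd.DataFrame(
    {"occupation": [0, 1, 2, 3]},
    index=pd.MultiIndex.from_product(
        [[0.0, 1.0], [0, 1]], names=["time", "site"]
    ),
)

# Partial reset: only 'site' becomes a column, 'time' stays in the index
by_time = df.reset_index(level="site")
```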
