Pandas: How to add new index levels by column values

【Question title】: Pandas: How to add new index levels by column values
【Posted】: 2019-08-28 13:12:34
【Question】:

I am trying to simplify some data evaluation on the following dataframe:

                                           3                     9
measurement_location voltage
NaN                  NaN      Gleichrichtung    ...  Gegenrichtung
                     NaN               > 50mm   ...  1mm < x < 5mm
B-Säule              9,5 V                 52   ...             41
                     13 V                  47   ...             55
                     15,5 V                61   ...             65
Scheibenmitte        9,5 V                 49   ...             60
                     13 V                  60   ...             57
                     15,5 V                69   ...             66
A-Säule              9,5 V                 46   ...             49
                     13 V                  50   ...             48
                     15,5 V                58   ...             58

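The dataframe was read from an Excel sheet containing a table that has index labels in its first two columns and in its first two rows. In effect it is a two-dimensional MultiIndex dataframe. The actual data starts at row 3, column 3.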
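For anyone who wants to reproduce this layout without the Excel file, here is a minimal sketch. It assumes the sheet was read without a header, so the raw frame has integer column labels; the name raw and the fully filled-in measurement_location column are assumptions for illustration, and only a few rows of the real data are included.

```python
import numpy as np
import pandas as pd

# Hypothetical raw frame, as it might look after reading the sheet without a header.
# Columns 0 and 1 hold the row labels; columns 3 and 9 hold two header rows plus data.
raw = pd.DataFrame({
    0: [np.nan, np.nan, "B-Säule", "B-Säule", "B-Säule"],
    1: [np.nan, np.nan, "9,5 V", "13 V", "15,5 V"],
    3: ["Gleichrichtung", "> 50mm", 52, 47, 61],
    9: ["Gegenrichtung", "1mm < x < 5mm", 41, 55, 65],
})

# Move the two label columns into the index and name the levels.
idf = raw.set_index([0, 1]).rename_axis(["measurement_location", "voltage"])
print(idf)
```

The two header rows end up with a (NaN, NaN) index entry, which matches the printout above.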

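measurement_location and voltage in the first two columns are the index.

The first two rows hold the values of a column-based index. I want to convert the table into a flat list of values, with new index levels built from the values in rows 1 and 2.

Goal: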

measurement_location voltage    direction       distance        value
B-Säule              9,5 V      Gleichrichtung  > 50mm          52    # col "3", 1st data row
                                Gegenrichtung   1mm < x < 5mm   41    # col "9", 1st data row
...
Scheibenmitte        9,5 V      Gleichrichtung  > 50mm          49    # column "3", 4th data row
                                Gegenrichtung   1mm < x < 5mm   60    # column "9", 4th data row
...

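So this means I want to create new index levels from the values of:

row 0, calling it "direction"
row 1, calling it "distance".

I tried adding a new index for a single row, but I cannot find a way to add it back to the MultiIndex...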

# get line of measurements
measurements = idf.iloc[2]
# get new "index" by values of each values column information
column_values = idf.iloc[0]
pd.DataFrame(measurements).set_index(column_values).unstack()

# yields:
                (nan, nan)
B-Säule  9,5 V  Gleichrichtung     52
                Gleichrichtung     53
                Gleichrichtung     54
                Gleichrichtung     50
                Gleichrichtung     55
                Gleichrichtung     56
                Gegenrichtung      41
                Gegenrichtung      42
                Gegenrichtung      43
dtype: object

Update: a minimal example with some data:

import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_product([
        ['A', 'B', 'C'],
        ['9', '13', '16']
    ],
    names=['measurement_location', 'voltage']
)

data = np.arange(36).reshape(9, 4)
df = pd.DataFrame(data, idx)


                               0   1   2   3
measurement_location voltage
A                    9         0   1   2   3
                     13        4   5   6   7
                     16        8   9  10  11
B                    9        12  13  14  15
                     13       16  17  18  19
                     16       20  21  22  23
C                    9        24  25  26  27
                     13       28  29  30  31
                     16       32  33  34  35

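In this minimal example, the values in each column share the same index tuple (like column "3" in the real data above: (Gleichrichtung, > 50mm)).

So for each value I need to extract its column's index tuple and assign it back to the existing MultiIndex.

As described under Goal, in the end I want one row per value: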

measurement_location voltage    direction       distance        value
B-Säule              9,5 V      Gleichrichtung  > 50mm          52

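I would like to avoid for loops and use pandas methods.

【Question comments】:

Could you post the sample data in a format that is easier to reproduce? Between all the whitespace characters and the ... truncation, it is hard to reproduce the input data.
Are Gleichrichtung & Gegenrichtung the names of the columns in the first dataset? If so, does that mean measurement_location & voltage are the index?
Gleichrichtung & Gegenrichtung are two (categorical) values that I want to use to create the new index levels. measurement_location & voltage are the index (MultiIndex).
@PeterLeimbigler: I added some minimal sample data (due to my lack of pandas knowledge, I do not know how to reproduce the original structure with the two additional rows that have NaN index values).

Tags: python pandas multi-index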

【Solution 1】:

I finally found a solution:

                                           3                     9
measurement_location voltage
NaN                  NaN      Gleichrichtung    ...  Gegenrichtung
                     NaN               > 50mm   ...  1mm < x < 5mm
B-Säule              9,5 V                 52   ...             41
                     13 V                  47   ...             55
                     15,5 V                61   ...             65
Scheibenmitte        9,5 V                 49   ...             60
                     13 V                  60   ...             57
                     15,5 V                69   ...             66
A-Säule              9,5 V                 46   ...             49
                     13 V                  50   ...             48
                     15,5 V                58   ...             58

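idf is the dataframe shown above:

import numpy as np
import pandas as pd

# indexed dataframe
# set_index() does not take index names as its second argument, so name the levels separately
idf = tempdf.set_index([0, 1]).rename_axis(['measurement_location', 'voltage'])
# create a new multi index from the first two rows
midx = pd.MultiIndex.from_arrays([idf.iloc[0].values, idf.iloc[1].values])
# map it to the column indexes (pd.np is not available in recent pandas versions, so use numpy directly)
idxdf = pd.DataFrame(np.arange(3, 12), index=midx)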

                                0
Gleichrichtung  > 50mm          3
                > 50mm          4
                > 50mm          5
                1mm < x < 5mm   6
                1mm < x < 5mm   7
                1mm < x < 5mm   8
Gegenrichtung   1mm < x < 5mm   9
                1mm < x < 5mm  10
                1mm < x < 5mm  11

# mapping column index to multi index
c2mi = dict(zip(idxdf.values.flat, idxdf.index.to_list()))
# create a series by using the columns values as a helper index for the reassignment
# of column index to the new multi index
new_df = idf[2:].stack().to_frame()
# assign the direction and distance to intermediate columns
# by mapping the respective element of the helper index (level 2)
# to its direction/distance value
# new_df.index.map -> index -> values of the corresponding series
new_df['direction'] = new_df.index.map(lambda idx: c2mi[idx[2]][0]).to_series().values
new_df['distance'] = new_df.index.map(lambda idx: c2mi[idx[2]][1]).to_series().values
# drop the helper index
new_df.index = new_df.index.droplevel(2)
# rename the original index
new_df.index.set_names(['measurement_location', 'voltage'], inplace=True)
# set the new index levels
new_df = new_df.set_index(['direction', 'distance'], append=True)

I am sure this can be done more cleanly. I just wanted to post it for completeness.
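One possibly cleaner variant, offered as a sketch rather than a tested replacement: turn the two header rows into column levels and let stack() do the reshaping. The original column label is kept as a third, temporary level so the column tuples stay unique even when several columns share the same (direction, distance) pair. The small idf built here is a hypothetical stand-in for the real frame, using a few values from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for idf: two header rows with a NaN index, then the data rows.
idx = pd.MultiIndex.from_tuples(
    [(np.nan, np.nan), (np.nan, np.nan), ("B-Säule", "9,5 V"), ("B-Säule", "13 V")],
    names=["measurement_location", "voltage"],
)
idf = pd.DataFrame(
    [["Gleichrichtung", "Gegenrichtung"],
     ["> 50mm", "1mm < x < 5mm"],
     [52, 41],
     [47, 55]],
    index=idx,
    columns=[3, 9],
)

# Keep only the data rows and make them numeric.
body = idf.iloc[2:].astype(int)

# Promote the two header rows to column levels; the original column label
# becomes a helper level that keeps the column tuples unique.
body.columns = pd.MultiIndex.from_arrays(
    [idf.iloc[0].to_numpy(), idf.iloc[1].to_numpy(), idf.columns],
    names=["direction", "distance", "column"],
)

# Move all column levels into the row index, then drop the helper level.
long_values = (
    body.stack(["direction", "distance", "column"])
        .droplevel("column")
        .rename("value")
)
print(long_values)
```

The row order within each (measurement_location, voltage) group can differ between pandas versions, so it is safer to look values up by label than by position.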

【Comments】:

【Solution 2】:

import numpy as np
import pandas as pd
from itertools import cycle

idx = pd.MultiIndex.from_product([
        ['B-Säule', 'Scheibenmitte', 'A-Säule'],
        ['9.5 V', '13 V', '15.5 V']
    ],
    names=['measurement_location', 'voltage']
)

data = np.arange(18).reshape(9, 2)
df = pd.DataFrame(data, idx, columns = [3, 9])

This gives the input data:

df
                                   3    9
measurement_location    voltage     
B-Säule                 9.5 V      0    1
                        13 V       2    3
                        15.5 V     4    5
Scheibenmitte           9.5 V      6    7
                        13 V       8    9
                        15.5 V     10   11
A-Säule                 9.5 V      12   13
                        13 V       14   15
                        15.5 V     16   17

First, clean the data up a little:

# Rename columns
df.rename({3: 'Gleichrichtung', 9: 'Gegenrichtung'}, axis=1, inplace=True)

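# Remove rows whose index contains NaN (if present).
# reindex() has no inplace argument and fails on duplicate labels, so filter with a boolean mask instead.
df = df[df.index.to_frame().notna().all(axis=1).to_numpy()]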

Now, concatenating and stacking the two columns gives the pattern you want in the value column:

df_new = pd.concat([df["Gleichrichtung"], df["Gegenrichtung"]], axis=1).stack().to_frame('value')
df_new.index.set_names('direction', level=2, inplace=True)   # Rename index

Finally, add the distance column:

seq = cycle(["> 50mm", "1mm < x < 5mm"])
df_new['distance'] = [next(seq) for count in range(df_new.shape[0])]

Then we have:

df_new

                                                value   distance
measurement_location    voltage direction       
B-Säule                 9.5 V   Gleichrichtung  0       > 50mm
                                Gegenrichtung   1       1mm < x < 5mm
                        13 V    Gleichrichtung  2       > 50mm
                                Gegenrichtung   3       1mm < x < 5mm
                        15.5 V  Gleichrichtung  4       > 50mm
                                Gegenrichtung   5       1mm < x < 5mm
Scheibenmitte           9.5 V   Gleichrichtung  6       > 50mm
                                Gegenrichtung   7       1mm < x < 5mm
                        13 V    Gleichrichtung  8       > 50mm
                                Gegenrichtung   9       1mm < x < 5mm
                        15.5 V  Gleichrichtung  10      > 50mm
                                Gegenrichtung   11      1mm < x < 5mm
A-Säule                 9.5 V   Gleichrichtung  12      > 50mm
                                Gegenrichtung   13      1mm < x < 5mm
                        13 V    Gleichrichtung  14      > 50mm
                                Gegenrichtung   15      1mm < x < 5mm
                        15.5 V  Gleichrichtung  16      > 50mm
                                Gegenrichtung   17      1mm < x < 5mm
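The cycle-based assignment above depends on the rows alternating in exactly that order. A sketch of an alternative that looks the distance up from the direction label instead; the dictionary distance_by_direction is an assumption for illustration and only works when each direction has exactly one distance:

```python
import numpy as np
import pandas as pd

# Reduced version of the sample frame, with the columns already renamed.
idx = pd.MultiIndex.from_product(
    [["B-Säule", "Scheibenmitte"], ["9.5 V", "13 V"]],
    names=["measurement_location", "voltage"],
)
df = pd.DataFrame(
    np.arange(8).reshape(4, 2),
    index=idx,
    columns=["Gleichrichtung", "Gegenrichtung"],
)

# One row per value; the former column labels become the "direction" level.
df_new = df.stack().to_frame("value")
df_new.index = df_new.index.set_names("direction", level=2)

# Hypothetical lookup table: direction label -> distance label.
distance_by_direction = {
    "Gleichrichtung": "> 50mm",
    "Gegenrichtung": "1mm < x < 5mm",
}
df_new["distance"] = (
    df_new.index.get_level_values("direction").map(distance_by_direction).to_numpy()
)
print(df_new)
```

Because the distance is derived from the index label, the result stays correct if the rows are later sorted or filtered.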

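【Comments】:

The first two rows of each column (the ones with index (NaN, NaN) and (NaN)) hold two index values per column (direction and distance), so distance is not a "value". Would it be possible to take the first two rows as a MultiIndex and apply it to the stacked dataframe via the original column index?
@maggie Yes, I forgot to include that; I have now edited my answer. I would probably drop those rows with df.reindex(df.index.dropna(), inplace=True) and then hope everything else still works.
I was thinking of something like midx = pd.MultiIndex.from_arrays([df.iloc[0].values, df.iloc[1].values]) and mapping it as pd.DataFrame(np.arange(3, 12), index=midx), because the renaming part is not actually "correct": each column carries two levels of the new index.
@maggie That might work, yes. If you can add the NaN values to the minimal example in the original post, I will take a look.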