Select columns in a pandas DataFrame and group them with a MultiIndex: answers

【Question title】: Select columns in pandas dataframe and group them with multiindex
【Posted】: 2020-08-14 13:30:54
【Question】:

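I have a huge DataFrame with 126 columns. I want to add an extra column level (a MultiIndex) with 5 categories, so that each of the 126 columns belongs to its corresponding category. Many of the solutions I found define the level and then list every column to attach to it, which is very time-consuming because I have to group 126 columns. Is there a faster way to do this? For example, by slicing columns with something like .iloc[:,9:44], since I want to put those 35 columns into one category?

The DataFrame looks like this: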

    df
        a    b     c...  d    e     f...  g    h    i...  j    k    l... n=126

 1     1.0  1.0   1.0   2.0   3.0   2.0   1.0  1.0  1.0  2.0   3.0   2.0 
 2     4.0  5.0   4.0   4.0   8.0   4.0   4.0  5.0  4.0  4.0   8.0   4.0
 3     6.0  1.0   6.0   7.0   8.0   7.0   6.0  1.0  6.0  7.0   8.0   7.0

The desired result looks like this:

    df2
              A          |        B         |       C          |       D    n=5
        a    b     c...  |  d     e    f... |  g    h     i... |   j   k  l n=126 

1      1.0  1.0   1.0    2.0  3.0   2.0    1.0  1.0   1.0    2.0  3.0   2.0
2      4.0  5.0   4.0    4.0  8.0   4.0    4.0  5.0   4.0    4.0  8.0   4.0
3      6.0  1.0   6.0    7.0  8.0   7.0    6.0  1.0   6.0    7.0  8.0   7.0

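【Comments】:

What is the logic for adding the level?
a, b, c, d, etc. are the names of biomarkers, and A, B, C, D are biomarker groups. I want to add the group level to get a better ordering.

Tags: pandas dataframe slice multi-index

【Solution 1】: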

If you want to assign every N consecutive columns to its own category, build a dictionary from chunks of N columns and then use map:

#https://stackoverflow.com/a/312464/2901002
def chunks(lst, n):
    """Yield successive n-sized chunks from lst."""
    for i in range(0, len(lst), n):
        yield lst[i:i + n]

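# One label per chunk; zip stops at the shorter of L and the chunks
L = ['A','B','C','D']
# Map each column name to the label of the 3-column chunk it falls in
d = {v: k for k, x in zip(L, chunks(df.columns, 3)) for v in x}
print(d)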
{'a': 'A', 'b': 'A', 'c': 'A', 
 'd': 'B', 'e': 'B', 'f': 'B', 
 'g': 'C', 'h': 'C', 'i': 'C', 
 'j': 'D', 'k': 'D', 'l': 'D'}

df.columns = [df.columns.map(d), df.columns]
print (df)

     A              B              C              D          
     a    b    c    d    e    f    g    h    i    j    k    l
1  1.0  1.0  1.0  2.0  3.0  2.0  1.0  1.0  1.0  2.0  3.0  2.0
2  4.0  5.0  4.0  4.0  8.0  4.0  4.0  5.0  4.0  4.0  8.0  4.0
3  6.0  1.0  6.0  7.0  8.0  7.0  6.0  1.0  6.0  7.0  8.0  7.0

Edit: If you need to set the columns by position:

d1 = {'A':df.columns[0:3],
      'B':df.columns[3:6],
      'C':df.columns[6:9],
      'D':df.columns[9:12]}
print (d1)
{'A': Index(['a', 'b', 'c'], dtype='object'), 
 'B': Index(['d', 'e', 'f'], dtype='object'), 
 'C': Index(['g', 'h', 'i'], dtype='object'), 
 'D': Index(['j', 'k', 'l'], dtype='object')}

# Invert {group: columns} into {column: group}
d = {v: k for k, x in d1.items() for v in x}
print(d)
{'a': 'A', 'b': 'A', 'c': 'A', 
 'd': 'B', 'e': 'B', 'f': 'B', 
 'g': 'C', 'h': 'C', 'i': 'C', 
 'j': 'D', 'k': 'D', 'l': 'D'}

df.columns = [df.columns.map(d), df.columns]
print (df)
     A              B              C              D          
     a    b    c    d    e    f    g    h    i    j    k    l
1  1.0  1.0  1.0  2.0  3.0  2.0  1.0  1.0  1.0  2.0  3.0  2.0
2  4.0  5.0  4.0  4.0  8.0  4.0  4.0  5.0  4.0  4.0  8.0  4.0
3  6.0  1.0  6.0  7.0  8.0  7.0  6.0  1.0  6.0  7.0  8.0  7.0
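
Once the group level is in place, the groups can be used for selection. The following is a small sketch on a reduced four-column frame (the data and the names group_a, single and by_name are made up for illustration, not taken from the question):

```python
import pandas as pd

# Reduced stand-in for the question's frame: two columns per group.
df = pd.DataFrame(
    [[1.0, 1.0, 2.0, 3.0], [4.0, 5.0, 4.0, 8.0], [6.0, 1.0, 7.0, 8.0]],
    index=[1, 2, 3],
    columns=["a", "b", "d", "e"],
)
d = {"a": "A", "b": "A", "d": "B", "e": "B"}

# Same assignment as in the answer: outer level = group, inner level = name.
df.columns = [df.columns.map(d), df.columns]

group_a = df["A"]                      # all columns of group A (outer level dropped)
single = df[("B", "e")]                # one column, addressed as (group, name)
by_name = df.xs("e", axis=1, level=1)  # select by the inner level only
```

Selecting with df["A"] returns a regular single-level frame, which is convenient when each biomarker group is analysed separately.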

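【Comments】:

This works, thanks. But some columns in my dataset get assigned to the wrong group. Is it possible to define exactly which columns belong to which category without writing out the names? For example df.iloc[:,0:8] = group1, df.iloc[:,9:44] = group2, and so on?
@Kuki - Sure, give me a moment.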
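
For groups of unequal size, one possible approach is to describe each group by a positional (start, stop) pair and build the outer level from that. This is a sketch under assumptions: the toy 12-column frame, the bounds dictionary and the level names "group" and "marker" are hypothetical and would need to be replaced with the real boundaries. Positional slices are half-open, so 0:8 followed by 9:44 would leave position 8 without a group; the check below catches such gaps.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the 126-column data: 12 columns named a..l.
cols = list("abcdefghijkl")
df = pd.DataFrame(
    np.arange(36, dtype=float).reshape(3, 12), columns=cols, index=[1, 2, 3]
)

# Hypothetical group boundaries as half-open positional ranges (start, stop).
# For the real data these might be e.g. (0, 9), (9, 44), ...
bounds = {"A": (0, 2), "B": (2, 7), "C": (7, 8), "D": (8, 12)}

# One group label per column position; None marks positions not covered.
groups = [None] * df.shape[1]
for label, (start, stop) in bounds.items():
    for pos in range(start, stop):
        groups[pos] = label

# Fail early if the ranges leave a gap.
missing = [df.columns[i] for i, g in enumerate(groups) if g is None]
if missing:
    raise ValueError(f"columns without a group: {missing}")

# Outer level = group label, inner level = original column name.
df.columns = pd.MultiIndex.from_arrays(
    [groups, df.columns], names=["group", "marker"]
)
print(df.columns.get_level_values("group").tolist())
```

If two ranges overlap, the later entry in bounds silently wins in this sketch, so it may be worth adding a check for that as well.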