Python: Converting multiple binary columns to single categorical column - Answers

【Question title】: Python: Converting multiple binary columns to single categorical column
【Posted】: 2018-04-24 18:19:18
【Question】:

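I have a data set in a csv file with 170 columns. The first 5 columns contain the unique identifiers (Platform, ID, Date, Length of Call, Name). The remaining 165 columns contain binary data covering 10 categories. I would like to condense those columns so that my data frame has 15 columns. An example is included below: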

import pandas as pd

df1 = pd.DataFrame({
    'Platform': ['Telephone', 'Chat', 'Text'],
    'ID': [1, 2, 3],
    'Length': [1545, 1532, 1511],
    'Name': ['andy', 'helen', 'peter'],
    'Problem: A': [0, 1, 0], 'Problem: B': [1, 0, 0], 'Problem: C': [0, 0, 1],
    'Solution: A': [0, 1, 0], 'Solution: B': [1, 0, 0], 'Solution: C': [0, 0, 1],
})

The output is:

df.head()

ID  Date        Length  \
1   2015-10-16    1545
2   2015-10-09    1532
3   2015-10-13    1511 

Name Problem: A Problem: B  Problem: C  Solution: A Solution: B Solution: C
andy         0          1           0            0           1           0
helen        1          0           0            1           0           0
peter        0          0           1            0           0           1

What I would like the data frame to look like:

  Platform   ID  Length  Name   Problem  Solution
  Telephone  1   1545    andy   B        B
  Chat       2   1532    helen  A        A
  Text       3   1511    peter  C        C

FYI, this is not the full data frame. There are 170 columns in total, which I would like to convert into 15.

【Comments】:

Possible duplicate of stackoverflow.com/questions/26762100/…

Tags: python pandas binary categorical-data

【Solution 1】:

You can use groupby + apply with a dot product over the columns:

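# Assumes df holds only 'Name' plus the binary "Group: Label" columns.
df = df.set_index('Name')
# Group columns by the prefix before ':'; within each group, dot the 0/1
# values with the labels after ': ' (groupby axis=1 is deprecated in pandas 2.1+).
df.groupby(df.columns.str.split(':').str[0], axis=1).apply(
    lambda x: x.dot(x.columns.str.split(': ').str[1])
)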

      Problem Solution
Name                  
andy        B        B
helen       A        A
peter       C        C
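
On a recent pandas release, where groupby(..., axis=1) is deprecated, the same dot-product idea can be applied one prefix at a time. What follows is only a sketch, not the answerer's code: the SEP constant, the groups dict, and the rule that any column containing ': ' is a binary indicator are assumptions made for illustration. Unlike the snippet above, it keeps Platform, ID, Length and Name in the result.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    "Platform": ["Telephone", "Chat", "Text"],
    "ID": [1, 2, 3],
    "Length": [1545, 1532, 1511],
    "Name": ["andy", "helen", "peter"],
    "Problem: A": [0, 1, 0], "Problem: B": [1, 0, 0], "Problem: C": [0, 0, 1],
    "Solution: A": [0, 1, 0], "Solution: B": [1, 0, 0], "Solution: C": [0, 0, 1],
})

SEP = ": "  # assumed separator between the group prefix and the category label

# Assumption: only columns containing the separator are binary indicators.
indicator_cols = [c for c in df1.columns if SEP in c]
result = df1.drop(columns=indicator_cols)

# Collect the indicator columns under their prefix, keeping the original order.
groups = {}
for col in indicator_cols:
    groups.setdefault(col.split(SEP, 1)[0], []).append(col)

for prefix, cols in groups.items():
    # Object dtype so that 0/1 * label falls back to Python string repetition.
    labels = np.array([c.split(SEP, 1)[1] for c in cols], dtype=object)
    result[prefix] = df1[cols].dot(labels)

print(result)
```

Because the indicator columns are selected explicitly, the identifier columns never reach the dot product, which only makes sense for the 0/1 columns.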

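【Comments】:

When I do this, I get a "ValueError: Grouper and axis must be the same length" message. I am sure it is something in the code that I do not understand. Would you mind explaining what each piece of the suggested code does? I am familiar with df.groupby() and .apply(), but not with lambda or x.dot.
As far as I can tell, this specifies that the column name will be the word before the (' : ') and the value will be the word after the (' : ').
@REFER That means the df.columns.str.split(':').str[0] you are passing and the number of columns in the data frame are not the same length. Can you take a look?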
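
Since the comment above asks what the lambda and x.dot actually do, here is a plain-Python sketch of the computation for a single row of the "Problem" group. It is an illustration of the idea rather than what pandas runs internally, and the variable names (columns, row, products, category) are made up for the example.

```python
# One row of the "Problem" group and the column names it came from.
columns = ["Problem: A", "Problem: B", "Problem: C"]
row = [0, 1, 0]

# The groupby key is the text before ':'; the labels are the text after ': '.
group = columns[0].split(":")[0]               # 'Problem'
labels = [c.split(": ")[1] for c in columns]   # ['A', 'B', 'C']

# A dot product is a sum of element-wise products. With strings,
# 0 * 'A' is '' and 1 * 'B' is 'B', and adding strings concatenates them.
products = [value * label for value, label in zip(row, labels)]
category = "".join(products)

print(group, labels, products, category)
```

One consequence of this mechanism: a row with more than one 1 in the same group would produce a concatenated label such as 'AB', so the approach relies on each group being one-hot.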

【Solution 2】:

I wrote this custom function, which should do the job. I got the idea from this stackoverflow article

def condenseCols(data, finalCol, *cols):
    cols = list(cols)
    x = data[cols]  # Slice the cols
    x = x.idxmax(axis=1)
    # x is now a Series holding, for each row, the name of the column with the max value (one of cols)
    x = x.apply(lambda s: s.split(": ")[1])  # keep only the label after ": " (A, B, C)

    data[finalCol] = x
    data.drop(cols, axis=1, inplace=True)  # Drop the original columns in place; drop() returns None here
    return data

Call this function by passing the data frame, the name of the final column, and then the names of the columns to condense:

condenseCols(df1,'Problem','Problem: A','Problem: B','Problem: C')
condenseCols(df1,'Solution','Solution: A','Solution: B','Solution: C')
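
One caveat worth checking before relying on idxmax: for a row where every indicator is 0, idxmax still returns the first column name, so that row would silently be labelled with the first category. A minimal sketch of masking such rows follows; the sample frame and the names block, winners and labels are assumptions for illustration, and whether an all-zero row should become missing is a decision for your data.

```python
import pandas as pd

# Hypothetical group in which the last row has no category set at all.
block = pd.DataFrame({
    "Problem: A": [0, 1, 0],
    "Problem: B": [1, 0, 0],
    "Problem: C": [0, 0, 0],
})

# idxmax returns the first column holding the row maximum,
# so the all-zero row comes back as 'Problem: A'.
winners = block.idxmax(axis=1)

# Keep the label after ': ' and blank out rows where no indicator is set.
labels = winners.str.split(": ").str[1].where(block.any(axis=1))

print(labels)
```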

There are other ways to do this, as described in the stackoverflow article

【Comments】: