Python Pandas: How to create a binary matrix from a column of lists? (Answers)

【Question title】: Python Pandas: How to create a binary matrix from column of lists?
【Posted】: 2016-07-29 00:33:22
【Question description】:

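I have a Python Pandas DataFrame like the following:

Here "a, b" is a string that represents a list of user features.

How can I convert this into a binary matrix of user features, like the following: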

     a    b    c    d    e
0    1    1    0    0    0
1    0    0    1    0    0
2    0    0    0    1    0
3    0    0    0    0    1

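I saw a similar question, "Creating boolean matrix from one column with pandas", but the column there does not contain list entries.

I have tried the approaches below. Is there a way to combine the two?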

pd.get_dummies()

pd.get_dummies(df[1])


   a, b  c  d  e
0     1  0  0  0
1     0  1  0  0
2     0  0  1  0
3     0  0  0  1

df[1].apply(lambda x: pd.Series(x.split()))

I am also interested in other ways to create this type of binary matrix!

Any help is appreciated!

Thanks

【Question discussion】:

Tags: python pandas dataframe sparse-matrix binary-matrix

【Solution 1】:

A while ago I wrote a general-purpose function that also supports grouping:

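import numpy as np
import pandas as pd

def sublist_uniques(data, sublist):
    # Collect every distinct label that appears in the list column.
    categories = set()
    for _, row in data.iterrows():
        try:
            categories.update(row[sublist])
        except TypeError:
            pass  # skip entries that are not iterable, e.g. NaN
    # Built from a set, so the column order is not guaranteed.
    return list(categories)

def sublists_to_dummies(f, sublist, index_key=None):
    categories = sublist_uniques(f, sublist)
    frame = pd.DataFrame(columns=categories)
    for idx, row in f.iterrows():
        labels = row[sublist]
        if not isinstance(labels, (list, np.ndarray)):
            continue  # skip rows whose entry is not a list or array
        # Indicator vector for this row: 1 where the label is present.
        indicator = np.zeros(len(categories))
        for j in labels:
            indicator[categories.index(j)] = 1
        key = idx if index_key is None else row[index_key]
        if index_key is not None and key in frame.index:
            # Group already seen: add this row's labels to its counts.
            for j in labels:
                frame.loc[key, j] += 1
        else:
            frame.loc[key] = indicator
    return frame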

In [15]: a
Out[15]:
   a group     labels
0  1   new     [a, d]
1  2   old  [a, g, h]
2  3   new  [i, m, a]

In [16]: sublists_to_dummies(a,'labels')
Out[16]:
   a  d  g  i  h  m
0  1  1  0  0  0  0
1  1  0  1  0  1  0
2  1  0  0  1  0  1

In [17]: sublists_to_dummies(a,'labels','group')
Out[17]:
     a  d  g  i  h  m
new  2  1  0  1  0  1
old  1  0  1  0  1  0
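For comparison, here is a minimal vectorized sketch that reaches the same two tables without an explicit row loop. It is not the answer author's function: it swaps in `DataFrame.explode` (available from pandas 0.25) and `pd.crosstab`. The frame `a` is reconstructed from the `In [15]` output above on the assumption that the labels are plain strings, and the names `exploded`, `per_row` and `per_group` are illustrative only.

```python
import pandas as pd

# Reconstructed from the In [15] output above (labels assumed to be strings).
a = pd.DataFrame({
    "a": [1, 2, 3],
    "group": ["new", "old", "new"],
    "labels": [["a", "d"], ["a", "g", "h"], ["i", "m", "a"]],
})

# One row per (original row, label); reset_index keeps the original
# row label in a regular column named "index".
exploded = a.explode("labels").reset_index()

# Count labels per original row (0/1 when a row has no repeated labels).
per_row = pd.crosstab(exploded["index"], exploded["labels"])

# Count labels per group, like sublists_to_dummies(a, 'labels', 'group').
per_group = pd.crosstab(exploded["group"], exploded["labels"])

print(per_row)
print(per_group)
```

Unlike the loop version, `pd.crosstab` sorts the label columns, so the column order may differ from the output shown above.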

【Discussion】:

【Solution 2】:

I think you can use:

# Parentheses are needed so the chained calls can span several lines.
df = (df.iloc[:, 0].str.split(', ', expand=True)
        .stack()
        .reset_index(drop=True)
        .str.get_dummies())

print(df)
   a  b  c  d  e
0  1  0  0  0  0
1  0  1  0  0  0
2  0  0  1  0  0
3  0  0  0  1  0
4  0  0  0  0  1
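The output above has five rows because `stack()` emits one row per feature and `reset_index(drop=True)` then discards the original row labels. A minimal sketch of one way to keep the stacked approach and still get one row per original record is to leave the index in place and collapse it with `groupby(level=0).max()`. The input frame is an assumption reconstructed from the question (a single column labelled `1` holding comma-separated strings), and the names `stacked` and `binary` are illustrative.

```python
import pandas as pd

# Assumed input, reconstructed from the question.
df = pd.DataFrame({1: ["a, b", "c", "d", "e"]})

# One row per feature, indexed by (original row, position in the list).
stacked = df.iloc[:, 0].str.split(", ", expand=True).stack().dropna()

# Indicator columns per feature, then collapse back to one row
# per original record by taking the max within each original row.
binary = stacked.str.get_dummies().groupby(level=0).max()

print(binary)
```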

Edited:

print(df.iloc[:,0].str.replace(' ','').str.get_dummies(sep=','))
   a  b  c  d  e
0  1  1  0  0  0
1  0  0  1  0  0
2  0  0  0  1  0
3  0  0  0  0  1
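One caveat with `str.replace(' ', '')` is that it removes every space, including spaces inside a feature name. A hedged variant, assuming feature names might contain internal spaces, normalizes only the whitespace around the commas before calling `str.get_dummies`. The input frame is again an assumption reconstructed from the question, and `cleaned` and `binary` are illustrative names.

```python
import pandas as pd

# Assumed input, reconstructed from the question.
df = pd.DataFrame({1: ["a, b", "c", "d", "e"]})

# Remove whitespace around the separators only, then trim the ends,
# so a feature such as "big spender" would keep its internal space.
cleaned = df[1].str.replace(r"\s*,\s*", ",", regex=True).str.strip()

# One indicator column per distinct feature.
binary = cleaned.str.get_dummies(sep=",")

print(binary)
```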

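【Discussion】:

There is no need to chain this many operations together just to make it a one-liner.
Interestingly, this works for 10,000 rows, but the IPython kernel dies from 100,000 rows upward. I will try computing in chunks of 10,000 rows and concatenating them vertically.
@jezrael, I realized this actually adds an extra row, which is not what I want. Is there any way to fix that?
I don't understand, can you explain?
@jezrael, the original matrix only has rows 0-3, and that should be preserved in the output. I will update the output in my question now!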
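The chunking idea mentioned in the comments could be sketched roughly as follows. `get_dummies_in_chunks` is a hypothetical helper, not something from the thread or from pandas, and whether it avoids the kernel crash depends on how many distinct features there are, since the final dense result still has to fit in memory.

```python
import pandas as pd

def get_dummies_in_chunks(series, sep=",", chunk_size=10_000):
    """Hypothetical helper: build the indicator matrix chunk by chunk."""
    parts = []
    for start in range(0, len(series), chunk_size):
        chunk = series.iloc[start:start + chunk_size]
        parts.append(chunk.str.replace(" ", "").str.get_dummies(sep=sep))
    if not parts:
        return pd.DataFrame(index=series.index)
    # Chunks can see different features; concat aligns the columns and
    # leaves NaN where a chunk lacked a feature, so fill those with 0.
    combined = pd.concat(parts).fillna(0).astype(int)
    return combined.sort_index(axis=1)

# Tiny demonstration with assumed data and an artificially small chunk size.
features = pd.Series(["a, b", "c", "d", "e"])
binary = get_dummies_in_chunks(features, chunk_size=2)
print(binary)
```

The original row labels are preserved because each chunk keeps its slice of the index.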