Do I have to loop? Is there a faster way to build dummy variables? (Answers)

【Question title】: Do I have to loop? Is there a faster way to build dummy variables?
【Posted】: 2020-09-16 21:56:24
【Question description】:

I have some plant data that looks like this (but with up to 7 attributes):

     Unnamed: 0     plant          att_1           att_2 ...
0            0     plant_a         sunlover        tall
1            1     plant_b         waterlover      sunlover
2            2     plant_c         fast growing    sunlover

I tried using pandas get_dummies, as in:

df = pd.DataFrame({'A': ['a', 'b', 'a'], 'B': ['b', 'a', 'c'],'C': [1, 2, 3]})

pd.get_dummies(df, prefix=['col1', 'col2'])

   C  col1_a  col1_b  col2_a  col2_b  col2_c
0  1       1       0       0       1       0
1  2       0       1       1       0       0
2  3       1       0       0       0       1

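But sunlover should be encoded as 1 whether it appears in att_1 or in att_2. That way I would get about 30 dummy variables instead of 7 * 30 = 210. I tried looping over the whole set and setting the value for each dummy: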

for count, plants in enumerate(data_plants.iterrows()):
    print("First", count, plants)
    for attribute in plants:
        print("Second", count, attribute)

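The code only prints, because I already saw how much time the loop wastes. It works, but it is not fast enough for 100k or more rows. I thought about using .value_counts() to get the attributes and then setting the matching dummy variable in the DataFrame to 1, but then I would overwrite the attribute. At the moment I am a bit lost and out of ideas. Perhaps I have to use another package?

The goal is something like this: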

     Unnamed: 0     plant          att_1           att_2       sunlover      waterlover     tall  ...
0            0     plant_a         sunlover        tall        1             0              1
1            1     plant_b         waterlover      sunlover    1             1              0
2            2     plant_c         fast growing    sunlover    1             0              0

【Question discussion】:

You should use numpy.unique on the flattened DataFrame values and then reshape the result.
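
The comment gives no code, so here is a minimal sketch of one way to read it, assuming the attribute columns contain no missing values. The sample frame and the names att_cols, labels, codes and dummies are illustrative and not from the question.

```python
import numpy as np
import pandas as pd

# Sample data modelled on the question (assumption: no NaN in the attributes)
df = pd.DataFrame({
    'plant': ['plant_a', 'plant_b', 'plant_c'],
    'att_1': ['sunlover', 'waterlover', 'fast growing'],
    'att_2': ['tall', 'sunlover', 'sunlover'],
})
att_cols = ['att_1', 'att_2']

values = df[att_cols].to_numpy()          # shape (n_rows, n_attribute_columns)

# labels: sorted unique attribute names
# codes:  for every flattened cell, the position of its value in labels
labels, codes = np.unique(values.ravel(), return_inverse=True)
codes = codes.reshape(values.shape)       # back to one row of codes per plant

# Set dummies[i, j] = 1 whenever plant i has attribute labels[j] in any column
dummies = np.zeros((len(df), len(labels)), dtype=np.int8)
rows = np.arange(len(df))[:, None]        # broadcasts against codes
dummies[rows, codes] = 1

result = df.join(pd.DataFrame(dummies, columns=labels, index=df.index))
print(result)
```

Because the work is done with array indexing rather than a Python loop over rows, this should scale to the 100k rows mentioned in the question, though I have not timed it.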

Tags: python pandas variables regression dummy-variable

【Solution 1】:

Use get_dummies together with max:

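c = ['att_1', 'att_2']
# The empty prefix gives the dummies from att_1 and att_2 the same column names;
# max over equal column names (level=0) then merges them into one column each.
# Note: the level argument of max() was removed in pandas 2.0, so this line
# needs an older pandas version.
df1 = df.join(pd.get_dummies(df[c], prefix='', prefix_sep='').max(axis=1, level=0))
print(df1)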
     plant         att_1     att_2  fast growing  sunlover  waterlover  tall
0  plant_a      sunlover      tall             0         1           0     1
1  plant_b    waterlover  sunlover             0         1           1     0
2  plant_c  fast growing  sunlover             1         1           0     0
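
The call above relies on max(axis=1, level=0), which is no longer available in pandas 2.0 and later. A sketch of the same idea that should also run on current releases is below. The transpose plus groupby(level=0) is my substitution for the level argument, not part of the original answer, and dtype=int is added so the dummies are integers rather than booleans.

```python
import pandas as pd

# Sample data modelled on the question
df = pd.DataFrame({
    'plant': ['plant_a', 'plant_b', 'plant_c'],
    'att_1': ['sunlover', 'waterlover', 'fast growing'],
    'att_2': ['tall', 'sunlover', 'sunlover'],
})
c = ['att_1', 'att_2']

# One 0/1 column per (source column, value); the empty prefix means that
# 'sunlover' from att_1 and 'sunlover' from att_2 share a column name.
dummies = pd.get_dummies(df[c], prefix='', prefix_sep='', dtype=int)

# Merge equally named columns: transpose so the names become the index,
# take the max within each name, and transpose back.
merged = dummies.T.groupby(level=0, sort=False).max().T

df1 = df.join(merged)
print(df1)
```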

Performance on 3k rows; it will likely differ on real data:

df = pd.concat([df] * 1000, ignore_index=True)


In [339]: %%timeit
     ...: 
     ...: c = ['att_1', 'att_2']
     ...: df1 = df.join(pd.get_dummies(df[c], prefix='', prefix_sep='').max(axis=1, level=0))
     ...: 
     ...: 
10.7 ms ± 1.11 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [340]: %%timeit
     ...: attCols = df[['att_1', 'att_2']]
     ...: colVals = pd.Index(np.sort(attCols.stack().unique()))
     ...: def myDummies(row):
     ...:     return pd.Series(colVals.isin(row).astype(int), index=colVals)
     ...: 
     ...: df1 = df.join(attCols.apply(myDummies, axis=1))
     ...: 
     ...: 
1.03 s ± 22 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Another solution:

In [133]: %%timeit
     ...: c = ['att_1', 'att_2']
     ...: df1 = (df.join(pd.DataFrame([dict.fromkeys(x, 1) for x in df[c].to_numpy()])
     ...:                  .fillna(0)
     ...:                  .astype(np.int8)))
     ...:                  
13.1 ms ± 723 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

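【Discussion】:

Thanks for the quick answer. I will use your code. It takes half the time of the other answer and is much faster than my loop.
@NilsGradwanderer - added a new solution that may be faster if there are many distinct categories.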

【Solution 2】:

What you need is similar to get_dummies in some respects, but you should go about it differently.

Define a view of df limited to your "attribute" columns:

attCols = df[['att_1', 'att_2']]

In your target version, add the other "attribute" columns here.

Then define an Index containing the unique attribute names:

colVals = pd.Index(np.sort(attCols.stack().unique()))

The third step is to define a function that computes the result for the current row:

def myDummies(row):
    return pd.Series(colVals.isin(row).astype(int), index=colVals)

The last step is to apply this function to each row of attCols and join the result to df:

df = df.join(attCols.apply(myDummies, axis=1))

The result for your sample data is:

     plant         att_1     att_2  fast growing  sunlover  tall  waterlover
0  plant_a      sunlover      tall             0         1     1           0
1  plant_b    waterlover  sunlover             0         1     0           1
2  plant_c  fast growing  sunlover             1         1     0           0
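
For convenience, the four steps above can be put together as one runnable script. The imports and the sample frame are my additions, assuming data shaped like the example in the question. The result is stored in a new name, result, instead of overwriting df.

```python
import numpy as np
import pandas as pd

# Sample data modelled on the question
df = pd.DataFrame({
    'plant': ['plant_a', 'plant_b', 'plant_c'],
    'att_1': ['sunlover', 'waterlover', 'fast growing'],
    'att_2': ['tall', 'sunlover', 'sunlover'],
})

# Step 1: restrict df to the attribute columns
attCols = df[['att_1', 'att_2']]

# Step 2: sorted unique attribute names, used as the new column labels
colVals = pd.Index(np.sort(attCols.stack().unique()))

# Step 3: for one row, mark which of the unique names occur in it
def myDummies(row):
    return pd.Series(colVals.isin(row).astype(int), index=colVals)

# Step 4: apply the function row by row and join the result to df
result = df.join(attCols.apply(myDummies, axis=1))
print(result)
```

Note that apply with axis=1 calls myDummies once per row in Python, which is consistent with the much slower timing reported for this approach in Solution 1.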

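【Discussion】:

The OP's code only prints counts and does not generate any additional columns, so there is nothing to compare my code against. Note that get_dummies on its own produces a result quite different from mine.
Thanks for your answer. It solves my problem, but I will stick with @jezrael's answer. Even on my test set it takes half the time.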