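Memory-efficient one-hot encoding in pandas: answers

【Question title】: Memory Efficient Hot Encode Pandas
【Posted】: 2017-05-27 05:14:50
【Question】:

EDIT: Still looking for an answer that works when the two datasets have different columns!

I am trying to one-hot encode a specific column identically in two datasets. The column holds different values in each dataset, so naive one-hot encoding produces different sets of columns. Expected result: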

DATASET A                           
col1    col2    target                  
a        1        1                 
b        2        2                 
c        2        3                 
d        3        3                 

DATASET B                           
col1    col2    target                  
d         2      2                  
h         4      3                  
g         2      2                  
b         3      3                  

After encoding col 1:                           

New dataset A                           

col2    target  a   b   c   d   h   g
1          1    1   0   0   0   0   0
2          2    0   1   0   0   0   0
2          3    0   0   1   0   0   0
3          3    0   0   0   1   0   0

New dataset B                           

col2    target  a   b   c   d   h   g
2          2    0   0   0   1   0   0
4          3    0   0   0   0   1   0
2          2    0   0   0   0   0   1
3          3    0   1   0   0   0   0

The following implementation works, but it is very memory-inefficient and frequently crashes my computer with MemoryErrors.

import pandas as pd

def hot_encode_column_in_both_datasets(column_name, df, df2, sparse=True, drop_first=True):
    print("Hot encoding {} for both datasets".format(column_name))
    # Values of the column that appear in only one of the two datasets.
    cols_in_df_but_not_in_df2 = set(df[column_name]).difference(set(df2[column_name]))
    cols_in_df2_but_not_in_df = set(df2[column_name]).difference(set(df[column_name]))

    # All-zero frames for the values each dataset is missing.
    dummy_df_to_concat_to_df = pd.DataFrame(0, index=df.index, columns=cols_in_df2_but_not_in_df)
    dummy_df_to_concat_to_df2 = pd.DataFrame(0, index=df2.index, columns=cols_in_df_but_not_in_df2)

    # Note: DataFrame.to_sparse() existed in pandas 0.20 but was removed in pandas 1.0.
    dummy_df_to_concat_to_df = dummy_df_to_concat_to_df.to_sparse()
    dummy_df_to_concat_to_df2 = dummy_df_to_concat_to_df2.to_sparse()

    encoded = pd.get_dummies(df[column_name], sparse=sparse)
    encoded = pd.concat([encoded, dummy_df_to_concat_to_df], axis=1)
    encoded_2 = pd.get_dummies(df2[column_name], sparse=sparse)
    encoded_2 = pd.concat([encoded_2, dummy_df_to_concat_to_df2], axis=1)

    encoded_df = pd.concat([df, encoded], axis=1)
    encoded_df2 = pd.concat([df2, encoded_2], axis=1)

    del encoded_df[column_name]
    del encoded_df2[column_name]

    return encoded_df, encoded_df2

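Is there a better way to do this?

Thanks! :)

【Comments】:

Please don't post data or code as images. Doing so discourages people who are willing to help, because they have to type in the sample data by hand.
@HaleemurAli Sorry, I thought it made it look cleaner. I'll fix it now! :)
From your example, it looks like you could just append the datasets, one-hot encode, and then separate them afterward based on the index or a flag variable. Is there a reason this wouldn't work?

Tags: python pandas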

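【Solution 1】:

You can make the column you want to encode a Category dtype column and take advantage of the fact that pandas methods, including get_dummies, respect that such a column may have values that are not observed in any particular DataFrame. This lets you avoid any merge or concatenation of the two DataFrames, and it makes the approach agnostic to whether some column appears in one DataFrame but not the other. See the pandas documentation on Categorical columns.

I am using pandas v0.20.1.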

import numpy as np
import pandas as pd
import string

dfa = pd.DataFrame.from_dict({
    'col1': np.random.choice([ltr for ltr in string.ascii_lowercase[:4]], 5)
    , 'col2b': np.random.choice([1, 2, 3], 5)
    , 'target': np.random.choice([1, 2, 3], 5)
    })

dfb = pd.DataFrame.from_dict({
    'col1': np.random.choice([ltr for ltr in string.ascii_lowercase[2:8]], 7)
    , 'col2b': np.random.choice(['foo', 'bar', 'baz'], 7)
    , 'target': np.random.choice([1, 2, 3], 7)
    })

dfa:

  col1  col2b  target
0    b      3       1
1    d      3       3
2    b      3       3
3    a      2       3
4    c      1       3

dfb:

  col1 col2b  target
0    g   foo       2
1    c   bar       1
2    h   baz       3
3    c   baz       3
4    d   baz       3
5    d   bar       2
6    d   foo       3

Find the union of the col1 values observed in the two DataFrames:

col1b = set(dfb.col1.unique())
col1a = set(dfa.col1.unique())
combined_cats = list(col1a.union(col1b))

Define the allowed values of col1 on both DataFrames:

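# Use these statements if `col1` is already a 'category' dtype.
# dfa['col1'] = dfa.col1.cat.set_categories(combined_cats)
# dfb['col1'] = dfb.col1.cat.set_categories(combined_cats)
# Otherwise, use these statements.
# (pandas 0.20 accepted astype('category', categories=combined_cats); that form
# was deprecated in 0.21 and later removed, so CategoricalDtype is used here.)
cat_dtype = pd.api.types.CategoricalDtype(categories=combined_cats)
dfa['col1'] = dfa.col1.astype(cat_dtype)
dfb['col1'] = dfb.col1.astype(cat_dtype)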

newdfa = pd.get_dummies(dfa, columns=['col1'])
newdfb = pd.get_dummies(dfb, columns=['col1'])

newdfa:

   col2b  target  col1_g  col1_b  col1_c  col1_d  col1_h  col1_a
0      3       1       0       1       0       0       0       0
1      3       3       0       0       0       1       0       0
2      3       3       0       1       0       0       0       0
3      2       3       0       0       0       0       0       1
4      1       3       0       0       1       0       0       0

newdfb:

  col2b  target  col1_g  col1_b  col1_c  col1_d  col1_h  col1_a
0   foo       2       1       0       0       0       0       0
1   bar       1       0       0       1       0       0       0
2   baz       3       0       0       0       0       1       0
3   baz       3       0       0       1       0       0       0
4   baz       3       0       0       0       1       0       0
5   bar       2       0       0       0       1       0       0
6   foo       3       0       0       0       1       0       0
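Since the question is about memory, the categorical approach above can be combined with sparse output. The following is a minimal sketch, not the answerer's code: the helper name `encode` and the variable `shared_dtype` are made up for illustration, the data is the example from the question, and it assumes a pandas version recent enough to provide `CategoricalDtype(categories=...)` and the `dtype` argument of `get_dummies`. The category list is sorted only to make the column order predictable.

```python
import numpy as np
import pandas as pd
from pandas.api.types import CategoricalDtype

# Example data taken from the question.
dfa = pd.DataFrame({"col1": list("abcd"), "col2": [1, 2, 2, 3], "target": [1, 2, 3, 3]})
dfb = pd.DataFrame({"col1": list("dhgb"), "col2": [2, 4, 2, 3], "target": [2, 3, 2, 3]})

# One shared category list, so both frames produce identical dummy columns.
combined_cats = sorted(set(dfa["col1"]) | set(dfb["col1"]))
shared_dtype = CategoricalDtype(categories=combined_cats)


def encode(df, column, dtype):
    """Hypothetical helper: one-hot encode `column` using a fixed category set.

    The dummy columns are stored as sparse uint8, so the all-zero columns for
    categories that a frame never observes cost almost no memory.
    """
    as_cat = df.assign(**{column: df[column].astype(dtype)})
    return pd.get_dummies(as_cat, columns=[column], sparse=True, dtype=np.uint8)


newdfa = encode(dfa, "col1", shared_dtype)
newdfb = encode(dfb, "col1", shared_dtype)
```

Both results carry the same `col1_*` columns regardless of which values each frame actually contains, and neither frame is ever concatenated with the other. Comparing `newdfa.memory_usage(deep=True)` with a dense encoding on real data is a reasonable way to check whether the sparse storage pays off.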

【Comments】:

【Solution 2】:

Based on your description, this can be done by simply appending the dataframes before one-hot encoding.

# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result.
combined = pd.concat([a, b], ignore_index=True)
combinedDummies = pd.get_dummies(combined, columns=['col1'])

newA = combinedDummies.iloc[0:a.shape[0]]
newB = combinedDummies.iloc[a.shape[0]:]

newA
#    col2  target  col1_a  col1_b  col1_c  col1_d  col1_g  col1_h
# 0     1       1       1       0       0       0       0       0
# 1     2       2       0       1       0       0       0       0
# 2     2       3       0       0       1       0       0       0
# 3     3       3       0       0       0       1       0       0


newB
#    col2  target  col1_a  col1_b  col1_c  col1_d  col1_g  col1_h
# 4     2       2       0       0       0       1       0       0
# 5     4       3       0       0       0       0       0       1
# 6     2       2       0       0       0       0       1       0
# 7     3       3       0       1       0       0       0       0

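【Comments】:

Hey, I just realized that this won't work if the two datasets have some different columns.
Right. But conceptually, you don't need all the columns to create the dummy variables. You can append just the relevant column from each dataset, without dropping the index. I would include a flag variable indicating which dataset each row belongs to. Then you can merge the other variables back in afterward, based on the flag and the index.
Alternatively, you can create a "master" encoding and then reindex each dataframe separately before merging. See: stackoverflow.com/questions/41492300/…
Do you mean appending only the column in question?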
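
The suggestion in the comments above (append only the relevant column, flag the source dataset, then merge back on the index) could look roughly like the sketch below. It is one possible reading of that comment rather than the commenter's code: the function name `encode_shared_column` is hypothetical, the `keys` argument of `pd.concat` plays the role of the flag variable, the `extra` column is invented to show frames with different columns, and each frame's index is assumed to be unique.

```python
import pandas as pd


def encode_shared_column(df_a, df_b, column):
    """Hypothetical helper: encode `column` identically in two frames
    whose remaining columns may differ. Assumes each index is unique."""
    # Stack only the column to encode. `keys` adds an outer index level
    # that flags which dataset each row came from.
    stacked = pd.concat([df_a[column], df_b[column]], keys=["a", "b"])
    dummies = pd.get_dummies(stacked, prefix=column, dtype="uint8")

    # Select each dataset's rows by flag (this restores the original index),
    # then join the dummies back onto the untouched remaining columns.
    out_a = df_a.drop(columns=column).join(dummies.xs("a"))
    out_b = df_b.drop(columns=column).join(dummies.xs("b"))
    return out_a, out_b


# Frames that share `col1` and `target` but otherwise differ.
a = pd.DataFrame({"col1": list("abcd"), "col2": [1, 2, 2, 3], "target": [1, 2, 3, 3]})
b = pd.DataFrame({"col1": list("dhgb"), "extra": ["w", "x", "y", "z"], "target": [2, 3, 2, 3]})

new_a, new_b = encode_shared_column(a, b, "col1")
```

Because only the single column is concatenated, the other columns of the two frames never need to line up, which addresses the objection raised in the first comment.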