[Question title]: Pandas - Convert a categorical column to binary encoded form
[Posted]: 2017-07-31 12:40:06
[Question]:

I have a dataset that looks like this:

     yyyy      month        tmax         tmin
0    1908    January         5.0         -1.4
1    1908   February         7.3          1.9
2    1908      March         6.2          0.3
3    1908      April         7.4          2.1
4    1908        May        16.5          7.7
5    1908       June        17.7          8.7
6    1908       July        20.1         11.0
7    1908     August        17.5          9.7
8    1908  September        16.3          8.4
9    1908    October        14.6          8.0
10   1908   November         9.6          3.4
11   1908   December         5.8         -0.3
12   1909    January         5.0          0.1
13   1909   February         5.5         -0.3
14   1909      March         5.6         -0.3
15   1909      April        12.2          3.3
16   1909        May        14.7          4.8
17   1909       June        15.0          7.5
18   1909       July        17.3         10.8
19   1909     August        18.8         10.7
20   1909  September        14.5          8.1
21   1909    October        12.9          6.9
22   1909   November         7.5          1.7
23   1909   December         5.3          0.4
24   1910    January         5.2         -0.5
...

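It has four variables: yyyy, month, tmax (maximum temperature) and tmin.

I want to use the month column as a variable when making predictions, so I would like to convert it to its binary encoded version. Essentially, I want to add twelve variables to the dataset, named January through December. If the month of a particular row is "January", the column January should be marked 1 and the other eleven newly added columns should be 0.

I looked at pivot tables, but they did not help in my case. Any ideas on how to do this in a simple, elegant way?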

[Comments on the question]:

    Tags: python pandas


    [Solution 1]:

    I think you need get_dummies:

    # Assign to a new name so that df keeps its 'month' column for the snippets that follow
    df1 = pd.get_dummies(df['month'])
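
To try this without the full dataset, here is a minimal sketch on a three-row frame. The values are copied from the question; the names `sample` and `dummies` are mine, not the asker's. Recent pandas versions return boolean columns by default, so `dtype=int` is passed to get the 0/1 form shown in this answer.

```python
import pandas as pd

# Hypothetical miniature of the question's data (first three rows)
sample = pd.DataFrame({
    "yyyy": [1908, 1908, 1908],
    "month": ["January", "February", "March"],
    "tmax": [5.0, 7.3, 6.2],
    "tmin": [-1.4, 1.9, 0.3],
})

# One indicator column per distinct month, in sorted (alphabetical) order;
# dtype=int asks for 0/1 integers instead of booleans
dummies = pd.get_dummies(sample["month"], dtype=int)
print(dummies)
```

Each row has exactly one 1, in the column named after that row's month.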
    

    If you need to add the new columns to the original DataFrame and remove month, use join together with pop:

    df2 = df.join(pd.get_dummies(df.pop('month')))
    print (df2.head())
       yyyy  tmax  tmin  April  August  December  February  January  July  June  \
    0  1908   5.0  -1.4      0       0         0         0        1     0     0   
    1  1908   7.3   1.9      0       0         0         1        0     0     0   
    2  1908   6.2   0.3      0       0         0         0        0     0     0   
    3  1908   7.4   2.1      1       0         0         0        0     0     0   
    4  1908  16.5   7.7      0       0         0         0        0     0     0   
    
       March  May  November  October  September  
    0      0    0         0        0          0  
    1      0    0         0        0          0  
    2      1    0         0        0          0  
    3      0    0         0        0          0  
    4      0    1         0        0          0  
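
One thing to keep in mind is that pop removes the column from the frame it is called on, so after the line above `df` itself no longer has `month`. If that side effect is unwanted, a possible non-mutating variant uses `drop`. This is a sketch on a hypothetical two-row frame; `sample`, `work`, `encoded` and `popped` are illustrative names.

```python
import pandas as pd

sample = pd.DataFrame({
    "yyyy": [1908, 1908],
    "month": ["January", "February"],
    "tmax": [5.0, 7.3],
})

# Non-mutating variant: drop() returns a new frame, so `sample` keeps 'month'
encoded = sample.drop(columns="month").join(
    pd.get_dummies(sample["month"], dtype=int)
)

# The pop() variant removes 'month' from the frame it is called on,
# so it is run on a copy here to leave `sample` alone
work = sample.copy()
popped = work.join(pd.get_dummies(work.pop("month"), dtype=int))

print(encoded.columns.tolist())
print("month" in sample.columns, "month" in work.columns)
```

Both variants give the same columns; they differ only in whether the source frame is modified.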
    

    If you do not need to remove the month column:

    df2 = df.join(pd.get_dummies(df['month']))
    print (df2.head())
       yyyy     month  tmax  tmin  April  August  December  February  January  \
    0  1908   January   5.0  -1.4      0       0         0         0        1   
    1  1908  February   7.3   1.9      0       0         0         1        0   
    2  1908     March   6.2   0.3      0       0         0         0        0   
    3  1908     April   7.4   2.1      1       0         0         0        0   
    4  1908       May  16.5   7.7      0       0         0         0        0   
    
       July  June  March  May  November  October  September  
    0     0     0      0    0         0        0          0  
    1     0     0      0    0         0        0          0  
    2     0     0      1    0         0        0          0  
    3     0     0      0    0         0        0          0  
    4     0     0      0    1         0        0          0  
    

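    If the columns need to be in calendar order, there are more possible solutions. Use reindex, either with axis=1 or with columns= (the original answer used reindex_axis, which was deprecated in pandas 0.21 and removed in 1.0):

    months = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December']
    
    # reindex(months, axis=1) replaces the removed reindex_axis(months, 1)
    df1 = pd.get_dummies(df['month']).reindex(months, axis=1)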
    print (df1.head())
       January  February  March  April  May  June  July  August  September  \
    0        1         0      0      0    0     0     0       0          0   
    1        0         1      0      0    0     0     0       0          0   
    2        0         0      1      0    0     0     0       0          0   
    3        0         0      0      1    0     0     0       0          0   
    4        0         0      0      0    1     0     0       0          0   
    
       October  November  December  
    0        0         0         0  
    1        0         0         0  
    2        0         0         0  
    3        0         0         0  
    4        0         0         0  
    
    df1 = pd.get_dummies(df['month']).reindex(columns=months)
    print (df1.head())
       January  February  March  April  May  June  July  August  September  \
    0        1         0      0      0    0     0     0       0          0   
    1        0         1      0      0    0     0     0       0          0   
    2        0         0      1      0    0     0     0       0          0   
    3        0         0      0      1    0     0     0       0          0   
    4        0         0      0      0    1     0     0       0          0   
    
       October  November  December  
    0        0         0         0  
    1        0         0         0  
    2        0         0         0  
    3        0         0         0  
    4        0         0         0  
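
A side note on reindex: if some months never occur in the data, reindex still creates their columns but fills them with NaN unless told otherwise. A minimal sketch, assuming a hypothetical series `partial` that contains only two of the twelve months:

```python
import pandas as pd

months = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]

# Hypothetical data in which only January and March appear
partial = pd.Series(["March", "January", "March"], name="month")

# fill_value=0 makes the columns for unseen months all zeros instead of NaN
ordered = pd.get_dummies(partial, dtype=int).reindex(columns=months, fill_value=0)

print(ordered.columns.tolist())
print(ordered[["January", "March", "December"]])
```

This keeps the column layout identical across datasets, which matters when the same model is fed data from different periods.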
    

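    Or convert the month column to an ordered categorical. The original form, astype('category', categories=months, ordered=True), was deprecated in pandas 0.21 and later removed, so a CategoricalDtype is passed instead:

    df1 = pd.get_dummies(df['month'].astype(pd.api.types.CategoricalDtype(categories=months, ordered=True)))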
    print (df1.head())
       January  February  March  April  May  June  July  August  September  \
    0        1         0      0      0    0     0     0       0          0   
    1        0         1      0      0    0     0     0       0          0   
    2        0         0      1      0    0     0     0       0          0   
    3        0         0      0      1    0     0     0       0          0   
    4        0         0      0      0    1     0     0       0          0   
    
       October  November  December  
    0        0         0         0  
    1        0         0         0  
    2        0         0         0  
    3        0         0         0  
    4        0         0         0  
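
The categorical route has two side effects worth seeing on a small case: get_dummies emits a column for every declared category, even unseen ones, and a value that is not among the categories becomes missing and ends up as an all-zero row. A sketch with a hypothetical series `partial` that contains a deliberate misspelling:

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

months = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]
month_type = CategoricalDtype(categories=months, ordered=True)

# "Janury" is a deliberate typo; it is not one of the declared categories
partial = pd.Series(["March", "January", "Janury"], name="month")

# Values outside the categories turn into NaN during the cast
as_cat = partial.astype(month_type)
dummies = pd.get_dummies(as_cat, dtype=int)

print(list(dummies.columns))
print(dummies.sum(axis=1).tolist())  # a row summing to 0 marks an unrecognised value
```

Checking the row sums is a cheap way to catch spelling problems before the encoded columns go into a model.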
    

    [Comments]:

      [Solution 2]:

      IIUC,

      You can use assign with the ** unpacking operator and pd.get_dummies:

      df.assign(**pd.get_dummies(df['month']))
      

      Output:

          yyyy      month  tmax  tmin  April  August  December  February  January  \
      0   1908    January   5.0  -1.4      0       0         0         0        1   
      1   1908   February   7.3   1.9      0       0         0         1        0   
      2   1908      March   6.2   0.3      0       0         0         0        0   
      3   1908      April   7.4   2.1      1       0         0         0        0   
      4   1908        May  16.5   7.7      0       0         0         0        0   
      5   1908       June  17.7   8.7      0       0         0         0        0   
      6   1908       July  20.1  11.0      0       0         0         0        0   
      7   1908     August  17.5   9.7      0       1         0         0        0   
      8   1908  September  16.3   8.4      0       0         0         0        0   
      9   1908    October  14.6   8.0      0       0         0         0        0   
      10  1908   November   9.6   3.4      0       0         0         0        0   
      11  1908   December   5.8  -0.3      0       0         1         0        0   
      12  1909    January   5.0   0.1      0       0         0         0        1   
      13  1909   February   5.5  -0.3      0       0         0         1        0   
      14  1909      March   5.6  -0.3      0       0         0         0        0   
      15  1909      April  12.2   3.3      1       0         0         0        0   
      16  1909        May  14.7   4.8      0       0         0         0        0   
      17  1909       June  15.0   7.5      0       0         0         0        0   
      18  1909       July  17.3  10.8      0       0         0         0        0   
      19  1909     August  18.8  10.7      0       1         0         0        0   
      20  1909  September  14.5   8.1      0       0         0         0        0   
      21  1909    October  12.9   6.9      0       0         0         0        0   
      22  1909   November   7.5   1.7      0       0         0         0        0   
      23  1909   December   5.3   0.4      0       0         1         0        0   
      24  1910    January   5.2  -0.5      0       0         0         0        1   
      
          July  June  March  May  November  October  September  
      0      0     0      0    0         0        0          0  
      1      0     0      0    0         0        0          0  
      2      0     0      1    0         0        0          0  
      3      0     0      0    0         0        0          0  
      4      0     0      0    1         0        0          0  
      5      0     1      0    0         0        0          0  
      6      1     0      0    0         0        0          0  
      7      0     0      0    0         0        0          0  
      8      0     0      0    0         0        0          1  
      9      0     0      0    0         0        1          0  
      10     0     0      0    0         1        0          0  
      11     0     0      0    0         0        0          0  
      12     0     0      0    0         0        0          0  
      13     0     0      0    0         0        0          0  
      14     0     0      1    0         0        0          0  
      15     0     0      0    0         0        0          0  
      16     0     0      0    1         0        0          0  
      17     0     1      0    0         0        0          0  
      18     1     0      0    0         0        0          0  
      19     0     0      0    0         0        0          0  
      20     0     0      0    0         0        0          1  
      21     0     0      0    0         0        1          0  
      22     0     0      0    0         1        0          0  
      23     0     0      0    0         0        0          0  
      24     0     0      0    0         0        0          0 
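
For anyone comparing this with the join approach, here is a small sketch on a hypothetical two-row frame (`sample`, `dummies`, `assigned` and `joined` are illustrative names). The `**` unpacking passes each dummy column to assign as a keyword argument, so it only works when every column label is a string, which holds for month names.

```python
import pandas as pd

sample = pd.DataFrame({
    "yyyy": [1908, 1908],
    "month": ["January", "February"],
    "tmax": [5.0, 7.3],
})
dummies = pd.get_dummies(sample["month"], dtype=int)

# ** unpacks the dummy frame into keyword arguments, one per column,
# so every column label must be a string
assigned = sample.assign(**dummies)

# join yields the same columns without going through keyword arguments
joined = sample.join(dummies)

print(assigned.columns.tolist())
print(sample.columns.tolist())  # assign returned a new frame; sample is unchanged
```

Since assign returns a new DataFrame, the result has to be bound to a name to be kept.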
      

      [Comments]:
