【Question title】: How to create bins and assign labels based on a given condition in pandas
【Posted】: 2021-09-23 09:44:58
【Question】:

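I have a pandas DataFrame, and I want to create new columns (new1, new2, new3, new4, new5, new6, new7, new8, new9, new10) from the original columns (A-J). Each new column is created by assigning the values in the corresponding original column to bins.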

Bins to assign:

0
1.0 - 1.19
1.2 – 1.39
1.4 – 1.59
1.6 – 1.79
1.8 – 1.99
2.0 – 2.99
3.0 – 3.99
4.0 – 4.99
5.0 – 5.99
6.0 – 6.99
7.0 – 7.99
8.0 – 8.99
9.0 – 9.99
10.0+

DataFrame:

    A         B       C        D      E      F       G       H       I       J
0   34.00   5.18    19.10   70.00   0.00    1.10    3.30    2.96    0.00    1.45
1   3.65    7.20    0.00    3.15    14.00   0.00    8.20    13.50   0.00    0.00
2   6.60    12.00   104.94  3.60    24.00   6.54    23.00   4.83    0.00    0.00
3   2.16    5.30    0.00    8.40    31.48   0.00    6.40    26.44   0.00    0.00
4   4.66    268.86  160.00  19.14   1.58    17.77   47.57   88.01   0.00    0.00
5   7.40    2.64    0.00    4.90    3.53    0.00    46.00   0.00    0.00    0.00
6   6.18    60.00   8.06    13.50   6.20    3.05    9.00    38.90   0.00    0.00
7   6.51    5.30    6.20    17.00   27.00   15.00   4.51    11.50   0.00    0.00
8   3.25    40.72   4.37    8.02    0.00    110.00  16.25   3.95    75.00   0.00

The new columns added to the DataFrame should look like this. Expected output format:

new1        new2        new3   new4        new5   new6        new7        new8        new9   new10

10+         5.0–5.99    10+    10+         0.00   1.0-1.19    3.0–3.99    2.0–2.99    0.00   1.4–1.59
3.0–3.99    7.0–7.99    0.00   3.0–3.99    10+    0.00        8.0–8.99    10+         0.00   0.00

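I tried using the pd.cut function in pandas to create the new DataFrame and assign labels to the bins. But I ended up with close to 30 bins instead of 15. I also tried assigning labels to the bins, but got an error in the process: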

labels = ['0','1.0 - 1.19' ,'1.2 – 1.39','1.4 – 1.59', '1.6 – 1.79','1.8 – 1.99','2.0 – 2.99','3.0 – 3.99','4.0 – 4.99','5.0 – 5.99',\
          '6.0 – 6.99','7.0 – 7.99','8.0 – 8.99','9.0 – 9.99','10.0+']
bins = [0,1.0,1.19,1.2,1.39,1.4,1.59,1.6,1.79,1.8,1.99,2.0,2.99,3.0,3.99,4.0,4.99,5.0,5.99,6.0,6.99,7.0,7.99,8.0,8.99,9.0,9.99,10.0,np.inf]

df[['new1','new2','new3','new4','new5','new6','new7','new8','new9','new10']] = \
                          df[['A', 'B', 'C','D','E','F','G','H','I','J']].apply(lambda x: \
                         pd.cut(x,bins),axis=1).apply(lambda x: x.cat.add_categories([0.00])).fillna(0.00)
     

ValueError: Bin labels must be one fewer than the number of bin edges

【Comments】:

    Tags: python pandas dataframe


    【Solution 1】:

    The solution involves editing your bins list:

    import numpy as np
    import pandas as pd

    # Same labels as yours
    labels = ['0', '1.0 - 1.19', '1.2 – 1.39', '1.4 – 1.59', '1.6 – 1.79', 
              '1.8 – 1.99', '2.0 – 2.99', '3.0 – 3.99', '4.0 – 4.99', '5.0 – 5.99', 
              '6.0 – 6.99', '7.0 – 7.99', '8.0 – 8.99', '9.0 – 9.99', '10.0+']
    
    # Define the edges between bins
    bins = [0, 1.0, 1.2, 1.4, 1.6, 1.8, 2.0, 3.0, 4.0, 
            5.0, 6.0, 7.0, 8.0, 9.0, 10.0, np.inf]
    
    # pd.cut each column, with each bin closed on left and open on right
    res = df.apply(lambda x: pd.cut(x, bins=bins, labels=labels, right=False))
    
    # rename columns and print result
    res.columns = [f'new{i+1}' for i in range(df.shape[1])]
    
    print(res)
    
             new1        new2        new3        new4        new5        new6        new7        new8   new9       new10
    0       10.0+  5.0 – 5.99       10.0+       10.0+           0  1.0 - 1.19  3.0 – 3.99  2.0 – 2.99      0  1.4 – 1.59
    1  3.0 – 3.99  7.0 – 7.99           0  3.0 – 3.99       10.0+           0  8.0 – 8.99       10.0+      0           0
    2  6.0 – 6.99       10.0+       10.0+  3.0 – 3.99       10.0+  6.0 – 6.99       10.0+  4.0 – 4.99      0           0
    3  2.0 – 2.99  5.0 – 5.99           0  8.0 – 8.99       10.0+           0  6.0 – 6.99       10.0+      0           0
    4  4.0 – 4.99       10.0+       10.0+       10.0+  1.4 – 1.59       10.0+       10.0+       10.0+      0           0
    5  7.0 – 7.99  2.0 – 2.99           0  4.0 – 4.99  3.0 – 3.99           0       10.0+           0      0           0
    6  6.0 – 6.99       10.0+  8.0 – 8.99       10.0+  6.0 – 6.99  3.0 – 3.99  9.0 – 9.99       10.0+      0           0
    7  6.0 – 6.99  5.0 – 5.99  6.0 – 6.99       10.0+       10.0+       10.0+  4.0 – 4.99       10.0+      0           0
    8  3.0 – 3.99       10.0+  4.0 – 4.99  8.0 – 8.99           0       10.0+       10.0+  3.0 – 3.99  10.0+           0
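
The snippet above assumes df already exists and leaves the labels in a separate frame. As a minimal sketch, assuming only the first three rows of the question's data, the following builds df, bins each column, and attaches the label columns next to the original values with pd.concat. The names binned and out are illustrative and do not come from the original answer.

```python
import numpy as np
import pandas as pd

# First three rows of the question's data, columns A-J
df = pd.DataFrame(
    [
        [34.00, 5.18, 19.10, 70.00, 0.00, 1.10, 3.30, 2.96, 0.00, 1.45],
        [3.65, 7.20, 0.00, 3.15, 14.00, 0.00, 8.20, 13.50, 0.00, 0.00],
        [6.60, 12.00, 104.94, 3.60, 24.00, 6.54, 23.00, 4.83, 0.00, 0.00],
    ],
    columns=list("ABCDEFGHIJ"),
)

labels = ['0', '1.0 - 1.19', '1.2 – 1.39', '1.4 – 1.59', '1.6 – 1.79',
          '1.8 – 1.99', '2.0 – 2.99', '3.0 – 3.99', '4.0 – 4.99', '5.0 – 5.99',
          '6.0 – 6.99', '7.0 – 7.99', '8.0 – 8.99', '9.0 – 9.99', '10.0+']
bins = [0, 1.0, 1.2, 1.4, 1.6, 1.8, 2.0, 3.0, 4.0,
        5.0, 6.0, 7.0, 8.0, 9.0, 10.0, np.inf]

# Bin every column; right=False makes each bin closed on the left
binned = df.apply(lambda col: pd.cut(col, bins=bins, labels=labels, right=False))
binned.columns = [f"new{i + 1}" for i in range(binned.shape[1])]

# Put the label columns side by side with the original values
out = pd.concat([df, binned], axis=1)
print(out[["A", "new1", "J", "new10"]])
```

Note that with these edges any value in [0, 1.0) receives the label '0', not only exact zeros.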
    

    Explanation

    The sequence of scalars passed to pd.cut() as bins "defines the bin edges allowing for non-uniform width": https://pandas.pydata.org/docs/reference/api/pandas.cut.html
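
To see why the question's edge list triggers the ValueError, it may help to count: n edges define n - 1 bins, so labels needs exactly n - 1 entries. The sketch below compares the two edge lists; question_bins, answer_bins, and the sample values are illustrative names and data chosen for this check.

```python
import numpy as np
import pandas as pd

labels = ['0', '1.0 - 1.19', '1.2 – 1.39', '1.4 – 1.59', '1.6 – 1.79',
          '1.8 – 1.99', '2.0 – 2.99', '3.0 – 3.99', '4.0 – 4.99', '5.0 – 5.99',
          '6.0 – 6.99', '7.0 – 7.99', '8.0 – 8.99', '9.0 – 9.99', '10.0+']

# Edge list from the question: lower and upper end of every range
question_bins = [0, 1.0, 1.19, 1.2, 1.39, 1.4, 1.59, 1.6, 1.79, 1.8, 1.99,
                 2.0, 2.99, 3.0, 3.99, 4.0, 4.99, 5.0, 5.99, 6.0, 6.99,
                 7.0, 7.99, 8.0, 8.99, 9.0, 9.99, 10.0, np.inf]
# Edge list from the answer: one edge per boundary between bins
answer_bins = [0, 1.0, 1.2, 1.4, 1.6, 1.8, 2.0, 3.0, 4.0,
               5.0, 6.0, 7.0, 8.0, 9.0, 10.0, np.inf]

n_question_bins = len(question_bins) - 1  # intervals implied by the question's edges
n_answer_bins = len(answer_bins) - 1      # intervals implied by the answer's edges
print(n_question_bins, n_answer_bins, len(labels))

values = pd.Series([0.0, 1.1, 2.96, 34.0])
error_message = None
try:
    # The label count does not match the number of intervals, so this raises
    pd.cut(values, bins=question_bins, labels=labels, right=False)
except ValueError as exc:
    error_message = str(exc)
print(error_message)

# With one edge per boundary the counts agree and the call succeeds
ok = pd.cut(values, bins=answer_bins, labels=labels, right=False)
print(ok.tolist())
```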

    By default, each bin is open on the left and closed on the right. To switch this, pass right=False (which also closes the left edge of each bin).
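
The difference shows up only for values that sit exactly on an edge. A small sketch, assuming a made-up edge list and sample values, of how the same inputs land in different bins depending on right:

```python
import pandas as pd

edges = [0, 1.0, 2.0, 3.0]          # illustrative edges, not the question's
on_edge = pd.Series([1.0, 2.0])     # values that sit exactly on an edge

# Default (right=True): intervals are (0, 1], (1, 2], (2, 3]
default_cut = pd.cut(on_edge, bins=edges)
# right=False: intervals are [0, 1), [1, 2), [2, 3)
left_closed_cut = pd.cut(on_edge, bins=edges, right=False)

print(default_cut.tolist())
print(left_closed_cut.tolist())

# A value equal to the lowest edge is excluded by the default setting
# (it becomes NaN) but is included when right=False
zero = pd.Series([0.0])
zero_default = pd.cut(zero, bins=edges)
zero_left = pd.cut(zero, bins=edges, right=False)
print(zero_default.isna().tolist(), zero_left.tolist())
```

This matters for the question's data because many cells are exactly 0.00: with the default setting and a lowest edge of 0, they would come back as NaN unless include_lowest=True is passed.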

    For example, with right=False, bins=[0, 1.0, 1.19, 1.2] makes pd.cut produce 3 intervals: [0.0, 1.0) < [1.0, 1.19) < [1.19, 1.2)
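
This can be checked directly by inspecting the categories pd.cut returns. In the sketch below the sample values are made up; 1.195 is chosen to show that listing both ends of a range creates an extra "gap" bin between 1.19 and 1.2.

```python
import pandas as pd

sample = pd.Series([0.5, 1.1, 1.195])  # illustrative values
result = pd.cut(sample, bins=[0, 1.0, 1.19, 1.2], right=False)

# Four edges give three left-closed intervals
categories = result.cat.categories
print(categories)

# 1.195 lands in the narrow interval between the two listed range ends
gap_interval = result.iloc[2]
print(gap_interval)
```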

    【Comments】:
