Binning by value, except last bin

[Question title]: Binning by value, except last bin
[Posted]: 2016-08-17 22:23:42
[Question description]:

I am trying to bin my data as follows:

pd.cut(df['col'], np.arange(0, 1.2, 0.2), include_lowest=True)

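But I want to make sure that any data greater than 1 is also included in the last bin. I can do this in a few lines of code, but I am wondering whether anyone knows a one-liner / more Pythonic way to do it?

PS - I don't want to do a qcut; I need the bins to be separated by their values, not by the count of records.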
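
To make the problem concrete, here is a minimal sketch of what the call above does with a value greater than 1. The sample data is hypothetical, since the question does not show its DataFrame; the point is only that `pd.cut` returns NaN for values that fall outside every bin.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the question does not show its DataFrame.
df = pd.DataFrame({"col": [0.1, 0.5, 0.95, 1.3]})

# Bin edges 0, 0.2, ..., 1.0, exactly as in the question.
binned = pd.cut(df["col"], np.arange(0, 1.2, 0.2), include_lowest=True)
print(binned)

# The last row (1.3) lies above the top edge of 1.0, so it is not
# assigned to any bin and comes back as NaN.
print(binned.isna().tolist())
```

This is why the last bin needs special handling: either its upper edge has to be widened, or the data has to be capped before binning.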

[Question comments]:

Have you tried pd.cut(..., right=False)?

Tags: python pandas dataframe categorical-data binning

[Answer 1]:

Solution 1: prepare the labels (using the first 5 rows of the DF) and replace 1 with np.inf in the bins parameter:

In [67]: df
Out[67]:
          a         b         c
0  1.698479  0.337989  0.002482
1  0.903344  1.830499  0.095253
2  0.152001  0.439870  0.270818
3  0.621822  0.124322  0.471747
4  0.534484  0.051634  0.854997
5  0.980915  1.065050  0.211227
6  0.809973  0.894893  0.093497
7  0.677761  0.333985  0.349353
8  1.491537  0.622429  1.456846
9  0.294025  1.286364  0.384152

In [68]: labels = pd.cut(df.a.head(), np.arange(0,1.2, 0.2), include_lowest=True).cat.categories

In [69]: pd.cut(df.a, np.append(np.arange(0, 1, 0.2), np.inf), labels=labels, include_lowest=True)
Out[69]:
0      (0.8, 1]
1      (0.8, 1]
2      [0, 0.2]
3    (0.6, 0.8]
4    (0.4, 0.6]
5      (0.8, 1]
6      (0.8, 1]
7    (0.6, 0.8]
8      (0.8, 1]
9    (0.2, 0.4]
Name: a, dtype: category
Categories (5, object): [[0, 0.2] < (0.2, 0.4] < (0.4, 0.6] < (0.6, 0.8] < (0.8, 1]]

Explanation:

In [72]: np.append(np.arange(0, 1, 0.2), np.inf)
Out[72]: array([ 0. ,  0.2,  0.4,  0.6,  0.8,  inf])

In [73]: labels
Out[73]: Index(['[0, 0.2]', '(0.2, 0.4]', '(0.4, 0.6]', '(0.6, 0.8]', '(0.8, 1]'], dtype='object')
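
As a self-contained sketch of Solution 1, the script below rebuilds column `a` from the example above and bins it with an open-ended last edge. Two things here are assumptions rather than part of the answer: the labels are written out by hand as strings instead of being taken from `.cat.categories` (newer pandas versions return interval objects there), and `np.linspace` is used to build the finite edges.

```python
import numpy as np
import pandas as pd

# Column `a` from the example DataFrame above.
a = pd.Series(
    [1.698479, 0.903344, 0.152001, 0.621822, 0.534484,
     0.980915, 0.809973, 0.677761, 1.491537, 0.294025],
    name="a",
)

# Finite edges 0, 0.2, 0.4, 0.6, 0.8, followed by an open-ended last edge.
edges = np.append(np.linspace(0, 0.8, 5), np.inf)

# Hand-written labels (an assumption): one label per bin, 5 bins in total.
labels = ["[0, 0.2]", "(0.2, 0.4]", "(0.4, 0.6]", "(0.6, 0.8]", "(0.8, 1]"]

# Everything above 0.8, including values greater than 1, lands in the last bin.
binned_inf = pd.cut(a, bins=edges, labels=labels, include_lowest=True)
print(binned_inf)
```

Note that the last bin is labelled "(0.8, 1]" although it really covers everything above 0.8; a label such as "> 0.8" may be less misleading if the result is shown to others.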

Solution 2: clip all values greater than 1

In [70]: pd.cut(df.a.clip(upper=1), np.arange(0,1.2, 0.2),include_lowest=True)
Out[70]:
0      (0.8, 1]
1      (0.8, 1]
2      [0, 0.2]
3    (0.6, 0.8]
4    (0.4, 0.6]
5      (0.8, 1]
6      (0.8, 1]
7    (0.6, 0.8]
8      (0.8, 1]
9    (0.2, 0.4]
Name: a, dtype: category
Categories (5, object): [[0, 0.2] < (0.2, 0.4] < (0.4, 0.6] < (0.6, 0.8] < (0.8, 1]]

Explanation:

In [75]: df.a
Out[75]:
0    1.698479
1    0.903344
2    0.152001
3    0.621822
4    0.534484
5    0.980915
6    0.809973
7    0.677761
8    1.491537
9    0.294025
Name: a, dtype: float64

In [76]: df.a.clip(upper=1)
Out[76]:
0    1.000000
1    0.903344
2    0.152001
3    0.621822
4    0.534484
5    0.980915
6    0.809973
7    0.677761
8    1.000000
9    0.294025
Name: a, dtype: float64
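
Solution 2 can be sketched as a standalone script in the same way. It rebuilds column `a` from the example above, caps it at 1 with `Series.clip`, and then bins it with the original edges. The integer category codes are inspected instead of the printed interval labels, because the label formatting differs between pandas versions.

```python
import numpy as np
import pandas as pd

# Column `a` from the example DataFrame above.
a = pd.Series(
    [1.698479, 0.903344, 0.152001, 0.621822, 0.534484,
     0.980915, 0.809973, 0.677761, 1.491537, 0.294025],
    name="a",
)

# Cap values at 1 so they fall on the (inclusive) right edge of the last bin.
clipped = a.clip(upper=1)

# Original edges 0, 0.2, ..., 1.0; no custom labels are needed.
binned_clip = pd.cut(clipped, np.arange(0, 1.2, 0.2), include_lowest=True)

# Position of each row's bin: 0 is the first bin, 4 is the last.
codes = binned_clip.cat.codes.tolist()
print(codes)
```

Clipping changes only the temporary series passed to `pd.cut`; the original column keeps its values above 1. The trade-off compared with the `np.inf` edge is that no labels have to be prepared, at the cost of one extra pass over the data.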

[Comments]:

Awesome! Going with solution 2. Thanks!
@eljusticiero67, you're welcome! :) If this answered your question, please consider accepting/upvoting the answer