离散化：将连续值转换为一定数量的类别答案

【问题标题】：Discretization : converting continuous values into a certain number of categories离散化：将连续值转换为一定数量的类别
【发布时间】：2021-08-04 12:05:35
【问题描述】：

1   Create a column Usage_Per_Year from Miles_Driven_Per_Year by discretizing the values into three equally sized categories. The names of the categories should be Low, Medium, and High.

2   Group by Usage_Per_Year and print the group sizes as well as the ranges of each.

3   Do the same as in #1, but instead of equally sized categories, create categories that have the same number of points per category.

4   Group by Usage_Per_Year and print the group sizes as well as the ranges of each.

我的代码如下

df["Usage_Per_Year "], bins = pd.cut(df["Miles_Driven_Per_Year"], 3, precision=2, retbins=True)
group_label = pd.Series(["Low", "Medium", "High"])
#3.3.2
group_size = df.groupby("Usage_Per_Year").size()
#print(group_size)
print(group_size.reset_index().set_index(group_label))
#3.3.3
Year2 = pd.cut(df["Miles_Driven_Per_Year"], 3, precision=2)
group_label = pd.Series(["Low", "Medium", "High"])
#3.3.4
group_size = df.groupby("Usage_Per_Year").size()
#print(group_size)
print(group_size.reset_index().set_index(group_label))

输出如下：

               Usage_Per_Year     0 Low       (-1925.883, 663476.235]  6018 Medium  (663476.235, 1326888.118]     0 High     (1326888.118, 1990300.0]     1
               Usage_Per_Year     0 Low       (-1925.883, 663476.235]  6018 Medium  (663476.235, 1326888.118]     0 High     (1326888.118, 1990300.0]     1

但是-1925是错的……

正确的答案应该是这样的。

我该怎么办...

【问题讨论】：

标签： python pandas dataframe bin discretization

【解决方案1】：

可能是第 1 行的拼写错误：df["Usage_Per_Year "]？列名末尾有一个空格。

pd.cut 将值分成相等的大小。这就是为什么您所有的垃圾箱都具有相同的尺寸。看来您应该在分箱后计算每个组的最小值和最大值。

另外，要将值合并到相同的频率，您应该使用pd.qcut。

示例输入：

import numpy as np
import pandas as pd

rng = np.random.default_rng(20210514)
df = pd.DataFrame({
    'Miles_Driven_Per_Year': rng.gamma(1.05, 10000, (1000,)).astype(int)
})

# 1
group_label = ['Low', 'Medium', 'High']
df['Usage_Per_Year'] = pd.cut(df['Miles_Driven_Per_Year'],
                              bins=3, labels=group_label)

# 2
print(df.groupby('Usage_Per_Year').agg(['count', 'min', 'max']))

# 3
df['Usage_Per_Year'] = pd.qcut(df['Miles_Driven_Per_Year'],
                               q=3, labels=group_label)

# 4
print(df.groupby('Usage_Per_Year').agg(['count', 'min', 'max']))

示例输出：

               Miles_Driven_Per_Year              
                               count    min    max
Usage_Per_Year                                    
Low                              878     31  20905
Medium                           107  20955  41196
High                              15  41991  62668
               Miles_Driven_Per_Year              
                               count    min    max
Usage_Per_Year                                    
Low                              334     31   4378
Medium                           333   4449  11424
High                             333  11442  62668

【讨论】：

非常感谢！