【Question title】: Discretization: converting continuous values into a certain number of categories
【Posted】: 2021-08-04 12:05:35
【Question】:
1   Create a column Usage_Per_Year from Miles_Driven_Per_Year by discretizing the values into three equally sized categories. The names of the categories should be Low, Medium, and High.

2   Group by Usage_Per_Year and print the group sizes as well as the ranges of each.

3   Do the same as in #1, but instead of equally sized categories, create categories that have the same number of points per category.

4   Group by Usage_Per_Year and print the group sizes as well as the ranges of each.

My code is as follows:

df["Usage_Per_Year "], bins = pd.cut(df["Miles_Driven_Per_Year"], 3, precision=2, retbins=True)
group_label = pd.Series(["Low", "Medium", "High"])
#3.3.2
group_size = df.groupby("Usage_Per_Year").size()
#print(group_size)
print(group_size.reset_index().set_index(group_label))
#3.3.3
Year2 = pd.cut(df["Miles_Driven_Per_Year"], 3, precision=2)
group_label = pd.Series(["Low", "Medium", "High"])
#3.3.4
group_size = df.groupby("Usage_Per_Year").size()
#print(group_size)
print(group_size.reset_index().set_index(group_label))

The output is as follows:

                   Usage_Per_Year     0
Low       (-1925.883, 663476.235]  6018
Medium  (663476.235, 1326888.118]     0
High     (1326888.118, 1990300.0]     1

                   Usage_Per_Year     0
Low       (-1925.883, 663476.235]  6018
Medium  (663476.235, 1326888.118]     0
High     (1326888.118, 1990300.0]     1

But -1925 is wrong...

The correct answer should look like this.

What should I do?

【Question discussion】:

    Tags: python pandas dataframe bin discretization


    【Solution 1】:

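    There may be a typo on line 1: df["Usage_Per_Year "]? The column name ends with a space, so the later groupby("Usage_Per_Year") does not refer to the column you just created.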
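To see why the trailing space matters, here is a minimal sketch on made-up data (the three mileage values are assumptions for illustration, not the asker's dataset). It creates the column with the misspelled key and then tries to group by the intended name.

```python
import pandas as pd

# Toy data; the values are invented for illustration only.
df = pd.DataFrame({"Miles_Driven_Per_Year": [5000, 12000, 30000]})

# Same assignment as in the question, with the trailing space in the key.
df["Usage_Per_Year "] = pd.cut(df["Miles_Driven_Per_Year"], 3)

# The stored column name keeps the space.
print(df.columns.tolist())

# Grouping by the name without the space cannot find the column.
try:
    df.groupby("Usage_Per_Year").size()
    lookup_failed = False
except KeyError:
    lookup_failed = True

print(lookup_failed)
```

If the code in the question ran without an error, a column named Usage_Per_Year (no space) presumably already existed from an earlier cell, which would mean the groupby was reading stale data.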

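    pd.cut splits the value range into bins of equal width. That is why all of your bins have the same width, and why the printed intervals are bin edges rather than the data actually observed in each group. The lowest edge is pushed slightly below the minimum so that the minimum falls inside the first bin, which is where the negative number comes from. It looks like you should compute the minimum and maximum of each group after binning.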
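A small sketch of that behaviour, again on made-up numbers (the series below, including the single large outlier, is an assumption chosen to resemble the question's situation). It returns the bin edges with retbins=True and then computes the observed range per group.

```python
import pandas as pd

# Invented mileage values with one large outlier.
miles = pd.Series([100, 5_000, 12_000, 20_000, 1_000_000])

# Equal-width binning; retbins=True also returns the computed edges.
binned, edges = pd.cut(
    miles, bins=3, labels=["Low", "Medium", "High"], retbins=True
)

# The first edge lies below the smallest value in the data.
print(edges)
print(miles.min())

# Observed count, minimum and maximum per category, taken from the data
# itself rather than from the bin edges.
summary = miles.groupby(binned, observed=False).agg(["count", "min", "max"])
print(summary)
```

The summary reports the real smallest and largest mileage in each category, so no negative value appears. A category with no points shows a count of 0 and missing min and max.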

    Also, to bin the values so that each category holds the same number of points (equal frequency), you should use pd.qcut.
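The difference between the two functions is easiest to see on skewed data. The following is a minimal sketch using an invented series of ten values with one outlier; the variable names are hypothetical.

```python
import pandas as pd

# Nine small values and one outlier (made-up data).
miles = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 1000])
labels = ["Low", "Medium", "High"]

# Equal-width bins: the outlier stretches the range.
equal_width = pd.cut(miles, bins=3, labels=labels)

# Equal-frequency bins: edges are placed at quantiles of the data.
equal_count = pd.qcut(miles, q=3, labels=labels)

width_counts = equal_width.value_counts(sort=False)
count_counts = equal_count.value_counts(sort=False)

print(width_counts)
print(count_counts)
```

With pd.cut almost every point lands in Low, much like the 6018 / 0 / 1 split in the question, while pd.qcut spreads the points as evenly as ten values allow.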


    Sample input:

    import numpy as np
    import pandas as pd
    
    rng = np.random.default_rng(20210514)
    df = pd.DataFrame({
        'Miles_Driven_Per_Year': rng.gamma(1.05, 10000, (1000,)).astype(int)
    })
    
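    # 1: three equal-width categories
    group_label = ['Low', 'Medium', 'High']
    df['Usage_Per_Year'] = pd.cut(df['Miles_Driven_Per_Year'],
                                  bins=3, labels=group_label)
    
    # 2: group size and observed range (min, max) per category
    print(df.groupby('Usage_Per_Year').agg(['count', 'min', 'max']))
    
    # 3: three categories with (nearly) the same number of points each
    df['Usage_Per_Year'] = pd.qcut(df['Miles_Driven_Per_Year'],
                                   q=3, labels=group_label)
    
    # 4: group size and observed range again, now for the qcut categories
    print(df.groupby('Usage_Per_Year').agg(['count', 'min', 'max']))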
    

    Sample output:

                   Miles_Driven_Per_Year              
                                   count    min    max
    Usage_Per_Year                                    
    Low                              878     31  20905
    Medium                           107  20955  41196
    High                              15  41991  62668
                   Miles_Driven_Per_Year              
                                   count    min    max
    Usage_Per_Year                                    
    Low                              334     31   4378
    Medium                           333   4449  11424
    High                             333  11442  62668
    

    【Discussion】:

    • Thank you very much!