【Question title】: Python Pandas: Create Groups by Range using map
【Posted】: 2016-12-04 05:05:34
【Question】:

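I have a large dataset and I want to create groups based on the cumulative percentage of the total. I have done this with the map function, see the code below. Is there a better way to do this if I want to make my groups more granular? For example, right now I am looking at 5% increments; what if I wanted to look at 1% increments? I am wondering whether there is another way that does not require me to type every range explicitly into my "codethem" function.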

def codethem(dl):
    # The first test uses <= so that a value of exactly 0.05 lands in group '5'
    if dl <= .05: return '5'
    elif .05 < dl <= .1: return '10'
    elif .1 < dl <= .15: return '15'
    elif .15 < dl <= .2: return '20'
    elif .2 < dl <= .25: return '25'
    elif .25 < dl <= .3: return '30'
    elif .3 < dl <= .35: return '35'
    elif .35 < dl <= .4: return '40'
    elif .4 < dl <= .45: return '45'
    elif .45 < dl <= .5: return '50'
    elif .5 < dl <= .55: return '55'
    elif .55 < dl <= .6: return '60'
    elif .6 < dl <= .65: return '65'
    elif .65 < dl <= .7: return '70'
    elif .7 < dl <= .75: return '75'
    elif .75 < dl <= .8: return '80'
    elif .8 < dl <= .85: return '85'
    elif .85 < dl <= .9: return '90'
    elif .9 < dl <= .95: return '95'
    elif .95 < dl <= 1: return '100'
    else: return 'None'

my_df['code'] = my_df['sales_csum_aspercent'].map(codethem)

Thanks!

【Comments】:

    Tags: python-2.7 pandas dataframe


    【Solution 1】:

    There is a dedicated function for this: pd.cut()

    Demo:

    Create a random DF:

    In [393]: df = pd.DataFrame({'a': np.random.rand(10)})
    
    In [394]: df
    Out[394]:
              a
    0  0.860256
    1  0.399267
    2  0.209185
    3  0.773647
    4  0.294845
    5  0.883161
    6  0.985758
    7  0.559730
    8  0.723033
    9  0.126226
    

    We need to specify the bins when calling pd.cut():

    In [404]: np.linspace(0, 1, 11)
    Out[404]: array([ 0. ,  0.1,  0.2,  0.3,  0.4,  0.5,  0.6,  0.7,  0.8,  0.9,  1. ])
    
    In [395]: pd.cut(df.a, bins=np.linspace(0, 1, 11))
    Out[395]:
    0    (0.8, 0.9]
    1    (0.3, 0.4]
    2    (0.2, 0.3]
    3    (0.7, 0.8]
    4    (0.2, 0.3]
    5    (0.8, 0.9]
    6      (0.9, 1]
    7    (0.5, 0.6]
    8    (0.7, 0.8]
    9    (0.1, 0.2]
    Name: a, dtype: category
    Categories (10, object): [(0, 0.1] < (0.1, 0.2] < (0.2, 0.3] < (0.3, 0.4] ... (0.6, 0.7] < (0.7, 0.8] < (0.8, 0.9] < (0.9, 1]]
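
One detail worth checking in the output above: the intervals are closed on the right, so a value exactly equal to the lowest edge (0 here) falls outside the first bin, and anything outside the edges is left unassigned. A minimal sketch of that behaviour, using a small made-up Series (`values` is illustrative, not from the question) and the `include_lowest` parameter of pd.cut():

```python
import numpy as np
import pandas as pd

# Hypothetical sample values chosen to hit the edges of the bin range
values = pd.Series([0.0, 0.05, 0.1, 1.0, 1.2])
bins = np.linspace(0, 1, 11)

# Default: bins are (left, right], so 0.0 is not in (0, 0.1] and becomes NaN
default = pd.cut(values, bins=bins)

# include_lowest=True makes the first interval include its left edge
with_lowest = pd.cut(values, bins=bins, include_lowest=True)

print(default.isna().tolist())
print(with_lowest.isna().tolist())
```

For a cumulative percentage that is always greater than 0 the default is usually fine; values above 1 stay NaN in both cases.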
    

    If we want custom labels, we have to specify them explicitly:

    In [401]: bins = np.linspace(0,1, 11)
    

    NOTE: the number of bin labels must be one less than the number of bin edges

    In [402]: labels = (bins[1:]*100).astype(int)
    
    In [412]: labels
    Out[412]: array([ 10,  20,  30,  40,  50,  60,  70,  80,  90, 100])
    
    In [403]: pd.cut(df.a, bins=bins, labels=labels)
    Out[403]:
    0     90
    1     40
    2     30
    3     80
    4     30
    5     90
    6    100
    7     60
    8     80
    9     20
    Name: a, dtype: category
    Categories (10, int64): [10 < 20 < 30 < 40 ... 70 < 80 < 90 < 100]
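
A caveat on building the labels this way: `astype(int)` truncates rather than rounds, so with finer steps a product that comes out slightly below a whole number could lose one unit and produce a wrong or duplicate label. The sketch below is a suggested safeguard, not part of the original answer: it rounds before casting and checks the "one fewer label than edges" rule.

```python
import numpy as np

bins = np.linspace(0, 1, 101)  # 1% steps: 101 edges, hence 100 bins

# Truncating cast: a product such as 28.999... would become 28
truncated = (bins[1:] * 100).astype(int)

# Rounding first gives the intended whole-number labels
rounded = np.round(bins[1:] * 100).astype(int)

# pd.cut() needs exactly one label per bin, and the labels must be unique
assert len(rounded) == len(bins) - 1
print(len(set(truncated.tolist())), len(set(rounded.tolist())))
```

If the two printed counts differ, the truncating version produced duplicate labels and the rounded labels are the ones to pass to pd.cut().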
    

    Let's do the same with a 5% step:

    In [419]: bins = np.linspace(0, 1, 21)
    
    In [420]: bins
    Out[420]: array([ 0.  ,  0.05,  0.1 ,  0.15,  0.2 ,  0.25,  0.3 ,  0.35,  0.4 ,  0.45,  0.5 ,  0.55,  0.6 ,  0.65,  0.7 ,  0.75,  0.8 ,  0.85,  0.9 ,  0.95,  1.  ])
    
    In [421]: labels = (bins[1:]*100).astype(int)
    
    In [422]: labels
    Out[422]: array([  5,  10,  15,  20,  25,  30,  35,  40,  45,  50,  55,  60,  65,  70,  75,  80,  85,  90,  95, 100])
    
    In [423]: pd.cut(df.a, bins=bins, labels=labels)
    Out[423]:
    0     90
    1     40
    2     25
    3     80
    4     30
    5     90
    6    100
    7     60
    8     75
    9     15
    Name: a, dtype: category
    Categories (20, int64): [5 < 10 < 15 < 20 ... 85 < 90 < 95 < 100]
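
To tie this back to the question, the step size can be made a parameter so that switching from 5% to 1% is a one-argument change. This is a minimal sketch under some assumptions: `code_by_step` and `step_pct` are hypothetical names, the `sales` figures are made up, and only the column names `sales_csum_aspercent` and `code` come from the question. The labels are built as integers first and the bin edges derived from them, which avoids the float-to-int cast altogether.

```python
import numpy as np
import pandas as pd

def code_by_step(values, step_pct=5):
    """Label each fraction in [0, 1] with the upper edge of its bin, in percent."""
    if 100 % step_pct != 0:
        raise ValueError("step_pct must divide 100 evenly")
    labels = np.arange(step_pct, 101, step_pct)      # e.g. 5, 10, ..., 100
    bins = np.concatenate(([0.0], labels / 100.0))   # e.g. 0.0, 0.05, ..., 1.0
    return pd.cut(values, bins=bins, labels=labels, include_lowest=True)

# Made-up data standing in for the asker's dataset
my_df = pd.DataFrame({"sales": [40, 30, 20, 10]})
my_df["sales_csum_aspercent"] = my_df["sales"].cumsum() / my_df["sales"].sum()

my_df["code"] = code_by_step(my_df["sales_csum_aspercent"], step_pct=5)
my_df["code_1pct"] = code_by_step(my_df["sales_csum_aspercent"], step_pct=1)
print(my_df)
```

Unlike `codethem`, this returns an ordered categorical with integer labels rather than strings, and values outside [0, 1] come back as NaN instead of 'None'.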
    

    【Comments】:

    • @esc, did this help?