【Question Title】: How to group by multiple columns and create a new column in Python based on thresholds
【Posted】: 2020-06-24 16:14:11
【Question】:

I have the following DataFrame.

Input

Invoice No  Date        Text            Vendor  Days
1000001     1/1/2020    Rent Payment    A       0
1000003     2/1/2020    Rent Payment    A       1
1000005     4/1/2020    Rent Payment    A       2
1000007     6/1/2020    Water payment   A       2
1000008     9/2/2020    Rep Payment     A       34
1000010     9/2/2020    Car Payment     A       0
1000011     10/2/2020   Car Payment     A       1
1000012     15/2/2020   Car Payment     A       5
1000013     16/2/2020   Car Payment     A       1
1000015     17/2/2020   Car Payment     A       1
1000002     1/1/2020    Rent Payment    B       -47
1000004     4/1/2020    Con Payment     B       3
1000006     6/1/2020    Con Payment     B       2
1000009     9/2/2020    Water payment   B       34
1000014     17/2/2020   Test Payment    B       8
1000016     19/2/2020   Test Payment    B       2
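
For anyone who wants to reproduce this, here is a minimal sketch that builds the input as a pandas DataFrame named `df`. The variable names `rows` and `df` are my own choice, and the dates are kept as plain strings, exactly as they appear in the table.

```python
import pandas as pd

# Rows copied from the input table; dates are left as strings on purpose.
rows = [
    (1000001, "1/1/2020", "Rent Payment", "A", 0),
    (1000003, "2/1/2020", "Rent Payment", "A", 1),
    (1000005, "4/1/2020", "Rent Payment", "A", 2),
    (1000007, "6/1/2020", "Water payment", "A", 2),
    (1000008, "9/2/2020", "Rep Payment", "A", 34),
    (1000010, "9/2/2020", "Car Payment", "A", 0),
    (1000011, "10/2/2020", "Car Payment", "A", 1),
    (1000012, "15/2/2020", "Car Payment", "A", 5),
    (1000013, "16/2/2020", "Car Payment", "A", 1),
    (1000015, "17/2/2020", "Car Payment", "A", 1),
    (1000002, "1/1/2020", "Rent Payment", "B", -47),
    (1000004, "4/1/2020", "Con Payment", "B", 3),
    (1000006, "6/1/2020", "Con Payment", "B", 2),
    (1000009, "9/2/2020", "Water payment", "B", 34),
    (1000014, "17/2/2020", "Test Payment", "B", 8),
    (1000016, "19/2/2020", "Test Payment", "B", 2),
]

df = pd.DataFrame(rows, columns=["Invoice No", "Date", "Text", "Vendor", "Days"])
print(df.shape)  # (16, 5)
```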

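Condition

How do I write a Python condition that checks the description, vendor name, and Days columns? If the description and vendor name are the same and the number of days is

Expected output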

Invoice No  Date        Text          Vendor   Days    Group
1000001     1/1/2020    Rent Payment    A       0        G1
1000003     2/1/2020    Rent Payment    A       1        G1
1000005     4/1/2020    Rent Payment    A       2        G1
1000007     6/1/2020    Water payment   A       2        G2
1000008     9/2/2020    Rep Payment     A       34       G3
1000010     9/2/2020    Car Payment     A       0        G4
1000011    10/2/2020    Car Payment     A       1        G4
1000012    15/2/2020    Car Payment     A       5        G5
1000013    16/2/2020    Car Payment     A       1        G5
1000015    17/2/2020    Car Payment     A       1        G5
1000002    1/1/2020     Rent Payment    B      -47       G6
1000004    4/1/2020     Con Payment     B       3        G7
1000006    6/1/2020     Con Payment     B       2        G7
1000009    9/2/2020     Water payment   B      34        G8
1000014    17/2/2020    Test Payment    B       8        G9
1000016    19/2/2020    Test Payment    B       2        G9

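【Comments】:

    Tags: python python-3.x pandas dataframe group-by


    【Solution 1】:

    You need to use groupby on three items: 'Text', 'Vendor', and a Boolean representation of whether the change in 'Days' is more than 2 within the groups defined by just ['Text', 'Vendor'].

    After that, you need to name the unique groups. I provide two ways to do that below.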

    ngroup

    f = lambda x: x.diff().fillna(0).gt(2).cumsum()
    d = df.groupby(['Text', 'Vendor']).Days.transform(f)
    g = df.groupby(['Text', 'Vendor', d], sort=False).ngroup()
    df.assign(Group=g.add(1).astype(str).radd('G'))
    
        Invoice No       Date           Text Vendor  Days Group
    0      1000001   1/1/2020   Rent Payment      A     0    G1
    1      1000003   2/1/2020   Rent Payment      A     1    G1
    2      1000005   4/1/2020   Rent Payment      A     2    G1
    3      1000007   6/1/2020  Water payment      A     2    G2
    4      1000008   9/2/2020    Rep Payment      A    34    G3
    5      1000010   9/2/2020    Car Payment      A     0    G4
    6      1000011  10/2/2020    Car Payment      A     1    G4
    7      1000012  15/2/2020    Car Payment      A     5    G5
    8      1000013  16/2/2020    Car Payment      A     1    G5
    9      1000015  17/2/2020    Car Payment      A     1    G5
    10     1000002   1/1/2020   Rent Payment      B   -47    G6
    11     1000004   4/1/2020    Con Payment      B     3    G7
    12     1000006   6/1/2020    Con Payment      B     2    G7
    13     1000009   9/2/2020  Water payment      B    34    G8
    14     1000014  17/2/2020   Test Payment      B     8    G9
    15     1000016  19/2/2020   Test Payment      B     2    G9
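
The same idea can be wrapped in a small reusable function. This is only a sketch: the helper name `add_group_labels`, its parameters, and the six-row `sample` frame are hypothetical and not part of the original answer. The threshold of 2 is taken from the `gt(2)` call above.

```python
import pandas as pd

def add_group_labels(df, keys=("Text", "Vendor"), days_col="Days", threshold=2):
    """Return a copy of df with a 'Group' column (hypothetical helper)."""
    keys = list(keys)
    # Running count of increases larger than `threshold` inside each key group.
    jumps = df.groupby(keys)[days_col].transform(
        lambda s: s.diff().fillna(0).gt(threshold).cumsum()
    )
    # Number the (keys..., jumps) combinations in order of first appearance.
    codes = df.groupby(keys + [jumps], sort=False).ngroup()
    return df.assign(Group="G" + (codes + 1).astype(str))

# Hypothetical sample: the Car Payment rows plus one Rent Payment row.
sample = pd.DataFrame({
    "Text":   ["Car Payment"] * 5 + ["Rent Payment"],
    "Vendor": ["A"] * 5 + ["B"],
    "Days":   [0, 1, 5, 1, 1, -47],
})

print(add_group_labels(sample)["Group"].tolist())
# ['G1', 'G1', 'G2', 'G2', 'G2', 'G3']
```

Passing `sort=False` is what keeps the labels in order of first appearance rather than in sorted key order.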
    

    factorize

    f = lambda x: x.diff().fillna(0).gt(2).cumsum()
    d = df.groupby(['Text', 'Vendor']).Days.transform(f)
    g = pd.factorize([*zip(df.Text, df.Vendor, d)])[0]
    df.assign(Group=[f'G{i + 1}' for i in g])
    
        Invoice No       Date           Text Vendor  Days Group
    0      1000001   1/1/2020   Rent Payment      A     0    G1
    1      1000003   2/1/2020   Rent Payment      A     1    G1
    2      1000005   4/1/2020   Rent Payment      A     2    G1
    3      1000007   6/1/2020  Water payment      A     2    G2
    4      1000008   9/2/2020    Rep Payment      A    34    G3
    5      1000010   9/2/2020    Car Payment      A     0    G4
    6      1000011  10/2/2020    Car Payment      A     1    G4
    7      1000012  15/2/2020    Car Payment      A     5    G5
    8      1000013  16/2/2020    Car Payment      A     1    G5
    9      1000015  17/2/2020    Car Payment      A     1    G5
    10     1000002   1/1/2020   Rent Payment      B   -47    G6
    11     1000004   4/1/2020    Con Payment      B     3    G7
    12     1000006   6/1/2020    Con Payment      B     2    G7
    13     1000009   9/2/2020  Water payment      B    34    G8
    14     1000014  17/2/2020   Test Payment      B     8    G9
    15     1000016  19/2/2020   Test Payment      B     2    G9
    

    Some details

    #        The first element of group    Cumulatively summing True/False
    #        will get NaN so we fill it    will create a new value every time
    #        in with 0         ║           we see a True.  This creates groups
    #                          ║               ║     
    #         adjacent differences   Should be obvious
    #               ╭─┴──╮ ╭───╨───╮ ╭─┴─╮ ╭───╨──╮
    f = lambda x: x.diff().fillna(0).gt(2).cumsum()
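
To make the chain concrete, here is a step-by-step sketch on the 'Days' values of the Car Payment rows for vendor A. The variable names are mine, chosen only for illustration.

```python
import pandas as pd

# 'Days' values of the Car Payment / vendor A rows from the question.
days = pd.Series([0, 1, 5, 1, 1])

step1 = days.diff()      # NaN, 1, 4, -4, 0   (difference to the previous row)
step2 = step1.fillna(0)  # 0, 1, 4, -4, 0     (first row has no predecessor)
step3 = step2.gt(2)      # False, False, True, False, False
step4 = step3.cumsum()   # 0, 0, 1, 1, 1      (new value at every True)

print(step4.tolist())  # [0, 0, 1, 1, 1]
```

Note that the drop from 5 to 1 gives a difference of -4, which is not greater than 2, so it does not start a new group. If decreases should also split groups, comparing the absolute difference would be one option, but that goes beyond what the answer does.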
    

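    【Comments】:

    • @piSquared, there was a small error in the expected input and output I provided; I have just corrected it.
    • What I meant to ask was to check the vendor, description, and days columns: if the vendor and description are the same and the difference in days between adjacent rows is
    【Solution 2】:

    You can combine your conditions into a groupby and use ngroup.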

    # Prefix with 'G' directly; str.pad(2, 'left', 'G') would leave labels >= 10 without the prefix.
    df['Group'] = 'G' + (
        df.groupby([df['Description'].ne(df['Description'].shift()).cumsum(),
                    df['Vendor'].ne(df['Vendor'].shift()).cumsum(),
                    df['Days'] <= 2])
          .ngroup() + 1
    ).astype(str)
    
    # same description : df['Description'].ne(df['Description'].shift()).cumsum()
    # same vendor : df['Vendor'].ne(df['Vendor'].shift()).cumsum()
    # Days<=2 : df['Days']<=2
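
The `ne(shift()).cumsum()` pattern used for the first two keys assigns one id per run of consecutive equal values. A minimal sketch on a made-up Series (the values and variable names are illustrative only):

```python
import pandas as pd

s = pd.Series(["Rent", "Rent", "Water", "Rent", "Rep", "Rep"])

changed = s.ne(s.shift())  # True wherever the value differs from the row above
run_id = changed.cumsum()  # one id per run of consecutive equal values

print(run_id.tolist())  # [1, 1, 2, 3, 4, 4]
```

Unlike grouping on the column value itself, the second "Rent" run gets its own id (3) because it is not adjacent to the first one.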
    

    Output:

        Invoice No  Date    Description Vendor  Days    Group
    0   123456  2020-01-01  Rent Payment    A   0   G1
    1   123457  2020-02-01  Rent Payment    A   1   G1
    2   123458  2020-04-01  Rent Payment    A   2   G1
    3   123459  2020-06-01  Water Payment   A   2   G2
    4   123460  2020-09-02  Rent Payment    A   34  G3
    5   123461  2020-09-02  Rep Payment     A   0   G4
    6   123462  2020-10-02  Rep Payment     A   1   G4
    7   123463  2020-11-02  Rep Payment     A   2   G4
    8   123464  2020-02-20  Water Payment   A   11  G5
    

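    【Comments】:

    • It assigns the groups correctly up to row 4; from row 5 onward it is incorrect.
    • Why are you treating Rent Payment and Rep Payment as the same Description? @Rahulrajan
    • Sorry, that was a typo. What I meant to ask was to check the vendor, description, and days columns: if the vendor and description are the same and the difference in days between adjacent rows is
    • That is exactly what it does. @Rahulrajan, there seems to be an error in your example: Rent Payment and Rep Payment are in the same group even though they are not the same.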