【Question title】: Want to find year-on-year calculation using groupby and apply for various years
【Posted】: 2020-11-16 04:22:14
【Question】:

I have a dataframe like the following:

        MARKET     PRODUCT TIMEPERIOD        DATE  VALUES
0   USA MARKET       APPLE    QUARTER  2020-06-01     100
1   USA MARKET       APPLE     YEARLY  2020-06-01    1000
2   USA MARKET        PEAR    QUARTER  2020-06-01     200
3   USA MARKET        PEAR     YEARLY  2020-06-01    5000
4   USA MARKET       APPLE    QUARTER  2019-06-01     300
5   USA MARKET       APPLE     YEARLY  2019-06-01    2000
6   USA MARKET        PEAR    QUARTER  2019-06-01     100
7   USA MARKET        PEAR     YEARLY  2019-06-01    3000
8   USA MARKET       APPLE    QUARTER  2018-06-01     300
9   USA MARKET       APPLE     YEARLY  2018-06-01    2000
10  USA MARKET        PEAR    QUARTER  2018-06-01     100
11  USA MARKET        PEAR     YEARLY  2018-06-01    3000
12   UK MARKET  WATERMELON    QUARTER  2020-06-01     200
13   UK MARKET  WATERMELON     YEARLY  2020-06-01    5000
14   UK MARKET       GRAPE    QUARTER  2020-06-01     200
15   UK MARKET       GRAPE     YEARLY  2020-06-01    5000
16   UK MARKET  WATERMELON    QUARTER  2019-06-01     500
17   UK MARKET  WATERMELON     YEARLY  2019-06-01     300
18   UK MARKET       GRAPE    QUARTER  2019-06-01      50
19   UK MARKET       GRAPE     YEARLY  2019-06-01     500
20   UK MARKET  WATERMELON    QUARTER  2018-06-01     500
21   UK MARKET  WATERMELON     YEARLY  2018-06-01     300
22   UK MARKET       GRAPE    QUARTER  2018-06-01      50
23   UK MARKET       GRAPE     YEARLY  2018-06-01     500

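I want to find the year-on-year difference for each product, in each market, for each time period (that's a mouthful!). For example, for product APPLE in USA MARKET with TIMEPERIOD QUARTER, the growth rate at 2020-06-01 is simply (100 - 300) / 300 ≈ -66.7%: the 2020-06-01 value minus the 2019-06-01 value, divided by the 2019-06-01 value.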
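As a quick check of that arithmetic, here is a minimal sketch. The helper name `yoy_growth` is made up for illustration and is not part of the question's code.

```python
def yoy_growth(current, previous):
    """Year-on-year growth as a fraction: (current - previous) / previous."""
    return (current - previous) / previous

# USA MARKET / APPLE / QUARTER: 100 at 2020-06-01, 300 at 2019-06-01
apple_quarter_growth = yoy_growth(100, 300)
print(f"{apple_quarter_growth:.1%}")  # -66.7%
```

The sign matters here: a drop from 300 to 100 is a decline of about two thirds, not growth.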

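As you can see, the problem with the code below is that it only returns the growth rate of the current year against the previous year, and ignores the earlier years 2019 and 2018. I tried several if-else blocks, but they all ended in errors, so I would appreciate any neat solution to this. In short, my growth_rate_prev is not used here (I did try to weave it in, but it failed).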

from dateutil.relativedelta import relativedelta

def year_on_year(df):
    try:
        latest = max(df['DATE'])
        curr_year_val = df[df['DATE'] == latest]['VALUES'].sum()
        prev_year_val = df[df['DATE'] == latest - relativedelta(months=12)]['VALUES'].sum()
        prev_prev_year_val = df[df['DATE'] == latest - relativedelta(months=24)]['VALUES'].sum()

        growth_rate_curr = (curr_year_val - prev_year_val) / prev_year_val
        growth_rate_prev = (prev_year_val - prev_prev_year_val) / prev_prev_year_val
    except ZeroDivisionError:
        growth_rate_curr, growth_rate_prev = 0, 0

    # Only the latest year's growth is returned; growth_rate_prev is never used
    return growth_rate_curr


def product_growth(applied_group_df):
    applied_group_df['Year on Year difference'] = year_on_year(applied_group_df)
    return applied_group_df

growth_rate_df = df_2.groupby(['TIMEPERIOD', 'MARKET', 'PRODUCT']).apply(product_growth)

If anyone wants to reproduce this, you can create the df with the following code:

import pandas as pd

df_list_for_yoy = [['USA MARKET', 'APPLE', 'QUARTER', '2020-06-01', 100], ['USA MARKET', 'APPLE', 'YEARLY', '2020-06-01', 1000],
           ['USA MARKET', 'PEAR', 'QUARTER', '2020-06-01', 200],  ['USA MARKET', 'PEAR', 'YEARLY', '2020-06-01', 5000], 
           ['USA MARKET', 'APPLE', 'QUARTER', '2019-06-01', 300],  ['USA MARKET', 'APPLE', 'YEARLY', '2019-06-01', 2000],
           ['USA MARKET', 'PEAR', 'QUARTER', '2019-06-01', 100],  ['USA MARKET', 'PEAR', 'YEARLY', '2019-06-01', 3000],
           ['USA MARKET', 'APPLE', 'QUARTER', '2018-06-01', 300],  ['USA MARKET', 'APPLE', 'YEARLY', '2018-06-01', 2000],
           ['USA MARKET', 'PEAR', 'QUARTER', '2018-06-01', 100],  ['USA MARKET', 'PEAR', 'YEARLY', '2018-06-01', 3000],
           ['UK MARKET', 'WATERMELON', 'QUARTER', '2020-06-01', 200],  ['UK MARKET', 'WATERMELON', 'YEARLY', '2020-06-01', 5000], 
           ['UK MARKET', 'GRAPE', 'QUARTER', '2020-06-01', 200],    ['UK MARKET', 'GRAPE', 'YEARLY', '2020-06-01', 5000],
           ['UK MARKET', 'WATERMELON', 'QUARTER', '2019-06-01', 500],  ['UK MARKET', 'WATERMELON', 'YEARLY', '2019-06-01', 300], 
           ['UK MARKET', 'GRAPE', 'QUARTER', '2019-06-01', 50],    ['UK MARKET', 'GRAPE', 'YEARLY', '2019-06-01', 500],
           ['UK MARKET', 'WATERMELON', 'QUARTER', '2018-06-01', 500],  ['UK MARKET', 'WATERMELON', 'YEARLY', '2018-06-01', 300], 
           ['UK MARKET', 'GRAPE', 'QUARTER', '2018-06-01', 50],    ['UK MARKET', 'GRAPE', 'YEARLY', '2018-06-01', 500]]

column_names = ['MARKET', 'PRODUCT', 'TIMEPERIOD', 'DATE', 'VALUES']
df_2 = pd.DataFrame(df_list_for_yoy, columns = column_names)
df_2['DATE']= pd.to_datetime(df_2['DATE'])

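【Comments】:

  • Note that (100-300)/300 is about 66.6% "negative growth".
  • Should we assume the dataframe only has values for 2020, 2019 and 2018, or can there be more?
  • @sharathnatraj It may have more; in my real data it goes back to 2013.

Tags: python pandas dataframe group-by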
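Since the data may cover more years, one way to avoid hard-coding them is pandas' grouped `pct_change`. This is only a sketch, not taken from any of the answers. It assumes each group has exactly one row per year and no missing years, and the `sample` frame below is a small stand-in for the real data.

```python
import pandas as pd

# Small stand-in for the question's data (same column layout)
sample = pd.DataFrame({
    "MARKET": ["USA MARKET"] * 6,
    "PRODUCT": ["APPLE"] * 3 + ["PEAR"] * 3,
    "TIMEPERIOD": ["QUARTER"] * 6,
    "DATE": pd.to_datetime(["2020-06-01", "2019-06-01", "2018-06-01"] * 2),
    "VALUES": [100, 300, 300, 200, 100, 100],
})

keys = ["TIMEPERIOD", "MARKET", "PRODUCT"]
# Sort so that each group's rows run from oldest to newest
sample = sample.sort_values(keys + ["DATE"])
# pct_change compares each row with the previous row of the same group;
# the oldest row of every group has nothing to compare with and gets NaN
sample["YoY"] = sample.groupby(keys)["VALUES"].pct_change()
print(sample)
```

If a year is missing inside a group, this compares against the nearest earlier row instead of the row exactly one year back, so it only fits data with no gaps.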


【Solution 1】:

You can use itertools.combinations to get the year-to-year pairs, and do the rest of the work inside the function that is applied to each group, as follows:

import numpy as np
import pandas as pd
from itertools import combinations

def get_annual_growth(grp):
    # Get all possible pairs of years in the group, each pair sorted ascending
    year_comb_lists = np.sort([sorted(comb) for comb in combinations(grp.DATE.dt.year, 2)])
    # Keep only pairs of consecutive years (drops, for example, 2018-2020)
    year_comb_lists = year_comb_lists[(np.diff(year_comb_lists) == 1).flatten()] # comment this line out if that is not wanted
    # Get year-pair labels such as '2019-2020'
    year_comb_strings = ['-'.join(map(str, comb)) for comb in year_comb_lists]

    # Build one sub-dataframe per group; pandas `groupby` concatenates them afterwards
    subdf = pd.DataFrame(columns=['Annual Reference', 'Annual Growth (%)'])
    for i, years in enumerate(year_comb_lists): # for each year pair ...
        actual_value, last_value = grp[grp.DATE.dt.year==years[1]].VALUES.mean(), grp[grp.DATE.dt.year==years[0]].VALUES.mean()
        growth = round((actual_value - last_value) / last_value * 100, 2) # annual growth in percent
        subdf.loc[i, :] = [year_comb_strings[i], growth]
    return subdf

df_2.groupby(['TIMEPERIOD','MARKET', 'PRODUCT']).apply(get_annual_growth)

Output:

                                   Annual Reference Annual Growth (%)
TIMEPERIOD MARKET     PRODUCT                                        
QUARTER    UK MARKET  GRAPE      0        2019-2020               300
                                 1        2018-2019                 0
                      WATERMELON 0        2019-2020               -60
                                 1        2018-2019                 0
           USA MARKET APPLE      0        2019-2020            -66.67
                                 1        2018-2019                 0
                      PEAR       0        2019-2020               100
                                 1        2018-2019                 0
YEARLY     UK MARKET  GRAPE      0        2019-2020               900
                                 1        2018-2019                 0
                      WATERMELON 0        2019-2020           1566.67
                                 1        2018-2019                 0
           USA MARKET APPLE      0        2019-2020               -50
                                 1        2018-2019                 0
                      PEAR       0        2019-2020             66.67
                                 1        2018-2019                 0

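【Comments】:

  • Thanks, this looks good too! I have never used combinations before.
  • This works well because you don't have to worry about the combinations yourself, and it scales. By the way, your question is an interesting one! Haha
  • Yes, an interesting problem indeed: something that is easy in Excel turns out to be rather hard in pandas. Actually, I haven't quite understood the 0, 1, 0, 1 column.
  • To clarify, in the part where you wrote VALUES.mean(), the mean() is purely there to turn the Series value into a float, right?
  • Thanks for the detailed reply. I went through your code line by line and made some changes in my own answer below. Please take a look if you like :)
【Solution 2】: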
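On the 0, 1, 0, 1 column asked about above: when the function passed to `groupby(...).apply` returns a DataFrame, pandas puts the group keys in front of that frame's own index. In Solution 1 the inner labels come from `subdf.loc[i, :]`. The sketch below reproduces this with made-up data and a hypothetical helper `two_rows`, then drops the inner level.

```python
import pandas as pd

df = pd.DataFrame({
    "PRODUCT": ["APPLE", "APPLE", "PEAR", "PEAR"],
    "VALUES": [100, 300, 200, 100],
})

def two_rows(grp):
    # Returns a frame whose own index is 0, 1 (like `subdf` in Solution 1)
    return pd.DataFrame({"total": [grp["VALUES"].sum()] * 2}, index=[0, 1])

out = df.groupby("PRODUCT")[["VALUES"]].apply(two_rows)
print(out.index.nlevels)   # 2: the group key plus the inner 0/1 labels
flat = out.droplevel(-1)   # drop the inner labels, keep the group key
print(flat)
```

`droplevel(-1)` removes only the innermost index level, so the group keys stay as the index.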

Here is another approach.

df = df_2.groupby(['MARKET','TIMEPERIOD','PRODUCT'])['VALUES'].apply(list).reset_index()
def func(x):
    year = 2021
    # x['VALUES'] is ordered newest first, so index i-1 is the year after index i
    for i in range(1,len(x['VALUES'])):
        colname = str(year-i) + '-Growth'
        x[colname] = round((x['VALUES'][i-1] - x['VALUES'][i])/x['VALUES'][i]*100,2)
    return x
df = df.apply(lambda x: func(x), axis=1).drop('VALUES',axis=1)
print(df)

The code is generic and should also work for all the earlier years mentioned in the comments, back to 2013, provided the rows of each group are ordered newest first and the latest year is 2020.

Prints:

       MARKET TIMEPERIOD     PRODUCT  2020-Growth  2019-Growth
0   UK MARKET    QUARTER       GRAPE       300.00          0.0
1   UK MARKET    QUARTER  WATERMELON       -60.00          0.0
2   UK MARKET     YEARLY       GRAPE       900.00          0.0
3   UK MARKET     YEARLY  WATERMELON      1566.67          0.0
4  USA MARKET    QUARTER       APPLE       -66.67          0.0
5  USA MARKET    QUARTER        PEAR       100.00          0.0
6  USA MARKET     YEARLY       APPLE       -50.00          0.0
7  USA MARKET     YEARLY        PEAR        66.67          0.0

Explanation:

First, I group the values and collect them into a list:

df_2.groupby(['MARKET','TIMEPERIOD','PRODUCT'])['VALUES'].apply(list).reset_index()

For example:

       MARKET TIMEPERIOD     PRODUCT              VALUES
0   UK MARKET    QUARTER       GRAPE       [200, 50, 50]
1   UK MARKET    QUARTER  WATERMELON     [200, 500, 500]
....

Then I wrote an apply that loops over the 'VALUES' list column and does the growth calculation.
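A similar one-column-per-year layout can also be reached without hard-coding a year, by pivoting years into columns. This is a sketch of an alternative, not the answer's own method. The `YEAR` column and the `sample` frame are introduced here for illustration, and each group is assumed to have one value per year.

```python
import pandas as pd

sample = pd.DataFrame({
    "MARKET": ["UK MARKET"] * 6,
    "TIMEPERIOD": ["QUARTER"] * 6,
    "PRODUCT": ["GRAPE"] * 3 + ["WATERMELON"] * 3,
    "DATE": pd.to_datetime(["2020-06-01", "2019-06-01", "2018-06-01"] * 2),
    "VALUES": [200, 50, 50, 200, 500, 500],
})
sample["YEAR"] = sample["DATE"].dt.year

# One row per group, one column per year (columns come out in ascending order)
wide = sample.pivot_table(
    index=["MARKET", "TIMEPERIOD", "PRODUCT"],
    columns="YEAR",
    values="VALUES",
    aggfunc="sum",
)

# Compare every year column with the column to its left (the previous year)
prev = wide.shift(1, axis=1)
growth = (wide - prev) / prev * 100
growth = growth.iloc[:, 1:]  # the first year has no predecessor
growth.columns = [f"{year}-Growth" for year in growth.columns]
print(growth.round(2))
```

Because the column labels come from the data, the same code handles any number of years, and the sign of a decline is kept.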

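【Comments】:

  • Looks good, but don't you think we should also group by market?
  • Yes, that's right. I forgot that you wanted 'MARKET' as well. You can add 'MARKET' to the groupby as you suggest.
【Solution 3】: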

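I made some changes to Solution 1 to fit my real data, which has several different months within a year. There may be 2020-06-01, 2020-03-01, 2019-12-01 and so on, so I had to make the following changes to get pairs of dates exactly one year apart, i.e. [2019-06-01, 2020-06-01], [2019-03-01, 2020-03-01], [2018-12-01, 2019-12-01], and so on.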

import pandas as pd
from itertools import combinations
from dateutil.relativedelta import relativedelta

def get_annual_growth(grp):
    # Get all possible pairs of dates in the group, each pair sorted ascending
    year_comb_lists = [sorted(comb) for comb in combinations(grp['DATE'], 2)]
    # Keep only the pairs that are exactly 12 months apart
    new_year_comb_lists = [comb_dates for comb_dates in year_comb_lists if comb_dates[0] == comb_dates[1] - relativedelta(months=12)]

    # Label each pair with its later date
    year_comb_strings = [comb[1] for comb in new_year_comb_lists]

    # Build one sub-dataframe per group; pandas `groupby` concatenates them afterwards
    subdf = pd.DataFrame(columns=['Annual Reference', 'Annual Growth (%)'])
    for i, years in enumerate(new_year_comb_lists): # for each pair of dates ...
        actual_value, last_value = grp[grp['DATE']==years[1]].VALUES.mean(), grp[grp['DATE']==years[0]].VALUES.mean()
        growth = (actual_value - last_value) / last_value * 100 # annual growth in percent
        subdf.loc[i, :] = [year_comb_strings[i], growth]
    return subdf

# Keep the result of the apply; the group keys become columns after reset_index
growth_df = df_2.groupby(['TIMEPERIOD','MARKET', 'PRODUCT']).apply(get_annual_growth).reset_index()
growth_df['Annual Reference'] = pd.to_datetime(growth_df['Annual Reference'])
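Pairing each date with the date exactly one year earlier can also be written as a self-merge, which avoids building every combination of dates. This is a sketch under assumptions, not the author's method. The column names `PREV_VALUES` and `YoY (%)` are invented here, and the March values in `sample` are made up.

```python
import pandas as pd

sample = pd.DataFrame({
    "MARKET": ["USA MARKET"] * 4,
    "PRODUCT": ["APPLE"] * 4,
    "TIMEPERIOD": ["QUARTER"] * 4,
    "DATE": pd.to_datetime(["2020-06-01", "2020-03-01", "2019-06-01", "2019-03-01"]),
    "VALUES": [100, 150, 300, 100],  # the March figures are hypothetical
})

keys = ["TIMEPERIOD", "MARKET", "PRODUCT"]

# Copy of the data with every date moved forward one year, so that a row
# dated 2019-06-01 lines up with the 2020-06-01 row of the same group
previous = sample.copy()
previous["DATE"] = previous["DATE"] + pd.DateOffset(years=1)
previous = previous.rename(columns={"VALUES": "PREV_VALUES"})

# Left merge: rows with no value one year earlier get NaN
merged = sample.merge(previous, on=keys + ["DATE"], how="left")
merged["YoY (%)"] = (merged["VALUES"] - merged["PREV_VALUES"]) / merged["PREV_VALUES"] * 100
print(merged)
```

The merge keeps one row per original observation, so the growth lands next to the date it belongs to.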

【Comments】:

  • Nice! Although I think you need to loop over new_year_comb_lists rather than year_comb_lists for this to work, since that is what year_comb_strings is built from.