【Question title】: Pandas shift on quarterly data with missing quarters
【Posted】: 2021-06-12 23:59:21
【Question description】:

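I know there are a few similar questions here, but please read on: I have looked at the existing solutions and tried to adapt them, without any luck. I have a dataframe that holds data by year and quarter. In the scenario shown below, prevYearLeadCount is populated for Q1 2020. To be clear, prevYearLeadCount should always show the lead count for the same quarter of the previous year. The data below is just a sample to show the structure. Also, looking at the data below, since there is data for Q4 2019, I expect prevYearLeadCount for Q4 2020 to equal 236.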

[
    {
        "salesforceAccountId": 3148,
        "accountName": "Account Name",
        "year": 2017,
        "quarter": 2,
        "leadCount": 151,
        "prevYearLeadCount": 0.0
    },
    {
        "salesforceAccountId": 3148,
        "accountName": "Account Name",
        "year": 2018,
        "quarter": 2,
        "leadCount": 73,
        "prevYearLeadCount": 151.0
    },
    {
        "salesforceAccountId": 3148,
        "accountName": "Account Name",
        "year": 2018,
        "quarter": 3,
        "leadCount": 271,
        "prevYearLeadCount": 0.0
    },
    {
        "salesforceAccountId": 3148,
        "accountName": "Account Name",
        "year": 2018,
        "quarter": 4,
        "leadCount": 173,
        "prevYearLeadCount": 0.0
    },
    {
        "salesforceAccountId": 3148,
        "accountName": "Account Name",
        "year": 2019,
        "quarter": 1,
        "leadCount": 209,
        "prevYearLeadCount": 0.0
    },
    {
        "salesforceAccountId": 3148,
        "accountName": "Account Name",
        "year": 2019,
        "quarter": 2,
        "leadCount": 274,
        "prevYearLeadCount": 0.0
    },
    {
        "salesforceAccountId": 3148,
        "accountName": "Account Name",
        "year": 2019,
        "quarter": 3,
        "leadCount": 311,
        "prevYearLeadCount": 0.0
    },
    {
        "salesforceAccountId": 3148,
        "accountName": "Account Name",
        "year": 2019,
        "quarter": 4,
        "leadCount": 236,
        "prevYearLeadCount": 0.0
    },
    {
        "salesforceAccountId": 3148,
        "accountName": "Account Name",
        "year": 2020,
        "quarter": 1,
        "leadCount": 245,
        "prevYearLeadCount": 209.0
    },
    {
        "salesforceAccountId": 3148,
        "accountName": "Account Name",
        "year": 2020,
        "quarter": 2,
        "leadCount": 430,
        "prevYearLeadCount": 0.0
    },
    {
        "salesforceAccountId": 3148,
        "accountName": "Account Name",
        "year": 2020,
        "quarter": 3,
        "leadCount": 907,
        "prevYearLeadCount": 0.0
    },
    {
        "salesforceAccountId": 3148,
        "accountName": "Account Name",
        "year": 2020,
        "quarter": 4,
        "leadCount": 657,
        "prevYearLeadCount": 0.0
    },
    {
        "salesforceAccountId": 3148,
        "accountName": "Account Name",
        "year": 2021,
        "quarter": 1,
        "leadCount": 609,
        "prevYearLeadCount": 245.0
    }
]

Looking at the data above, I expect 2020 to look like this:

{
    "salesforceAccountId": 3148,
    "accountName": "Account Name",
    "year": 2020,
    "quarter": 1,
    "leadCount": 245,
    "prevYearLeadCount": 209.0
},
{
    "salesforceAccountId": 3148,
    "accountName": "Account Name",
    "year": 2020,
    "quarter": 2,
    "leadCount": 430,
    "prevYearLeadCount": 274
},
{
    "salesforceAccountId": 3148,
    "accountName": "Account Name",
    "year": 2020,
    "quarter": 3,
    "leadCount": 907,
    "prevYearLeadCount": 311
},
{
    "salesforceAccountId": 3148,
    "accountName": "Account Name",
    "year": 2020,
    "quarter": 4,
    "leadCount": 657,
    "prevYearLeadCount": 236 
},
{
    "salesforceAccountId": 3148,
    "accountName": "Account Name",
    "year": 2021,
    "quarter": 1,
    "leadCount": 609,
    "prevYearLeadCount": 245.0
}

As seen here, I tried the following:

df['prev_year_lead_count'] = df.groupby("quarter").lead_count.shift()[ (df.year == df.year.shift() + 1) ]

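This is close, as I get what I expect in some cases, but not in all of them. In some frames I see 0 where data for the same quarter of the previous year definitely exists. I am trying to do exactly what is seen here, but with each year broken into quarters.

Another thing I tried was combining plain Python with pandas. The idea is to loop over the years present in the frame and check the previous year to see whether the quarter exists. If it does, apply the pandas shift.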

qs = [1, 2, 3, 4]
for year in leads_df["year"].unique():
    df = leads_df[leads_df["year"] == year - 1]
    for q in qs:
        if q in df["quarter"]:
            leads_df["prev_year_lead_count"] = leads_df.groupby("quarter")["lead_count"].shift(+1)
            leads_df["prev_year_cost"] = leads_df.groupby("quarter")["cost"].shift(+1)
            leads_df["prev_year_ga_spent"] = leads_df.groupby("quarter")["ga_spent"].shift(+1)
            leads_df["prev_year_fb_spent"] = leads_df.groupby("quarter")["fb_spent"].shift(+1)
            leads_df["prev_year_monthly_package_cost"] = leads_df.groupby("quarter")[
                "monthly_package_cost"
            ].shift(+1)
            leads_df["prev_year_cpl"] = leads_df.groupby("quarter")["cpl"].shift(+1)
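
One thing worth checking in the loop above: `q in df["quarter"]` tests membership in the index of the Series, not in its values, so the check may not do what it appears to. A minimal sketch on made-up data (the frame and its index labels are assumptions for illustration, not the real data):

```python
import pandas as pd

# Hypothetical two-row frame; the index labels are 10 and 11 on purpose
demo = pd.DataFrame({"quarter": [1, 4]}, index=[10, 11])

# `in` on a Series looks at the index labels, not the stored values
in_index = 4 in demo["quarter"]            # 4 is not an index label
in_values = 4 in demo["quarter"].values    # 4 is one of the stored values
via_isin = demo["quarter"].isin([4]).any() # value test that stays in pandas

print(in_index, in_values, via_isin)
```

Even with the membership test corrected, the shift inside the loop is applied to the whole frame on every iteration, so the loop by itself would not handle missing quarters.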

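【Comments】:

  • The sample df you posted won't work for a solution, because the df has only one row. shift pulls a value from the previous or next row, so it needs more than one row to work.
  • Sorry, that was just an example of the data structure.
  • No worries. But if you can provide some more comprehensive sample data, it will be much easier to reproduce the problem and possibly offer a solution.
  • There you go! A bit more background: I will always pull 4 years of data. The problem is that clients come and go, so I will always be missing a quarter here and there.
  • Also, looking at the data above, since there is data for Q4 2019, I expect prevYearLeadCount for Q4 2020 to equal 236.

Tags: python pandas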


【Solution 1】:

Fix: check only the previous year

To handle gaps of more than one year, we need to trick the groupby. Here is how.

import pandas as pd

# d is the list of records shown in the question
df = pd.DataFrame(d)

# Difference in years between consecutive rows of the same quarter
# (assumes the rows are in chronological order within each quarter)
df['yeardiff'] = df.groupby('quarter')['year'].diff()

# True only when the previous row for that quarter is exactly one year earlier;
# gaps of 2 years or more, and the first row of each quarter, are False
cond = df['yeardiff'] == 1

# Shift within each quarter first, then mask with the condition.
# Filtering the rows before shifting could pair a row with a value
# from more than one year back, so the mask is applied afterwards.
# Rows where the condition is not met become NaN.
df['newprevYearLeadCount'] = df.groupby('quarter')['leadCount'].shift().where(cond)

print(df[['year', 'quarter', 'leadCount', 'prevYearLeadCount', 'newprevYearLeadCount']])

The result is as follows:

I removed the entry for Q1 2020, so Q1 2021 should be NaN.

    year  quarter  leadCount  prevYearLeadCount  newprevYearLeadCount
0   2017        2        151                0.0                   NaN
1   2018        2         73              151.0                 151.0
2   2018        3        271                0.0                   NaN
3   2018        4        173                0.0                   NaN
4   2019        1        209                0.0                   NaN
5   2019        2        274                0.0                  73.0
6   2019        3        311                0.0                 271.0
7   2019        4        236                0.0                 173.0
8   2020        2        430                0.0                 274.0
9   2020        3        907                0.0                 311.0
10  2020        4        657                0.0                 236.0
11  2021        1        609              245.0                   NaN  #prev year LeadCount ignored

Another example, this time with Q2 2019 excluded:

    year  quarter  leadCount  prevYearLeadCount  newprevYearLeadCount
0   2017        2        151                0.0                   NaN
1   2018        2         73              151.0                 151.0
2   2018        3        271                0.0                   NaN
3   2018        4        173                0.0                   NaN
4   2019        1        209                0.0                   NaN
5   2019        3        311                0.0                 271.0
6   2019        4        236                0.0                 173.0
7   2020        1        245              209.0                 209.0
8   2020        2        430                0.0                   NaN  #prev year LeadCount ignored
9   2020        3        907                0.0                 311.0
10  2020        4        657                0.0                 236.0
11  2021        1        609              245.0                 245.0
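
As an alternative to shifting, the same lookup can be sketched as a self-merge on (year + 1, quarter), which does not depend on row order. This is a different technique from the one in the answer, and the rows below are made up for illustration, with 2020 deliberately missing:

```python
import pandas as pd

# Made-up rows for a single quarter with a missing year (2020)
df = pd.DataFrame({
    "year":      [2018, 2019, 2021, 2022],
    "quarter":   [1, 1, 1, 1],
    "leadCount": [100, 209, 609, 650],
})

# Copy of the data relabelled so that each row describes "the year after"
prev = df[["year", "quarter", "leadCount"]].copy()
prev["year"] = prev["year"] + 1
prev = prev.rename(columns={"leadCount": "newprevYearLeadCount"})

# A left merge keeps every original row; rows without a match get NaN
out = df.merge(prev, on=["year", "quarter"], how="left")
print(out)
```

Here 2019 picks up the 2018 value, 2021 stays NaN because 2020 is absent, and 2022 picks up the 2021 value. If the frame holds several accounts, the account column (salesforceAccountId in the question) would presumably need to be added to the merge keys.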

Previous answer

You should be able to groupby(['quarter']) and then apply shift() to get the result.

import pandas as pd
df = pd.DataFrame(d)
#df.sort_values(by=['quarter','year'],inplace=True)
#df.reset_index(drop=True,inplace=True)
df['newprevYearLeadCount'] = df.groupby(['quarter'])['leadCount'].transform(lambda x:x.shift())
print (df[['year','quarter','leadCount','prevYearLeadCount', 'newprevYearLeadCount']])

The output will be:

    year  quarter  leadCount  prevYearLeadCount  newprevYearLeadCount
0   2017        2        151                0.0                   NaN
1   2018        2         73              151.0                 151.0
2   2018        3        271                0.0                   NaN
3   2018        4        173                0.0                   NaN
4   2019        1        209                0.0                   NaN
5   2019        2        274                0.0                  73.0
6   2019        3        311                0.0                 271.0
7   2019        4        236                0.0                 173.0
8   2020        1        245              209.0                 209.0
9   2020        2        430                0.0                 274.0
10  2020        3        907                0.0                 311.0
11  2020        4        657                0.0                 236.0
12  2021        1        609              245.0                 245.0

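Initially I was going to sort the values by quarter and then by year, but groupby keeps the rows of each group in their original order, and the data is already in chronological order. So all you need is the groupby. transform takes care of assigning the values back to each row.

If there is no record to take the previous year's lead count from, the value is set to NaN. You can apply fillna(0) so that it is replaced with 0.0.

If you need 0 instead of NaN, do the following: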

df['newprevYearLeadCount'] = df.groupby(['quarter'])['leadCount'].transform(lambda x:x.shift()).fillna(0)
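
As a side note, the lambda is not strictly needed: a grouped column has its own shift method, and in reasonably recent pandas versions it accepts a fill_value argument. A small sketch on made-up data (the frame below is an assumption for illustration):

```python
import pandas as pd

# Small made-up frame, already in chronological order
df = pd.DataFrame({
    "year":      [2018, 2018, 2019, 2019],
    "quarter":   [1, 2, 1, 2],
    "leadCount": [10, 20, 30, 40],
})

# The form used in the answer, and the shorter direct form
via_transform = df.groupby("quarter")["leadCount"].transform(lambda x: x.shift())
via_shift = df.groupby("quarter")["leadCount"].shift()

# fill_value replaces the NaN that shift introduces at the start of each group
filled = df.groupby("quarter")["leadCount"].shift(fill_value=0)
print(filled.tolist())
```

Both forms produce the same values. Neither of them checks for year gaps by itself.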

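【Comments】:

  • Any idea how to account for a missing year of data? If you have Q1 2019 and Q1 2021, you get the 2019 data as the previous year's data.
  • Hmm. Yes, it will pick the 2019 data as prevYear. Let me fix that.
  • Had to do some digging to get this right. @00robinette, please see my updated answer. I think this should solve the problem.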