Generating a retention cohort from a pandas dataframe: answers

【Question title】: Generating a retention cohort from a pandas dataframe
【Posted】: 2015-02-26 14:58:16
【Question description】:

I have a pandas dataframe that looks like this:

+-----------+------------------+---------------+------------+
| AccountID | RegistrationWeek | Weekly_Visits | Visit_Week |
+-----------+------------------+---------------+------------+
| ACC1      | 2015-01-25       |             0 | NaT        |
| ACC2      | 2015-01-11       |             0 | NaT        |
| ACC3      | 2015-01-18       |             0 | NaT        |
| ACC4      | 2014-12-21       |            14 | 2015-02-12 |
| ACC5      | 2014-12-21       |             5 | 2015-02-15 |
| ACC6      | 2014-12-21       |             0 | 2015-02-22 |
+-----------+------------------+---------------+------------+

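It is essentially a visit log, as it contains all the data needed to build a cohort analysis.

Each registration week is a cohort. To find out how many people are in each cohort I can use: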

visit_log.groupby('RegistrationWeek').AccountID.nunique()

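What I want to do is create a pivot table with the registration week as the key. The columns should be the visit weeks, and the values should be the count of unique account IDs that have more than 0 weekly visits.

Together with the total number of accounts in each cohort, I will then be able to show percentages instead of absolute values.

The finished product would look something like this: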

+-------------------+-------------+-------------+-------------+
| Registration Week | Visit_week1 | Visit_Week2 | Visit_week3 |
+-------------------+-------------+-------------+-------------+
| week1             | 70%         | 30%         | 20%         |
| week2             | 70%         | 30%         |             |
| week3             | 40%         |             |             |
+-------------------+-------------+-------------+-------------+

I tried pivoting the dataframe like this:

visit_log.pivot_table(index='RegistrationWeek', columns='Visit_Week')

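But I haven't worked out the values part. I need to somehow count the account IDs and then divide that count by the registration-week aggregation above.

I'm new to pandas, so if this isn't the best way to build retention cohorts, please enlighten me!

Thanks

【Question comments】:

Could you paste a sample of your DataFrame as a valid HTML table? That would allow others to read it into pandas in order to QA their answers to your question.
Try this link

Tags: python pandas data-analysis retention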

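【Solution 1】:

There are several aspects to your question.

What you can build with the data you have

There are several kinds of retention. For simplicity, we'll mention only two:

Day-N retention: if a user registered on day 0, did she log in on day N? (Logging in on day N+1 does not affect this metric.) To measure it, you need to keep track of all the logs of your users.
Rolling retention: if a user registered on day 0, did she log in on day N or on any day after that? (Logging in on day N+1 does affect this metric.) To measure it, you only need the last known log of each user.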
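To make the difference between the two definitions concrete, here is a minimal sketch on a toy login log. The `logs` frame, its column names and both helper functions are made up for illustration and are not part of the asker's data.

```python
import pandas as pd

# Hypothetical toy log: one row per (user, day with a login),
# where days are counted from that user's signup (day 0).
logs = pd.DataFrame({
    "user": ["a", "a", "a", "b", "b", "c"],
    "day":  [0,   3,   7,   0,   2,   0],
})

def day_n_retention(logs, n):
    # Share of users with a login exactly on day n (needs the full log).
    n_users = logs["user"].nunique()
    return logs.loc[logs["day"] == n, "user"].nunique() / n_users

def rolling_retention(logs, n):
    # Share of users whose last known login is on day n or later
    # (only needs the last log of each user).
    last_day = logs.groupby("user")["day"].max()
    return (last_day >= n).mean()

# Nobody logged in exactly on day 5, but user "a" came back on day 7.
print(day_n_retention(logs, 5))
print(rolling_retention(logs, 5))
```

User "a" counts towards rolling retention on day 5 because of the later login on day 7, while day-5 retention ignores that login.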

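If I understand your table correctly, you have two relevant variables to build your cohort table: the registration date and the last log (visit week). The number of weekly visits seems irrelevant.

So with this data you can only go with option 2, rolling retention.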
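As a rough sketch of how the asker's `visit_log` could be reduced to those two variables: the rows below are copied from the sample table in the question, while the name `users_from_log` and the choice to ignore `Weekly_Visits` are assumptions made here for illustration.

```python
import pandas as pd

# Sample rows copied from the table in the question.
visit_log = pd.DataFrame({
    "AccountID": ["ACC1", "ACC2", "ACC3", "ACC4", "ACC5", "ACC6"],
    "RegistrationWeek": pd.to_datetime(["2015-01-25", "2015-01-11", "2015-01-18",
                                        "2014-12-21", "2014-12-21", "2014-12-21"]),
    "Weekly_Visits": [0, 0, 0, 14, 5, 0],
    "Visit_Week": pd.to_datetime([None, None, None,
                                  "2015-02-12", "2015-02-15", "2015-02-22"]),
})

# One row per account: registration date and last known visit week.
# Accounts that never visited keep NaT as their last log.
users_from_log = (visit_log
                  .groupby("AccountID")
                  .agg(signup_date=("RegistrationWeek", "min"),
                       last_log=("Visit_Week", "max")))
print(users_from_log)
```

The resulting frame has the same two columns, `signup_date` and `last_log`, as the dummy `users` data used in the rest of this answer.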

How to build the table

First, let's build a dummy data set so that we have enough to work on and you can reproduce it:

import datetime as dt
import math

import numpy as np
import pandas as pd
from dateutil.relativedelta import relativedelta

np.random.seed(0) # so that we all have the same results

def random_date(start, end, p=None):
    # Return a date randomly chosen between two dates
    if p is None:
        p = np.random.random()
    return start + dt.timedelta(seconds=math.ceil(p * (end - start).days*24*3600))

n_samples = 1000 # How many users do we want?
index = range(1, n_samples+1)

# A range of signup dates, say, one year.
end = dt.datetime.today()
start = end - relativedelta(years=1)

# Create the dataframe
users = pd.DataFrame(np.random.rand(n_samples),
                     index=index, columns=['signup_date'])
users['signup_date'] = users['signup_date'].apply(lambda x: random_date(start, end, x))
# Last logs randomly distributed within 10 weeks of signing up, so that we can see the retention drop in our table
users['last_log'] = users['signup_date'].apply(lambda x: random_date(x, x + relativedelta(weeks=10)))

So now we should have something that looks like this:

users.head()

Here is some code to build the cohorts:

### Some useful functions
def add_weeks(sourcedate,weeks):
    return sourcedate + dt.timedelta(days=7*weeks)

def first_day_of_week(sourcedate):
    return sourcedate - dt.timedelta(days = sourcedate.weekday())

def last_day_of_week(sourcedate):
    return sourcedate + dt.timedelta(days=(6 - sourcedate.weekday()))  

def retained_in_interval(users,signup_week,n_weeks,end_date):
    '''
        For a given list of users, returns the number of users 
        that signed up in the week of signup_week (the cohort)
        and that are retained after n_weeks
        end_date is just here to control that we do not un-necessarily fill the bottom right of the table
    '''
    # Define the span of the given week
    cohort_start       = first_day_of_week(signup_week)
    cohort_end         = last_day_of_week(signup_week)
    if n_weeks == 0:
        # If this is our first week, we just take the number of users that signed up on the given period of time
        return len( users[(users['signup_date'] >= cohort_start) 
                        & (users['signup_date'] <= cohort_end)])
    elif pd.to_datetime(add_weeks(cohort_end,n_weeks)) > pd.to_datetime(end_date) :
        # If adding n_weeks brings us later than the end date of the table (the bottom right of the table),
        # we return an easily recognizable value (not 0, as it would cause confusion)
        return float("Inf")
    else:
        # Otherwise, we count the number of users that signed up in the given period of time,
        # and whose last known log was at least n_weeks after their signup (rolling retention)
        return len( users[(users['signup_date'] >= cohort_start) 
                        & (users['signup_date'] <= cohort_end)
                        & (pd.to_datetime(users['last_log']) >= pd.to_datetime(users['signup_date'].map(lambda x: add_weeks(x,n_weeks))))
                        ])

With that in place, we can write the actual function:

def cohort_table(users,cohort_number=6,period_number=6,cohort_span='W',end_date=None):
    '''
        For a given dataframe of users, return a cohort table with the following parameters :
        cohort_number : the number of lines of the table
        period_number : the number of columns of the table
        cohort_span : the span of every period of time between the cohort (D, W, M)
        end_date = the date after which we stop counting the users
    '''
    # the last column of the table will end today :
    if end_date is None:
        end_date = dt.datetime.today()
    # The index of the dataframe will be a list of dates ranging
    dates = pd.date_range(add_weeks(end_date,-cohort_number), periods=cohort_number, freq=cohort_span)

    cohort = pd.DataFrame(columns=['Sign up'])
    cohort['Sign up'] = dates
    # We will compute the number of retained users, column-by-column
    #      (There probably is a more Pythonic way of doing it)
    range_dates = range(0,period_number+1)
    for p in range_dates:
        # Name of the column
        s_p = 'Week '+str(p)
        cohort[s_p] = cohort.apply(lambda row: retained_in_interval(users,row['Sign up'],p,end_date), axis=1)

    cohort = cohort.set_index('Sign up')        
    # absolute values to percentage by dividing by the value of week 0 :
    cohort = cohort.astype('float').div(cohort['Week 0'].astype('float'),axis='index')
    return cohort

Now you can call it and see the results:

cohort_table(users)
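The column-by-column loop in `cohort_table` can probably be vectorized. The following is only a sketch of one way to do it, assuming `signup_date` and `last_log` are datetime columns as in the dummy data. The function name `rolling_cohort_table` and the toy frame are hypothetical, and unlike `cohort_table` this version does not mark cells that fall after the end date.

```python
import pandas as pd

def rolling_cohort_table(users, n_periods=6):
    """Rolling retention per weekly signup cohort, as fractions of the cohort size."""
    df = users.dropna(subset=["signup_date"])
    # Cohort label: the Monday that starts the signup week.
    cohort = df["signup_date"].dt.to_period("W").dt.start_time
    # Whole weeks between signup and last known log;
    # a missing last_log counts as 0 weeks retained.
    weeks_kept = ((df["last_log"] - df["signup_date"]).dt.days // 7).fillna(0).clip(lower=0)
    columns = {
        f"Week {n}": (weeks_kept >= n).groupby(cohort).mean()
        for n in range(n_periods + 1)
    }
    return pd.DataFrame(columns)

# Toy data (made up): two signups in the week of 2015-01-05, one in the next week.
toy_users = pd.DataFrame({
    "signup_date": pd.to_datetime(["2015-01-05", "2015-01-06", "2015-01-12"]),
    "last_log":    pd.to_datetime(["2015-01-05", "2015-01-27", "2015-02-02"]),
})
toy_table = rolling_cohort_table(toy_users, n_periods=4)
print(toy_table)
```

Each column is the mean of a boolean mask per cohort, so the values are already fractions and no separate division by the week 0 column is needed.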

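Hope this helps

【Comments】:

【Solution 2】:

Using the same format for the users data as in rom_j's answer, this is cleaner and faster, but it only works if there is at least one signup and one churn every week. For large enough data, that is not a terrible assumption.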

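import numpy as np
import pandas as pd

# Label every date with its ISO year and ISO week number
weeks = users[['signup_date', 'last_log']].apply(lambda col: col.dt.strftime('%G-%V'))
# Rows: signup week, columns: week of the last log, values: number of users
tab = pd.crosstab(weeks['signup_date'], weeks['last_log'])
totals = tab.sum(axis=1)
# Users still retained after each week = cohort size minus cumulative churn
retention_counts = ((tab.cumsum(axis=1) * -1)
                    .replace(0, np.nan)
                    .add(totals, axis=0)
                   )
retention = retention_counts.div(totals, axis=0)

# Shift every row to the left so that the first column is the first week with churn
realigned = [retention.loc[a].dropna().values for a in retention.index]
realigned_retention = pd.DataFrame(realigned, index=retention.index)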
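To see what the crosstab and the cumulative sum are doing, here is a toy run of the same steps on five made-up users. The week labels "W1" to "W3" and the `toy_` names are hypothetical and only serve to keep the numbers small.

```python
import numpy as np
import pandas as pd

# Signup week and week of the last log for five made-up users.
toy = pd.DataFrame({
    "signup_date": ["W1", "W1", "W1", "W2", "W2"],
    "last_log":    ["W1", "W2", "W3", "W2", "W3"],
})

# Count users per (signup week, last-log week).
toy_tab = pd.crosstab(toy["signup_date"], toy["last_log"])
toy_totals = toy_tab.sum(axis=1)

# Cumulative churn per cohort, subtracted from the cohort size.
# Zeros (weeks without any churn so far) become NaN, as in the code above.
toy_counts = ((toy_tab.cumsum(axis=1) * -1)
              .replace(0, np.nan)
              .add(toy_totals, axis=0))
toy_retention = toy_counts.div(toy_totals, axis=0)

# Drop the leading NaN cells so every cohort starts in the first column.
toy_realigned = pd.DataFrame(
    [toy_retention.loc[a].dropna().tolist() for a in toy_retention.index],
    index=toy_retention.index,
)
print(toy_retention)
print(toy_realigned)
```

The W2 cohort has no churn in W1, so that cell becomes NaN and is dropped during realignment. The same thing happens to any week without churn, which is why the approach relies on at least one churn per week.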

【Comments】: