xingzheai

本文首发于:行者AI

关于用户留存是各大数据分析平台必不可少的功能,企业一般用留存率衡量用户的活跃情况,也是能直接反应产品功能价值的直接指标,留存率是衡量用户质量的最重要指标之一,因此计算各种留存率是数据分析取数的最底层的基本功。所以下面举几个用户留存分析的实战例子。

1. 准备

了解目前留存率几种常规计算方法、了解ClickHouse提供retention(cond1, cond2, …)函数计算留存率

建表:用户基本信息表:login_event

CREATE TABLE login_event --用户登录事件
(
    `accountId` String COMMENT \'账号的ID\', --用户唯一ID
    `ds` Date COMMENT \'日期\' --用户登录日期
)
ENGINE = MergeTree
PARTITION BY accountId
ORDER BY accountId

导数:插入8月份用户登录数据

--插入数据
insert into login_event values (10001,toDate(\'2020-08-01\'), (10001,toDate(\'2020-08-08\')), (10001,toDate(\'2020-08-09\')), (10001,toDate(\'2020-08-10\')), (10001,toDate(\'2020-08-12\')),
(10001,toDate(\'2020-08-13\')), (10001,toDate(\'2020-08-14\')), (10001,toDate(\'2020-08-15\')), (10001,toDate(\'2020-08-16\')), (10001,toDate(\'2020-08-17\')), (10001,toDate(\'2020-08-18\')),
(10001,toDate(\'2020-08-20\')), (10001,toDate(\'2020-08-22\')), (10001,toDate(\'2020-08-23\')), (10001,toDate(\'2020-08-24\')), (10002,toDate(\'2020-08-20\')), (10002,toDate(\'2020-08-22\')), (10002,toDate(\'2020-08-23\')), (10002,toDate(\'2020-08-01\')), (10002,toDate(\'2020-08-11\')), (10002,toDate(\'2020-08-12\')), (10002,toDate(\'2020-08-13\')), (10002,toDate(\'2020-08-20\')),
(10002,toDate(\'2020-08-15\')), (10002,toDate(\'2020-08-30\')), (10002,toDate(\'2020-08-20\')), (10002,toDate(\'2020-08-01\')), (10002,toDate(\'2020-08-06\')), (10002,toDate(\'2020-08-24\')), (10003,toDate(\'2020-08-05\')), (10003,toDate(\'2020-08-08\')), (10003,toDate(\'2020-08-09\')), (10003,toDate(\'2020-08-10\')), (10003,toDate(\'2020-08-11\')), (10003,toDate(\'2020-08-13\')),
(10003,toDate(\'2020-08-15\')), (10003,toDate(\'2020-08-16\')), (10003,toDate(\'2020-08-18\')), (10003,toDate(\'2020-08-20\')), (10003,toDate(\'2020-08-01\')), (10003,toDate(\'2020-08-21\')),
(10003,toDate(\'2020-08-22\')), (10003,toDate(\'2020-08-24\')), (10003,toDate(\'2020-08-26\')), (10003,toDate(\'2020-08-25\')), (10003,toDate(\'2020-08-27\')), (10003,toDate(\'2020-08-28\')),
(10003,toDate(\'2020-08-29\')), (10003,toDate(\'2020-08-30\')), (10004,toDate(\'2020-08-01\')), (10004,toDate(\'2020-08-02\')), (10004,toDate(\'2020-08-03\')), (10004,toDate(\'2020-08-04\')),
(10004,toDate(\'2020-08-05\')), (10004,toDate(\'2020-08-08\')), (10004,toDate(\'2020-08-09\')), (10004,toDate(\'2020-08-10\')), (10004,toDate(\'2020-08-11\')), (10004,toDate(\'2020-08-14\')),
(10004,toDate(\'2020-08-15\')), (10004,toDate(\'2020-08-16\')), (10004,toDate(\'2020-08-17\')), (10004,toDate(\'2020-08-19\')), (10004,toDate(\'2020-08-20\')), (10004,toDate(\'2020-08-21\')),
(10004,toDate(\'2020-08-22\')), (10004,toDate(\'2020-08-23\')), (10004,toDate(\'2020-08-24\')), (10004,toDate(\'2020-08-23\')),(10004,toDate(\'2020-08-23\')), (10004,toDate(\'2020-08-25\')),
(10004,toDate(\'2020-08-27\')), (10004,toDate(\'2020-08-30\'));

2. 题目分析

计算某日活跃用户的次留、3留、7留、14留、30留,我们将问题解决分为三个步骤:

  • 找到某日活跃用户

  • 找到某日活跃用户在第2、3、6、13、29日的登录情况

  • 计算某日活跃用户在第2、3、6、13、29日登录数,计算N日留存率

解决方法一:


--计算出2020-08-01活跃用户在第2、3、6、13、29日的留存数,计算出留存率
SELECT
    ds,
    count(accountIdD0) AS activeAccountNum,
    count(accountIdD1) / count(accountIdD0) AS `次留`,
    count(accountIdD3) / count(accountIdD0) AS `3留`,
    count(accountIdD7) / count(accountIdD0) AS `7留`,
    count(accountIdD14) / count(accountIdD0) AS `14留`,
    count(accountIdD30) / count(accountIdD0) AS `30留`
FROM
( --使用LEFT JOIN 找到2020-08-01当日活跃用户在第2、3、6、13、29日的登录用户
    SELECT DISTINCT
        a.ds AS ds,
        a.accountIdD0 AS accountIdD0,
        IF(b.accountId = \'\', NULL, b.accountId) AS accountIdD1,
        IF(c.accountId = \'\', NULL, c.accountId) AS accountIdD3,
        IF(d.accountId = \'\', NULL, d.accountId) AS accountIdD7,
        IF(e.accountId = \'\', NULL, e.accountId) AS accountIdD14,
        IF(f.accountId = \'\', NULL, f.accountId) AS accountIdD30
    FROM
    (--找出2020-08-01当日活跃用户
        SELECT DISTINCT
            ds,
            accountId AS accountIdD0
        FROM login_event
        WHERE ds = \'2020-08-01\'
        ORDER BY ds ASC
    ) AS a
    LEFT JOIN test.login3_event AS b ON (b.ds = addDays(a.ds, 1)) AND (a.accountIdD0 = b.accountId)
    LEFT JOIN test.login3_event AS c ON (c.ds = addDays(a.ds, 2)) AND (a.accountIdD0 = c.accountId)
    LEFT JOIN test.login3_event AS d ON (d.ds = addDays(a.ds, 6)) AND (a.accountIdD0 = d.accountId)
    LEFT JOIN test.login3_event AS e ON (e.ds = addDays(a.ds, 13)) AND (a.accountIdD0 = e.accountId)
    LEFT JOIN test.login3_event AS f ON (f.ds = addDays(a.ds, 29)) AND (a.accountIdD0 = f.accountId)
) AS temp
GROUP BY ds

结果:
-----------------------------------------
┌─────────ds─┬─activeAccountNum─┬─次留─┬──3留─┬─7留─┬─14留─┬─30留─┐
│ 2020-08-01 │                4 │ 0.25 │ 0.25 │   0 │  0.5 │ 0.75 │
└────────────┴──────────────────┴──────┴──────┴─────┴──────┴──────┘

1 rows in set. Elapsed: 0.022 sec.

解决方法二:


--判断2020-08-01活跃用户在第2、3、6、13、29日的留存数,计算出留存率,计算出留存率
SELECT DISTINCT
    b.ds AS ds,
    ifnull(countDistinct(if(a.ds = b.ds, a.accountId, NULL)), 0) AS activeAccountNum,
    ifnull(countDistinct(if(a.ds = addDays(b.ds, 1), b.accountId, NULL)) / activeAccountNum, 0) AS `次留`,
    ifnull(countDistinct(if(a.ds = addDays(b.ds, 2), b.accountId, NULL)) / activeAccountNum, 0) AS `3留`,
    ifnull(countDistinct(if(a.ds = addDays(b.ds, 6), b.accountId, NULL)) / activeAccountNum, 0) AS `7留`,
    ifnull(countDistinct(if(a.ds = addDays(b.ds, 13), b.accountId, NULL)) / activeAccountNum, 0) AS `14留`,
    ifnull(countDistinct(if(a.ds = addDays(b.ds, 29), b.accountId, NULL)) / activeAccountNum, 0) AS `30留`
FROM
  --使用INNER JOIN找出2020-08-01活跃用户在后续1~30日登录情况
(
    SELECT
        ds,
        accountId
    FROM login_event
    WHERE (ds <= addDays(toDate(\'2020-08-01\'), 29)) AND (ds >= \'2020-08-01\')
) AS a
INNER JOIN
--找出2020-08-01当日活跃用户
(
    SELECT DISTINCT
        accountId,
        ds
    FROM test.login3_event
    WHERE ds = \'2020-08-01\'
) AS b ON a.accountId = b.accountId
GROUP BY ds

结果:
-----------------------------------------
┌─────────ds─┬─activeAccountNum─┬─次留─┬──3留─┬─7留─┬─14留─┬─30留─┐
│ 2020-08-01 │                4 │ 0.25 │ 0.25 │   0 │  0.5 │ 0.75 │
└────────────┴──────────────────┴──────┴──────┴─────┴──────┴──────┘

1 rows in set. Elapsed: 0.019 sec.

解决方法三:


--根据数组下标SUM(r[index])获取2020-08-01活跃用户在第2、3、6、13、29日的留存数,计算出留存率
SELECT
    toDate(\'2020-08-01\') AS ds,
    SUM(r[1]) AS activeAccountNum,
    SUM(r[2]) / SUM(r[1]) AS `次留`,
    SUM(r[3]) / SUM(r[1]) AS `3留`,
    SUM(r[4]) / SUM(r[1]) AS `7留`,
    SUM(r[5]) / SUM(r[1]) AS `14留`,
    SUM(r[6]) / SUM(r[1]) AS `30留`   
FROM
--找到2020-08-01活跃用户在第2、3、6、13、29日的登录情况,1/0 => 登录/未登录
(
    WITH toDate(\'2020-08-01\') AS tt   
SELECT
    accountId,
    retention(
      toDate(ds) = tt, 
      toDate(subtractDays(ds, 1)) = tt, 
      toDate(subtractDays(ds, 2)) = tt, 
      toDate(subtractDays(ds, 6)) = tt,
      toDate(subtractDays(ds, 13)) = tt,
      toDate(subtractDays(ds, 29)) = tt
    ) AS r
  --找出2020-08-01活跃用户在后续1~30日登录数据
FROM login_event
WHERE (ds >= \'2020-08-01\') AND (ds <= addDays(toDate(\'2020-08-01\'), 29))
GROUP BY accountId
)
GROUP BY ds


结果:
-----------------------------------------
┌─────────ds─┬─activeAccountNum─┬─次留─┬──3留─┬─7留─┬─14留─┬─30留─┐
│ 2020-08-01 │                4 │ 0.25 │ 0.25 │   0 │  0.5 │ 0.75 │
└────────────┴──────────────────┴──────┴──────┴─────┴──────┴──────┘

1 rows in set. Elapsed: 0.009 sec.

3. 总结

  • 方法一,使用传统做法多表关联,了解ClickHouse的程序猿都清楚,多表关联是ClickHouse天敌,运行速度相对很慢。

  • 方法二,使用一个表关联,通过IF函数判断日期差值,找到所需日期用户数据,相对方法一减少了多表关联,提高了运行速度。

  • 方法三,使用ClickHouse自带retention函数,retention function是ClickHouse中高级聚合函数,该函数可以接受多个条件,以第一个条件结果为基准,后面各条件满足为1,不满足则为0,最后返回一个1和0组成的数组。通过统计数组中对应1的数量,既可计算出留存率。

三种计算方法比较而言,在海量的数据集下使用ClickHouse自带retention留存函数运行速度更快、更高效。提升了现有技术中用户留存率的计算方式速度慢效率低的问题,进而达到了提高计算速度和计算效率的效果。


PS:更多技术干货,快关注【公众号 | xingzhe_ai】,与行者一起讨论吧!

分类:

技术点:

相关文章:

  • 2021-12-04
  • 2021-03-30
  • 2021-10-30
  • 2021-10-23
  • 2021-09-21
  • 2021-12-01
  • 2021-11-16
  • 2021-11-23
猜你喜欢
  • 2021-12-15
  • 2021-11-13
  • 2021-09-19
  • 2021-11-23
  • 2021-10-08
  • 2021-09-08
  • 2021-10-09
相关资源
相似解决方案