Rolling counts based on rolling cohorts: answers

【Question title】: Rolling counts based on rolling cohorts
【Posted】: 2016-11-19 15:23:55
【Question】:

Using Postgres 9.5. Test data:

create temp table rental (
    customer_id smallint
    ,rental_date timestamp without time zone
    ,customer_name text
);

insert into rental values
    (1, '2006-05-01', 'james'),
    (1, '2006-06-01', 'james'),
    (1, '2006-07-01', 'james'),
    (1, '2006-07-02', 'james'),
    (2, '2006-05-02', 'jacinta'),
    (2, '2006-05-03', 'jacinta'),
    (3, '2006-05-04', 'juliet'),
    (3, '2006-07-01', 'juliet'),
    (4, '2006-05-03', 'julia'),
    (4, '2006-06-01', 'julia'),
    (5, '2006-05-05', 'john'),
    (5, '2006-06-01', 'john'),
    (5, '2006-07-01', 'john'),
    (6, '2006-07-01', 'jacob'),
    (7, '2006-07-02', 'jasmine'),
    (7, '2006-07-04', 'jasmine');

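I am trying to understand the behaviour of existing customers. The question I want to answer is:

How likely is a customer to order again, depending on when they last ordered (the current month, the previous month (m-1), ... back to m-12)?

The likelihood is calculated as: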

distinct count of people who ordered in current month /
distinct count of people in their cohort.

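So I need to generate a table that lists, for each cohort, how many of its members ordered in the current month.

So what are the rules for belonging to a cohort?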

- current month cohort: >1 order in month OR (1 order in month given no previous orders)
- m-1 cohort: <=1 order in current month and >=1 order in m-1
- m-2 cohort: <=1 order in current month and 0 orders in m-1 and >=1 order in m-2
- etc

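I am using the DVD Store database as sample data to develop the query: http://linux.dell.com/dvdstore/

Here is an example of the cohort rules and aggregations, taking July as the "month's orders being analysed" (note: the "month's orders being analysed" column is the first column in the "desired output" table below):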

customer_id | jul-16| jun-16| may-16|
------------|-------|-------|-------|
james       | 1  1  | 1     | 1     | <- member of jul cohort, made order in jul
jasmine     | 1  1  |       |       | <- member of jul cohort, made order in jul
jacob       | 1     |       |       | <- member of jul cohort, did NOT make order in jul
john        | 1     | 1     | 1     | <- member of jun cohort, made order in jul
julia       |       | 1     | 1     | <- member of jun cohort, did NOT make order in jul
juliet      | 1     |       | 1     | <- member of may cohort, made order in jul
jacinta     |       |       | 1 1   | <- member of may cohort, did NOT make order in jul

This data would produce the following table (the desired output):

--where m = month's orders being analysed

month's orders |how many people |how many people from  |how many people   |how many people from    |how many people   |how many people from    |
being analysed |are in cohort m |cohort m ordered in m |are in cohort m-1 |cohort m-1 ordered in m |are in cohort m-2 |cohort m-2 ordered in m |...m-12
---------------|----------------|----------------------|------------------|------------------------|------------------|------------------------|
may-16         |5               |1                     |                  |                        |                  |                        |
jun-16         |                |                      |5                 |3                       |                  |                        |
jul-16         |3               |2                     |2                 |1                       |2                 |1                       |
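
To make the rules concrete, here is a minimal plain-Python sketch (not the Postgres query being asked for) that applies them to the test data above. The helper names `month_index`, `classify` and `cohort_table` are made up for this illustration. One assumption is read off the July example rather than stated in the rules: a member of the current-month cohort only counts as "ordered in m" with more than one order in that month, which is why jacob is in the July cohort but not counted.

```python
from collections import Counter, defaultdict
from datetime import date

# Rental dates from the test data above, keyed by customer name.
RENTALS = {
    "james":   ["2006-05-01", "2006-06-01", "2006-07-01", "2006-07-02"],
    "jacinta": ["2006-05-02", "2006-05-03"],
    "juliet":  ["2006-05-04", "2006-07-01"],
    "julia":   ["2006-05-03", "2006-06-01"],
    "john":    ["2006-05-05", "2006-06-01", "2006-07-01"],
    "jacob":   ["2006-07-01"],
    "jasmine": ["2006-07-02", "2006-07-04"],
}


def month_index(iso_date):
    """Map a date to a running month number (year * 12 + month)."""
    d = date.fromisoformat(iso_date)
    return d.year * 12 + d.month


def classify(counts, m):
    """Return (cohort_offset, ordered_in_m) for month m, or None if the
    customer has no orders up to and including m."""
    in_m = counts.get(m, 0)
    earlier = [k for k in counts if k < m]
    if in_m > 1 or (in_m == 1 and not earlier):
        # Current-month cohort. Assumption: "ordered in m" here means a
        # repeat order within m (see the jacob row in the example).
        return 0, in_m > 1
    if not earlier:
        return None
    # Otherwise the cohort is the most recent earlier month with an order.
    return m - max(earlier), in_m == 1


def cohort_table(rentals, max_offset=12):
    """Per month: {cohort offset: (cohort size, members who ordered in m)}."""
    per_customer = {
        name: Counter(month_index(d) for d in dates)
        for name, dates in rentals.items()
    }
    months = [k for counts in per_customer.values() for k in counts]
    table = {}
    for m in range(min(months), max(months) + 1):
        row = defaultdict(lambda: [0, 0])
        for counts in per_customer.values():
            result = classify(counts, m)
            if result is None or result[0] > max_offset:
                continue
            offset, ordered = result
            row[offset][0] += 1
            row[offset][1] += ordered
        table[m] = {k: tuple(v) for k, v in sorted(row.items())}
    return table


table = cohort_table(RENTALS)
for m, row in table.items():
    print(f"{(m - 1) // 12}-{(m - 1) % 12 + 1:02d}", row)
```

On the test data this gives the same numbers as the three rows of the table above: 5 and 1 for cohort m in May, 5 and 3 for cohort m-1 in June, and 3/2, 2/1, 2/1 for cohorts m, m-1 and m-2 in July.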

My attempts so far have been variations on:

generate_series()

and

row_number() over (partition by customer_id order by rental_id desc)

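I have not been able to put everything together (I have been trying for hours and still have not solved it).

For readability, I think it is better to post my work in parts (if anyone wants the full SQL query, please comment and I will add it).

Series query: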

(select
    generate_series(date_trunc('month',min(rental_date)),date_trunc('month',max(rental_date)),'1 month') as month_being_analysed
from
    rental) as series

Ranking query:

(select
    *,
    row_number() over (partition by customer_id order by rental_id desc) as rnk
from
    rental
where
    date_trunc('month',rental_date) <= series.month_being_analysed) as orders_ranked

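I would like to do something like this: run the orders_ranked query for every row returned by the series query, then aggregate over each result of orders_ranked.

Something like: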

(--this query counts the customers in cohort m-1
select
    count(distinct customer_id)
from
    (--this query ranks the orders that have occured <= to the date in the row of the 'series' table
    select
        *,
        row_number() over (partition by customer_id order by rental_id desc) as rnk
    from
        rental
    where
        date_trunc('month',rental_date)<=series.month_being_analysed) as orders_ranked
where
    (rnk=1 between series.month_being_analysed - interval '2 months' and series.month_being_analysed - interval '1 months')
    OR
    (rnk=2 between series.month_being_analysed - interval '2 months' and series.month_being_analysed - interval '1 months')
) as people_2nd_last_booking_in_m_1,


(--this query counts the customers in cohort m-1 who ordered in month m
select
    count(distinct customer_id)
from
    (--this query returns the orders by customers in cohort m-1
    select
        count(distinct customer_id)
    from
        (--this query ranks the orders that have occured <= to the date in the row of the 'series' table
        select
            *,
            row_number() over (partition by customer_id order by rental_id desc) as rnk
        from
            rental
        where
            date_trunc('month',rental_date)<=series.month_being_analysed) as orders_ranked
    where
        (rnk=1 between series.month_being_analysed - interval '2 months' and series.month_being_analysed - interval '1 months')
        OR
        (rnk=2 between series.month_being_analysed - interval '2 months' and series.month_being_analysed - interval '1 months')
where
    rnk=1 in series.month_being_analysed
) as people_who_booked_in_m_whose_2nd_last_booking_was_in_m_1,
...
from
    (select
        generate_series(date_trunc('month',min(rental_date)),date_trunc('month',max(rental_date)),'1 month') as month_being_analysed
    from
        rental) as series

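【Comments】:

Please provide the exact table definition, with data types and all constraints for the relevant columns rental_id, customer_id and rental_date. Ideally a valid CREATE TABLE statement. (I am not going to extract it from the linked archive myself.) And, as always, your Postgres version.
Also: current month cohort: >1 order in month. I suspect that should be >= in all instances of your definition? Please clarify. And: cardinalities? How many orders per customer and per month: min, max, avg?
@ErwinBrandstetter I have added test data (the .tar file of the dvdrental db and the relevant commands). I tried to dump only the tables but ran into problems; hopefully what I have added is enough. You are right, the cohorts should be >=. I have added a table showing the cohort rules in action, which hopefully clarifies things. I have added the Postgres version at the top: 9.5. In terms of data volume: millions of rows. Customers per month: hundreds of thousands; average orders per month
Also, are you on codementor? :-)
Changed the test data from a .tar file to insert commands.

Tags: sql postgresql crosstab window-functions generate-series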

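【Answer 1】:

This query does it all. It works on the whole table and for any time frame.

It is based on some assumptions, including a current Postgres version, 9.5. It should work with at least pg 9.1. Since your definition of "cohort" is unclear to me, I skipped the "how many people are in cohort" columns.

I would expect it to be faster than anything you have tried so far. By orders of magnitude.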

SELECT *
FROM   crosstab (
   $$
   SELECT mon
        , sum(count(*)) OVER (PARTITION BY mon)::int AS m0
        , gap   -- count of months since last order
        , count(*) AS gap_ct
   FROM  (
      SELECT mon
           , mon_int - lag(mon_int) OVER (PARTITION BY c_id ORDER BY mon_int) AS gap
      FROM  (
         SELECT DISTINCT ON (1,2)
                date_trunc('month', rental_date)::date AS mon
              , customer_id                            AS c_id
              , extract(YEAR  FROM rental_date)::int * 12
              + extract(MONTH FROM rental_date)::int   AS mon_int
         FROM   rental
         ) dist_customer
      ) gap_to_last_month
   GROUP  BY mon, gap
   ORDER  BY mon, gap
   $$
 , 'SELECT generate_series(1,12)'
   ) ct (mon date, m0 int
       , m01 int, m02 int, m03 int, m04 int, m05 int, m06 int
       , m07 int, m08 int, m09 int, m10 int, m11 int, m12 int);

Result:

    mon     | m0 | m01 | m02 | m03 | m04 | m05 | m06 | m07 | m08 | m09 | m10 | m11 | m12
------------+----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----
 2015-01-01 | 63 |  36 |  15 |   5 |   3 |   3 |     |     |     |     |     |     |
 2015-02-01 | 56 |  35 |   9 |   9 |   2 |     |   1 |     |     |     |     |     |
...

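m0 .. customers with >= 1 order this month
m01 .. customers with >= 1 order this month and >= 1 order 1 month before (no orders in between)
m02 .. customers with >= 1 order this month and >= 1 order 2 months before and no orders in between
etc.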

How?

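In subquery dist_customer, reduce to one row per month and customer_id (mon, c_id) with DISTINCT ON:
- Select first row in each GROUP BY group?
To simplify later calculations, add a running count of months for the date (mon_int). Related:
- How do you do date math that ignores the year?
If there are many orders per (month, customer), there are faster query techniques for the first step:
- Optimize GROUP BY query to retrieve latest record per user
In subquery gap_to_last_month, add the column gap, indicating the number of months between this month and the last month with any orders from the same customer. Use the window function lag() for this. Related:
- PostgreSQL window function: partition by comparison
In the outer SELECT, aggregate per (mon, gap) to get the counts you are after. In addition, get the total count of distinct customers for this month, m0.
Feed this query to crosstab() to pivot the result into the desired tabular form. Basics:
- PostgreSQL Crosstab Query
About the "extra" column m0:
- Pivot on Multiple Columns using Tablefunc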
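
The steps above can also be traced outside the database. The following is a rough plain-Python sketch of the same logic, run on the test data from the question; it is only meant to show what each subquery contributes and is not a replacement for the SQL. The function names `month_int`, `gap_counts` and `pivot` are invented for this illustration.

```python
from collections import Counter
from datetime import date

# (customer_id, rental_date) pairs from the question's test data.
RENTALS = [
    (1, "2006-05-01"), (1, "2006-06-01"), (1, "2006-07-01"), (1, "2006-07-02"),
    (2, "2006-05-02"), (2, "2006-05-03"),
    (3, "2006-05-04"), (3, "2006-07-01"),
    (4, "2006-05-03"), (4, "2006-06-01"),
    (5, "2006-05-05"), (5, "2006-06-01"), (5, "2006-07-01"),
    (6, "2006-07-01"),
    (7, "2006-07-02"), (7, "2006-07-04"),
]


def month_int(iso_date):
    """Running month number, so gaps across a year boundary stay simple."""
    d = date.fromisoformat(iso_date)
    return d.year * 12 + d.month


def gap_counts(rentals):
    # Step 1 (dist_customer): one entry per (customer, month).
    active = sorted({(c_id, month_int(d)) for c_id, d in rentals})

    # Step 2 (gap_to_last_month): months since the same customer's previous
    # active month, like lag() over (partition by c_id order by mon_int).
    m0 = Counter()    # month -> distinct customers with an order
    gaps = Counter()  # (month, gap) -> customers
    prev = {}         # customer -> previous active month
    for c_id, mon in active:
        m0[mon] += 1
        if c_id in prev:
            gaps[(mon, mon - prev[c_id])] += 1
        prev[c_id] = mon
    return m0, gaps


def pivot(m0, gaps, max_gap=12):
    # Step 3 (crosstab): one row per month, m0 followed by one column per
    # gap 1..max_gap; None where no customer has that gap.
    return {
        mon: [m0[mon]] + [gaps.get((mon, g)) for g in range(1, max_gap + 1)]
        for mon in sorted(m0)
    }


m0, gaps = gap_counts(RENTALS)
rows = pivot(m0, gaps)
for mon, row in rows.items():
    print(f"{(mon - 1) // 12}-{(mon - 1) % 12 + 1:02d}", row)
```

For the test data the July row comes out as m0 = 5, m01 = 2 (customers 1 and 5) and m02 = 1 (customer 3), with the remaining columns empty. A customer's first active month has no gap, which corresponds to the NULL gap group that crosstab() leaves out of the m01 .. m12 columns.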

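【Comments】:

Thanks for posting this! I have spent the past few hours going through the links and will keep going through them over the next few days to make sure I understand each one. I have updated my question, which should help make things clearer. As it stands, what has been posted does not answer the question. If you have time, it would be fantastic if you could revisit my question with the updates I have made. Either way, I hope that once I understand your query I can use it as a base. Thanks again!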