【Question title】: Rolling counts based on rolling cohorts
【Posted】: 2016-11-19 15:23:55
【Question】:

Using Postgres 9.5. Test data:

create temp table rental (
    customer_id smallint
    ,rental_date timestamp without time zone
    ,customer_name text
);

insert into rental values
    (1, '2006-05-01', 'james'),
    (1, '2006-06-01', 'james'),
    (1, '2006-07-01', 'james'),
    (1, '2006-07-02', 'james'),
    (2, '2006-05-02', 'jacinta'),
    (2, '2006-05-03', 'jacinta'),
    (3, '2006-05-04', 'juliet'),
    (3, '2006-07-01', 'juliet'),
    (4, '2006-05-03', 'julia'),
    (4, '2006-06-01', 'julia'),
    (5, '2006-05-05', 'john'),
    (5, '2006-06-01', 'john'),
    (5, '2006-07-01', 'john'),
    (6, '2006-07-01', 'jacob'),
    (7, '2006-07-02', 'jasmine'),
    (7, '2006-07-04', 'jasmine');

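I am trying to understand the behaviour of existing customers. The question I want to answer is:

Depending on when a customer last ordered (current month, previous month (m-1) ... back to m-12), how likely is that customer to order again?

The likelihood is calculated as: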

distinct count of people who ordered in current month /
distinct count of people in their cohort.

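So I need to produce a table that lists, for each cohort, how many of its members ordered in the current month.

So what are the rules for belonging to a cohort?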

- current month cohort: >1 order in month OR (1 order in month given no previous orders)
- m-1 cohort: <=1 order in current month and >=1 order in m-1
- m-2 cohort: <=1 order in current month and 0 orders in m-1 and >=1 order in m-2
- etc

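I am using the DVD Store database as sample data to develop the query: http://linux.dell.com/dvdstore/

Here is an example of the cohort rules and aggregations, with July as the "month's orders being analysed" (note: the "month's orders being analysed" column is the first column in the "desired output" table below):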

customer_id | jul-16| jun-16| may-16|
------------|-------|-------|-------|
james       | 1  1  | 1     | 1     | <- member of jul cohort, made order in jul
jasmine     | 1  1  |       |       | <- member of jul cohort, made order in jul
jacob       | 1     |       |       | <- member of jul cohort, did NOT make order in jul
john        | 1     | 1     | 1     | <- member of jun cohort, made order in jul
julia       |       | 1     | 1     | <- member of jun cohort, did NOT make order in jul
juliet      | 1     |       | 1     | <- member of may cohort, made order in jul
jacinta     |       |       | 1 1   | <- member of may cohort, did NOT make order in jul

This data would produce the following table (the desired output):

--where m = month's orders being analysed

month's orders |how many people |how many people from  |how many people   |how many people from    |how many people   |how many people from    |
being analysed |are in cohort m |cohort m ordered in m |are in cohort m-1 |cohort m-1 ordered in m |are in cohort m-2 |cohort m-2 ordered in m |...m-12
---------------|----------------|----------------------|------------------|------------------------|------------------|------------------------|
may-16         |5               |1                     |                  |                        |                  |                        |
jun-16         |                |                      |5                 |3                       |                  |                        |
jul-16         |3               |2                     |2                 |1                       |2                 |1                       |

My attempts so far have been variations on:

generate_series()

row_number() over (partition by customer_id order by rental_id desc)

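I have not been able to put it all together (I have been trying for hours and have not solved it yet).

For readability, I think it is better to post my work in parts (if anyone wants me to post the full SQL query, please comment and I will add it).

Series query: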

(select
    generate_series(date_trunc('month',min(rental_date)),date_trunc('month',max(rental_date)),'1 month') as month_being_analysed
from
    rental) as series

Ranking query:

(select
    *,
    row_number() over (partition by customer_id order by rental_id desc) as rnk
from
    rental
where
    date_trunc('month',rental_date) <= series.month_being_analysed) as orders_ranked

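What I would like to do is something like this: run the orders_ranked query for every row returned by the series query, then aggregate over each result of orders_ranked.

Something like: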

(--this query counts the customers in cohort m-1
select
    count(distinct customer_id)
from
    (--this query ranks the orders that have occurred <= to the date in the row of the 'series' table
    select
        *,
        row_number() over (partition by customer_id order by rental_id desc) as rnk
    from
        rental
    where
        date_trunc('month',rental_date)<=series.month_being_analysed) as orders_ranked
where
    (rnk=1 between series.month_being_analysed - interval '2 months' and series.month_being_analysed - interval '1 months')
    OR
    (rnk=2 between series.month_being_analysed - interval '2 months' and series.month_being_analysed - interval '1 months')
) as people_2nd_last_booking_in_m_1,


(--this query counts the customers in cohort m-1 who ordered in month m
select
    count(distinct customer_id)
from
    (--this query returns the orders by customers in cohort m-1
    select
        count(distinct customer_id)
    from
        (--this query ranks the orders that have occurred <= to the date in the row of the 'series' table
        select
            *,
            row_number() over (partition by customer_id order by rental_id desc) as rnk
        from
            rental
        where
            date_trunc('month',rental_date)<=series.month_being_analysed) as orders_ranked
    where
        (rnk=1 between series.month_being_analysed - interval '2 months' and series.month_being_analysed - interval '1 months')
        OR
        (rnk=2 between series.month_being_analysed - interval '2 months' and series.month_being_analysed - interval '1 months')
where
    rnk=1 in series.month_being_analysed
) as people_who_booked_in_m_whose_2nd_last_booking_was_in_m_1,
...
from
    (select
        generate_series(date_trunc('month',min(rental_date)),date_trunc('month',max(rental_date)),'1 month') as month_being_analysed
    from
        rental) as series
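
The idea of running the ranking query once for every row of the series can be expressed with a LATERAL join, available since Postgres 9.3. The following is only a minimal sketch against the test table above, not a solution to the cohort counts. Because the test table has no rental_id column, it assumes that ordering by rental_date is an acceptable stand-in; the aliases s and o are made up for illustration.

```sql
-- Sketch: rank each customer's orders up to and including every analysed month.
-- Assumption: rental_date replaces rental_id as the ordering column,
-- because the test table has no rental_id.
SELECT s.month_being_analysed
     , o.customer_id
     , o.rental_date
     , o.rnk
FROM  (
   SELECT generate_series(date_trunc('month', min(rental_date))
                        , date_trunc('month', max(rental_date))
                        , interval '1 month') AS month_being_analysed
   FROM   rental
   ) s
CROSS  JOIN LATERAL (
   -- evaluated once per row of s, so it may reference s.month_being_analysed
   SELECT r.customer_id
        , r.rental_date
        , row_number() OVER (PARTITION BY r.customer_id
                             ORDER BY r.rental_date DESC) AS rnk
   FROM   rental r
   WHERE  date_trunc('month', r.rental_date) <= s.month_being_analysed
   ) o
ORDER  BY s.month_being_analysed, o.customer_id, o.rnk;
```

Here rnk = 1 is each customer's most recent order as of the analysed month, and rnk = 2 the one before it.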

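【Comments】:

  • Please provide an exact table definition with the data types and all constraints of the relevant columns rental_id, customer_id and rental_date. Ideally a valid CREATE TABLE statement. (I am not going to extract it from the linked archive myself.) And, as always, your version of Postgres.
  • Also: current month cohort: >1 order in month. I suspect it should be >= in all instances of your definition? Please clarify. Also: cardinalities? How many orders per customer and per month: min, max, avg?
  • @ErwinBrandstetter I have added test data (the .tar file of the dvdrental db and the relevant commands). I tried to dump just the table but ran into problems; hopefully what I have added is enough. You are right, the cohorts should be >=, and I have added a table showing the cohort rules in action, which hopefully clarifies things. I have added the Postgres version at the top: 9.5. In terms of data volume: millions of rows. Customers per month: hundreds of thousands; average orders per month
  • Also, are you on codementor? :-)
  • Changed the test data from a .tar file to insert commands.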

Tags: sql postgresql crosstab window-functions generate-series


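【Answer 1】:

This query does it all. It operates on the whole table and works for any time range.

It is based on some assumptions and assumes the current Postgres version, 9.5. It should work with at least pg 9.1. Since your definition of "cohort" is unclear to me, I skipped the "how many people are in cohort" columns.

I would expect it to be faster than anything you have tried so far, by orders of magnitude.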

SELECT *
FROM   crosstab (
   $$
   SELECT mon
        , sum(count(*)) OVER (PARTITION BY mon)::int AS m0
        , gap   -- count of months since last order
        , count(*) AS gap_ct
   FROM  (
      SELECT mon
           , mon_int - lag(mon_int) OVER (PARTITION BY c_id ORDER BY mon_int) AS gap
      FROM  (
         SELECT DISTINCT ON (1,2)
                date_trunc('month', rental_date)::date AS mon
              , customer_id                            AS c_id
              , extract(YEAR  FROM rental_date)::int * 12
              + extract(MONTH FROM rental_date)::int   AS mon_int
         FROM   rental
         ) dist_customer
      ) gap_to_last_month
   GROUP  BY mon, gap
   ORDER  BY mon, gap
   $$
 , 'SELECT generate_series(1,12)'
   ) ct (mon date, m0 int
       , m01 int, m02 int, m03 int, m04 int, m05 int, m06 int
       , m07 int, m08 int, m09 int, m10 int, m11 int, m12 int);
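
Note that crosstab() is not a built-in function. It is provided by the additional module tablefunc, which has to be installed once per database before the query above can run. A minimal sketch, assuming the module's files are available on the server and the current role is allowed to create extensions:

```sql
-- Install the tablefunc module (provides crosstab()) in the current database.
-- IF NOT EXISTS makes the statement safe to repeat.
CREATE EXTENSION IF NOT EXISTS tablefunc;
```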

Result:

    mon     | m0 | m01 | m02 | m03 | m04 | m05 | m06 | m07 | m08 | m09 | m10 | m11 | m12
------------+----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----
 2015-01-01 | 63 |  36 |  15 |   5 |   3 |   3 |     |     |     |     |     |     |
 2015-02-01 | 56 |  35 |   9 |   9 |   2 |     |   1 |     |     |     |     |     |
...

m0 .. customers with >= 1 order this month
m01 .. customers with >= 1 order this month and >= 1 order 1 month before (no orders in between)
m02 .. customers with >= 1 order this month and >= 1 order 2 months before and no order in between
etc.
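
If installing tablefunc is not an option, roughly the same shape can be produced with conditional aggregates instead of crosstab(). This is a sketch of an alternative technique, not the query from this answer: it relies on the aggregate FILTER clause (Postgres 9.4 or later), spells out only gaps 1 to 3, and returns 0 where crosstab() would leave a cell empty (NULL).

```sql
-- Sketch: pivot the gap counts with FILTER instead of crosstab().
-- m04 .. m12 would follow the same pattern as m01 .. m03.
SELECT mon
     , count(*)                        AS m0   -- customers with >= 1 order in mon
     , count(*) FILTER (WHERE gap = 1) AS m01  -- previous order month was 1 month back
     , count(*) FILTER (WHERE gap = 2) AS m02
     , count(*) FILTER (WHERE gap = 3) AS m03
FROM  (
   SELECT mon
        , mon_int - lag(mon_int) OVER (PARTITION BY c_id ORDER BY mon_int) AS gap
   FROM  (
      -- one row per (month, customer)
      SELECT DISTINCT
             date_trunc('month', rental_date)::date AS mon
           , customer_id                            AS c_id
           , extract(YEAR  FROM rental_date)::int * 12
           + extract(MONTH FROM rental_date)::int   AS mon_int
      FROM   rental
      ) dist_customer
   ) gap_to_last_month
GROUP  BY mon
ORDER  BY mon;
```

Run against the test data in the question, m0 / m01 / m02 come out as 5 / 0 / 0 for May, 3 / 3 / 0 for June and 5 / 2 / 1 for July.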

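How?

  1. In subquery dist_customer, reduce to one row per month and customer_id (mon, c_id) with DISTINCT ON.

    To simplify later calculations, add a count of months for the date (mon_int).

    If there are many orders per (month, customer), there are faster query techniques for this first step.

  2. In subquery gap_to_last_month, add the column gap, indicating the time gap between this month and the last month with any orders from the same customer. The window function lag() is used for this.

  3. In the outer SELECT, aggregate per (mon, gap) to get the counts you are after. In addition, get the total count of distinct customers for m0.

  4. Feed this query to crosstab() to pivot the result into the desired tabular form. m0 is carried along as an "extra" column.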
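
The skipped "how many people are in cohort" columns could be approached roughly as follows. This is only a sketch under one reading of the cohort rules in the question: a customer belongs to cohort m when they have more than one order in month m, or exactly one order and no earlier orders; otherwise they belong to cohort m-k, where k is the number of months back to their latest earlier order. The names months, per_customer, classified, cohort_gap, in_cohort and ordered_in_m are made up for illustration. The result is in long format (one row per month and cohort) rather than pivoted, gaps are not capped at 12, and performance on millions of rows has not been considered.

```sql
-- Sketch: cohort size and reorder count per analysed month (long format).
WITH months AS (
   SELECT generate_series(date_trunc('month', min(rental_date))
                        , date_trunc('month', max(rental_date))
                        , interval '1 month') AS m
   FROM   rental
   )
, per_customer AS (
   -- per analysed month: every customer with any order up to the end of that month
   SELECT mo.m
        , r.customer_id
        , count(*) FILTER (WHERE r.rental_date >= mo.m) AS orders_in_m
        , max(date_trunc('month', r.rental_date))
             FILTER (WHERE r.rental_date < mo.m)        AS last_prior_month
   FROM   months mo
   JOIN   rental r ON r.rental_date < mo.m + interval '1 month'
   GROUP  BY mo.m, r.customer_id
   )
, classified AS (
   SELECT m
        , customer_id
          -- 0 = cohort m, k = cohort m-k
        , CASE WHEN orders_in_m > 1 OR last_prior_month IS NULL THEN 0
               ELSE (extract(YEAR  FROM age(m, last_prior_month)) * 12
                   + extract(MONTH FROM age(m, last_prior_month)))::int
          END AS cohort_gap
          -- in cohort m the first order only establishes membership
        , (orders_in_m > 1
           OR (orders_in_m = 1 AND last_prior_month IS NOT NULL)) AS ordered_in_m
   FROM   per_customer
   )
SELECT m::date                              AS month_analysed
     , cohort_gap
     , count(*)                             AS in_cohort
     , count(*) FILTER (WHERE ordered_in_m) AS ordered_in_m
FROM   classified
GROUP  BY m, cohort_gap
ORDER  BY m, cohort_gap;
```

Against the question's test data this reading gives in_cohort / ordered_in_m of 5 / 1 for May (cohort m), 5 / 3 for June (cohort m-1), and 3 / 2, 2 / 1 and 2 / 1 for July (cohorts m, m-1 and m-2), which agrees with the desired output table in the question.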

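【Comments】:

  • Thanks for posting this! I have spent the past few hours going through the links and will keep going through them over the next few days to make sure I understand each of them. I have updated my question, which should help make things clearer. As it stands, what was posted does not answer the question. If you have time, it would be great if you could revisit my question with the updates I have made. Either way, I hope that once I understand your query I can use it as a base. Thanks again!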