【Posted】: 2016-11-19 15:23:55
【Question】:
Using Postgres 9.5. Test data:
create temp table rental (
customer_id smallint
,rental_date timestamp without time zone
,customer_name text
);
insert into rental values
(1, '2006-05-01', 'james'),
(1, '2006-06-01', 'james'),
(1, '2006-07-01', 'james'),
(1, '2006-07-02', 'james'),
(2, '2006-05-02', 'jacinta'),
(2, '2006-05-03', 'jacinta'),
(3, '2006-05-04', 'juliet'),
(3, '2006-07-01', 'juliet'),
(4, '2006-05-03', 'julia'),
(4, '2006-06-01', 'julia'),
(5, '2006-05-05', 'john'),
(5, '2006-06-01', 'john'),
(5, '2006-07-01', 'john'),
(6, '2006-07-01', 'jacob'),
(7, '2006-07-02', 'jasmine'),
(7, '2006-07-04', 'jasmine');
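I am trying to understand the behaviour of existing customers. The question I am trying to answer is:
How likely is a customer to order again, given when their previous order was (current month, previous month (m-1), ... back to m-12)?
The likelihood is calculated as: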
distinct count of people who ordered in current month /
distinct count of people in their cohort.
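So I need to generate a table that lists, for each cohort, how many of its members ordered in the current month.
So what are the rules for belonging to a cohort?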
- current month cohort: >1 order in month OR (1 order in month given no previous orders)
- m-1 cohort: <=1 order in current month and >=1 order in m-1
- m-2 cohort: <=1 order in current month and 0 orders in m-1 and >=1 order in m-2
- etc
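One way to read these rules is through each customer's last two orders up to the end of the analysed month. The sketch below is only an illustration under that reading, not a tested solution: it hard-codes July 2006 as the analysed month, and it orders by rental_date because the test table has no rental_id column. The CTE names ranked and last_two and the output columns cohort_month and ordered_in_m are made up for this example.

```sql
-- Sketch: cohort assignment for ONE analysed month (July 2006, hard-coded).
with ranked as (
    select customer_id,
           customer_name,
           date_trunc('month', rental_date) as order_month,
           row_number() over (partition by customer_id
                              order by rental_date desc) as rnk
    from   rental
    where  rental_date < timestamp '2006-08-01'   -- ignore orders after the analysed month
),
last_two as (
    -- one row per customer: month of the latest and of the second-latest order
    select customer_id,
           customer_name,
           max(order_month) filter (where rnk = 1) as last_month,
           max(order_month) filter (where rnk = 2) as prev_month
    from   ranked
    where  rnk <= 2
    group  by customer_id, customer_name
)
select customer_name,
       case when last_month = timestamp '2006-07-01'
            then coalesce(prev_month, last_month)  -- ordered in m: the previous order decides
            else last_month                        -- no order in m: the latest order decides
       end as cohort_month,
       -- "ordered in m" here means: an order in m that had an earlier order before it
       (last_month = timestamp '2006-07-01' and prev_month is not null) as ordered_in_m
from   last_two
order  by cohort_month desc, customer_name;
```

If two orders of one customer share the same rental_date, row_number() picks between them arbitrarily; a unique tiebreaker column would be needed in that case.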
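I am using the DVD Store database as sample data to develop the query: http://linux.dell.com/dvdstore/
Here is an example of the cohort rules and aggregations, with July as the
"month's orders being analysed" (please note: the "month's orders being analysed" column is the first column in the "desired output" table below):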
customer_name | jul-16 | jun-16 | may-16 |
--------------|--------|--------|--------|
james         | 1 1    | 1      | 1      | <- member of jul cohort, made order in jul
jasmine       | 1 1    |        |        | <- member of jul cohort, made order in jul
jacob         | 1      |        |        | <- member of jul cohort, did NOT make order in jul
john          | 1      | 1      | 1      | <- member of jun cohort, made order in jul
julia         |        | 1      | 1      | <- member of jun cohort, did NOT make order in jul
juliet        | 1      |        | 1      | <- member of may cohort, made order in jul
jacinta       |        |        | 1 1    | <- member of may cohort, did NOT make order in jul
This data would produce the following table:
--where m = month's orders being analysed
month's orders |how many people |how many people from  |how many people   |how many people from    |how many people   |how many people from    |
being analysed |are in cohort m |cohort m ordered in m |are in cohort m-1 |cohort m-1 ordered in m |are in cohort m-2 |cohort m-2 ordered in m |...m-12
---------------|----------------|----------------------|------------------|------------------------|------------------|------------------------|
may-16         |5               |1                     |                  |                        |                  |                        |
jun-16         |                |                      |5                 |3                       |                  |                        |
jul-16         |3               |2                     |2                 |1                       |2                 |1                       |
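The wide layout above can be built from a long-form result (one row per analysed month and cohort offset) with conditional aggregation. The following is a minimal sketch of only that pivot step. The cohort_counts rows are copied by hand from the table above, using the 2006 dates of the test data; the names cohort_counts, months_back and the output column names are assumptions made for the illustration.

```sql
-- Sketch: pivot long-form cohort counts into one row per analysed month.
with cohort_counts (month_being_analysed, months_back, people_in_cohort, people_who_ordered) as (
    values (date '2006-05-01', 0, 5, 1),
           (date '2006-06-01', 1, 5, 3),
           (date '2006-07-01', 0, 3, 2),
           (date '2006-07-01', 1, 2, 1),
           (date '2006-07-01', 2, 2, 1)
)
select month_being_analysed,
       -- each (month, months_back) pair occurs at most once, so max() just picks that value
       max(people_in_cohort)   filter (where months_back = 0) as in_cohort_m,
       max(people_who_ordered) filter (where months_back = 0) as cohort_m_ordered_in_m,
       max(people_in_cohort)   filter (where months_back = 1) as in_cohort_m_1,
       max(people_who_ordered) filter (where months_back = 1) as cohort_m_1_ordered_in_m,
       max(people_in_cohort)   filter (where months_back = 2) as in_cohort_m_2,
       max(people_who_ordered) filter (where months_back = 2) as cohort_m_2_ordered_in_m
from   cohort_counts
group  by month_being_analysed
order  by month_being_analysed;
```

Columns for m-3 through m-12 would follow the same pattern. Cohorts with no members come out as NULL, which corresponds to the empty cells in the table.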
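My attempts so far have been variations on:
generate_series()
and
row_number() over (partition by customer_id order by rental_id desc)
I haven't been able to put it all together yet (I have been trying for hours and still haven't solved it).
For readability, I think it is better to post my work in parts (if anyone wants me to post the full SQL query, please comment and I will add it).
Series query: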
(select
generate_series(date_trunc('month',min(rental_date)),date_trunc('month',max(rental_date)),interval '1 month') as month_being_analysed
from
rental) as series
Ranking query:
(select
*,
row_number() over (partition by customer_id order by rental_id desc) as rnk
from
rental
where
date_trunc('month',rental_date) <= series.month_being_analysed) as orders_ranked
I want to do something like this: run the orders_ranked query for each row returned by the series query, then aggregate over each result set of orders_ranked.
Something like:
(--this query counts the customers in cohort m-1
select
count(distinct customer_id)
from
(--this query ranks the orders that have occurred <= to the date in the row of the 'series' table
select
*,
row_number() over (partition by customer_id order by rental_id desc) as rnk
from
rental
where
date_trunc('month',rental_date)<=series.month_being_analysed) as orders_ranked
where
(rnk=1 between series.month_being_analysed - interval '2 months' and series.month_being_analysed - interval '1 months')
OR
(rnk=2 between series.month_being_analysed - interval '2 months' and series.month_being_analysed - interval '1 months')
) as people_2nd_last_booking_in_m_1,
(--this query counts the customers in cohort m-1 who ordered in month m
select
count(distinct customer_id)
from
(--this query returns the orders by customers in cohort m-1
select
count(distinct customer_id)
from
(--this query ranks the orders that have occurred <= to the date in the row of the 'series' table
select
*,
row_number() over (partition by customer_id order by rental_id desc) as rnk
from
rental
where
date_trunc('month',rental_date)<=series.month_being_analysed) as orders_ranked
where
(rnk=1 between series.month_being_analysed - interval '2 months' and series.month_being_analysed - interval '1 months')
OR
(rnk=2 between series.month_being_analysed - interval '2 months' and series.month_being_analysed - interval '1 months')
where
rnk=1 in series.month_being_analysed
) as people_who_booked_in_m_whose_2nd_last_booking_was_in_m_1,
...
from
(select
generate_series(date_trunc('month',min(rental_date)),date_trunc('month',max(rental_date)),interval '1 month') as month_being_analysed
from
rental) as series
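The idea of running the ranking once per row of the series corresponds to a LATERAL join in Postgres (available since 9.3). The sketch below is one possible shape, not a verified answer: it interprets the cohort rules through each customer's last two orders up to the end of the analysed month, orders by rental_date because the test table has no rental_id, and returns the counts in long form (one row per analysed month and cohort month) rather than the wide layout. The aliases s, r, c and the output column names are made up for this example.

```sql
-- Sketch: for every analysed month, re-rank the orders up to that month,
-- derive each customer's cohort, then count per (month, cohort).
with series as (
    select generate_series(date_trunc('month', min(rental_date)),
                           date_trunc('month', max(rental_date)),
                           interval '1 month') as month_being_analysed
    from   rental
)
select s.month_being_analysed,
       case when c.last_month = s.month_being_analysed
            then coalesce(c.prev_month, c.last_month)  -- ordered in m: previous order decides
            else c.last_month                          -- no order in m: latest order decides
       end as cohort_month,
       count(*) as people_in_cohort,
       count(*) filter (where c.last_month = s.month_being_analysed
                          and c.prev_month is not null) as people_who_ordered_in_m
from   series s
cross  join lateral (
    -- one row per customer: months of the latest and second-latest order so far
    select r.customer_id,
           max(r.order_month) filter (where r.rnk = 1) as last_month,
           max(r.order_month) filter (where r.rnk = 2) as prev_month
    from  (
        select customer_id,
               date_trunc('month', rental_date) as order_month,
               row_number() over (partition by customer_id
                                  order by rental_date desc) as rnk
        from   rental
        where  rental_date < s.month_being_analysed + interval '1 month'
    ) r
    where  r.rnk <= 2
    group  by r.customer_id
) c
group  by 1, 2
order  by 1, 2 desc;
```

On the test data this should give the same counts as the desired-output table, one row per non-empty cell pair. Because the ranking is recomputed for every month in the series, it may be slow on millions of rows; that cost has not been measured here.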
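【Comments】:
-
Please provide the exact table definition with the relevant columns
rental_id, customer_id, rental_date, their data types and all constraints. Ideally a valid CREATE TABLE statement. (I am not going to extract it from the linked archive myself.) And, as always, your Postgres version. -
Also:
current month cohort: >1 order in month. I suspect it should be >= in all instances of your definition? Please clarify. Also: cardinalities? How many orders per customer and per month: min, max, avg? -
@ErwinBrandstetter I have added test data (a .tar file of the dvdrental db and the relevant commands) - I tried to dump just the table but ran into problems - hopefully I have added enough. You are right, the cohorts should be
>=, and I have added a table showing the cohort rules in action - hopefully that clarifies things. I have added the Postgres version at the top - 9.5. In terms of data volume - millions of rows. Customers per month: hundreds of thousands, average orders per month -
Also, are you on codementor? :-)
-
Changed the test data from a .tar file to insert commands.
Tags: sql postgresql crosstab window-functions generate-series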