【Question title】: Select distinct users group by time range
【Posted】: 2013-04-09 16:08:48
【Question】:

I have a table with the following columns:

 |date | user_id | week_beg | month_beg|

SQL to create the table with test values:

CREATE TABLE uniques
(
  date DATE,
  user_id INT,
  week_beg DATE,
  month_beg DATE
);
INSERT INTO uniques VALUES ('2013-01-01', 1, '2012-12-30', '2013-01-01');
INSERT INTO uniques VALUES ('2013-01-03', 3, '2012-12-30', '2013-01-01');
INSERT INTO uniques VALUES ('2013-01-06', 4, '2013-01-06', '2013-01-01');
INSERT INTO uniques VALUES ('2013-01-07', 4, '2013-01-06', '2013-01-01');

Input table:

 | date       | user_id     | week_beg   | month_beg  |    
 | 2013-01-01 | 1           | 2012-12-30 | 2013-01-01 |    
 | 2013-01-03 | 3           | 2012-12-30 | 2013-01-01 |    
 | 2013-01-06 | 4           | 2013-01-06 | 2013-01-01 |    
 | 2013-01-07 | 4           | 2013-01-06 | 2013-01-01 |  

Output table:

 | date       | time_series | cnt        |                 
 | 2013-01-01 | D           | 1          |                 
 | 2013-01-01 | W           | 1          |                 
 | 2013-01-01 | M           | 1          |                 
 | 2013-01-03 | D           | 1          |                 
 | 2013-01-03 | W           | 2          |                 
 | 2013-01-03 | M           | 2          |                 
 | 2013-01-06 | D           | 1          |                 
 | 2013-01-06 | W           | 1          |                 
 | 2013-01-06 | M           | 3          |                 
 | 2013-01-07 | D           | 1          |                 
 | 2013-01-07 | W           | 1          |                 
 | 2013-01-07 | M           | 3          |

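For each date, I want to count the number of distinct user_id values:

  1. On that day

  2. In that week, up to that date (week to date)

  3. In that month, up to that date (month to date)

1 is easy to calculate. For 2 and 3, I am trying queries like these: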

SELECT
  date,
  'W' AS "time_series",
  COUNT(DISTINCT user_id) OVER (PARTITION BY week_beg) AS "cnt"
  FROM user_subtitles

SELECT
  date,
  'M' AS "time_series",
  COUNT(DISTINCT user_id) OVER (PARTITION BY month_beg) AS "cnt"
  FROM user_subtitles

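Postgres does not allow DISTINCT inside window functions, so this approach does not work.

I also tried a GROUP BY approach, but it does not work because it gives me the numbers for the whole week/month rather than up to each date.

What is the best way to solve this?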

【Discussion】:

  • Please share some input data and the expected output
  • @Akash I have added the information, thanks

Tags: sql postgresql date correlated-subquery window-functions


【Solution 1】:

Count all rows

SELECT date, '1_D' AS time_series,  count(DISTINCT user_id) AS cnt
FROM   uniques
GROUP  BY 1

UNION  ALL
SELECT DISTINCT ON (1)
       date, '2_W', count(*) OVER (PARTITION BY week_beg ORDER BY date)
FROM   uniques

UNION  ALL
SELECT DISTINCT ON (1)
       date, '3_M', count(*) OVER (PARTITION BY month_beg ORDER BY date)
FROM   uniques
ORDER  BY 1, time_series
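  • Your columns week_beg and month_beg are 100% redundant and can easily be replaced by date_trunc('week', date + 1) - 1 and date_trunc('month', date), respectively.

  • Your week seems to start on Sunday (one day before the Monday that date_trunc uses), hence the + 1 .. - 1.

  • The default frame of a window function with ORDER BY in the OVER clause is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. That is exactly what you need.

  • Use UNION ALL, not UNION.

  • Your unfortunate choice of values for time_series (D, W, M) does not sort well, so I renamed them to make the final ORDER BY easier.

  • This query can deal with multiple rows per day. Counts include all peers for a day.

  • More about DISTINCT ON: see the SELECT reference in the PostgreSQL manual.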
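As a rough cross-check of the two date_trunc expressions in the list above, here is a minimal Python sketch (not part of the original answer). The helper names week_beg and month_beg are made up here to mirror the column names; the sample dates are the ones from the question.

```python
from datetime import date, timedelta


def week_beg(d: date) -> date:
    """Start of the week containing d, with weeks starting on Sunday.

    Mirrors date_trunc('week', d + 1) - 1. date.weekday() returns 0 for
    Monday and 6 for Sunday, so (weekday + 1) % 7 is the number of days
    elapsed since the most recent Sunday.
    """
    return d - timedelta(days=(d.weekday() + 1) % 7)


def month_beg(d: date) -> date:
    """First day of the month containing d, like date_trunc('month', d)."""
    return d.replace(day=1)


sample_dates = [date(2013, 1, 1), date(2013, 1, 3), date(2013, 1, 6), date(2013, 1, 7)]
for d in sample_dates:
    print(d, week_beg(d), month_beg(d))
```

For the four sample dates this reproduces the week_beg and month_beg values stored in the question's table, which is why the stored columns can be dropped.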

Distinct users per day

To count every user only once per day, use a CTE with DISTINCT ON:

WITH x AS (SELECT DISTINCT ON (1,2) date, user_id FROM uniques)
SELECT date, '1_D' AS time_series,  count(user_id) AS cnt
FROM   x
GROUP  BY 1

UNION ALL
SELECT DISTINCT ON (1)
       date, '2_W'
      ,count(*) OVER (PARTITION BY (date_trunc('week', date + 1)::date - 1)
                      ORDER BY date)
FROM   x

UNION ALL
SELECT DISTINCT ON (1)
       date, '3_M'
      ,count(*) OVER (PARTITION BY date_trunc('month', date) ORDER BY date)
FROM   x
ORDER BY 1, 2
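To make visible what this query actually counts, the following Python sketch emulates the month branch on the question's test rows. It is an illustration only, not the answer's code: running_count is a hypothetical helper that imitates count(*) OVER (PARTITION BY ... ORDER BY date) with the default RANGE frame.

```python
from datetime import date

# (date, user_id) test rows from the question
rows = [
    (date(2013, 1, 1), 1),
    (date(2013, 1, 3), 3),
    (date(2013, 1, 6), 4),
    (date(2013, 1, 7), 4),
]


def month_beg(d: date) -> date:
    """First day of the month containing d."""
    return d.replace(day=1)


def running_count(pairs, period_start):
    """Per date, count the rows of the same period with a date <= that date.

    With the default RANGE frame, all rows sharing the current date (peers)
    are included, so every row of one day sees the same count.
    """
    result = {}
    for d in sorted({d for d, _ in pairs}):
        result[d] = sum(
            1 for d2, _ in pairs
            if period_start(d2) == period_start(d) and d2 <= d
        )
    return result


deduped = set(rows)  # plays the role of DISTINCT ON (date, user_id)
month_to_date = running_count(deduped, month_beg)
for d, cnt in month_to_date.items():
    print(d, cnt)
```

On this data the count for 2013-01-07 comes out as 4, while the question expects 3: user 4 is counted once on 2013-01-06 and again on 2013-01-07. Deduplicating per day does not deduplicate across days within the period.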

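Distinct users over a dynamic period of time

You can always resort to correlated subqueries. They tend to be slow with big tables!
Building on the previous queries: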

WITH du AS (SELECT date, user_id FROM uniques GROUP BY 1,2)
    ,d  AS (
    SELECT date
          ,(date_trunc('week', date + 1)::date - 1) AS week_beg
          ,date_trunc('month', date)::date AS month_beg
    FROM   uniques
    GROUP  BY 1
    )
SELECT date, '1_D' AS time_series,  count(user_id) AS cnt
FROM   du
GROUP  BY 1

UNION ALL
SELECT date, '2_W', (SELECT count(DISTINCT user_id) FROM du
                     WHERE  du.date BETWEEN d.week_beg AND d.date )
FROM   d
GROUP  BY date, week_beg

UNION ALL
SELECT date, '3_M', (SELECT count(DISTINCT user_id) FROM du
                     WHERE  du.date BETWEEN d.month_beg AND d.date)
FROM   d
GROUP  BY date, month_beg
ORDER  BY 1,2;
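The logic of the correlated subqueries can be sketched in plain Python as a brute-force reference: for every date, collect the set of users seen between the period start and that date. This is an illustration under the assumption of Sunday-based weeks, as in the question; distinct_to_date, week_beg and month_beg are names invented for this sketch.

```python
from datetime import date, timedelta

# (date, user_id) test rows from the question
rows = [
    (date(2013, 1, 1), 1),
    (date(2013, 1, 3), 3),
    (date(2013, 1, 6), 4),
    (date(2013, 1, 7), 4),
]


def week_beg(d: date) -> date:
    """Sunday on or before d (weeks start on Sunday)."""
    return d - timedelta(days=(d.weekday() + 1) % 7)


def month_beg(d: date) -> date:
    """First day of the month containing d."""
    return d.replace(day=1)


def distinct_to_date(pairs, period_start):
    """Per date, count distinct users with period_start(date) <= their date <= date.

    Same idea as: SELECT count(DISTINCT user_id) FROM du
                  WHERE du.date BETWEEN d.week_beg AND d.date
    """
    result = {}
    for d in sorted({d for d, _ in pairs}):
        start = period_start(d)
        result[d] = len({u for d2, u in pairs if start <= d2 <= d})
    return result


week_to_date = distinct_to_date(rows, week_beg)
month_to_date = distinct_to_date(rows, month_beg)
for d in week_to_date:
    print(d, "W", week_to_date[d], "M", month_to_date[d])
```

On the test rows this yields the W and M counts listed in the question's output table. Like the SQL version, it rescans the data once per date, which is why it gets slow on large tables.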

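SQL Fiddle for all three solutions.

Faster with dense_rank()

@Clodoaldo came up with a major improvement: use the window function dense_rank(). Here is another idea for an optimized version. Excluding daily duplicates right away should be faster still. The performance gain grows with the number of rows per day.

This builds on a simplified and sanitized data model, without the redundant columns and with day as the column name instead of date.

date is a reserved word in standard SQL and the name of a basic type in PostgreSQL, so it should not be used as an identifier.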

CREATE TABLE uniques(
   day date     -- instead of "date"
  ,user_id int
);

Improved query:

WITH du AS (
   SELECT DISTINCT ON (1, 2)
          day, user_id 
         ,date_trunc('week',  day + 1)::date - 1 AS week_beg
         ,date_trunc('month', day)::date         AS month_beg
   FROM   uniques
   )
SELECT day, count(user_id) AS d, max(w) AS w, max(m) AS m
FROM  (
    SELECT user_id, day
          ,dense_rank() OVER(PARTITION BY week_beg  ORDER BY user_id) AS w
          ,dense_rank() OVER(PARTITION BY month_beg ORDER BY user_id) AS m
    FROM   du
    ) s
GROUP  BY day
ORDER  BY day;

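SQL Fiddle demonstrating the performance of 4 faster variants. Which one is fastest for you depends on your data distribution.
All of them are about 10x as fast as the correlated subquery version (which is not bad for correlated subqueries).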
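To see what max(dense_rank()) per day computes, here is a small Python emulation of the week branch. It is a sketch, not the answer's code: dense_rank_max_per_day is a hypothetical helper, and the second data set is invented for illustration.

```python
from datetime import date, timedelta


def week_beg(d: date) -> date:
    """Sunday on or before d (weeks start on Sunday)."""
    return d - timedelta(days=(d.weekday() + 1) % 7)


def dense_rank_max_per_day(pairs, period_start):
    """Emulate max(dense_rank() OVER (PARTITION BY period ORDER BY user_id)) per day.

    dense_rank is the 1-based position of a user_id among the distinct
    user_ids of the WHOLE period, regardless of the date of each row.
    """
    users_by_period = {}
    for d, u in pairs:
        users_by_period.setdefault(period_start(d), set()).add(u)
    ranks = {
        p: {u: i for i, u in enumerate(sorted(users), start=1)}
        for p, users in users_by_period.items()
    }
    result = {}
    for d, u in pairs:
        result[d] = max(result.get(d, 0), ranks[period_start(d)][u])
    return dict(sorted(result.items()))


# Test rows from the question: user_ids happen to grow with the date.
question_rows = [
    (date(2013, 1, 1), 1), (date(2013, 1, 3), 3),
    (date(2013, 1, 6), 4), (date(2013, 1, 7), 4),
]
# Invented rows: the larger user_id shows up first within the week.
reversed_rows = [(date(2013, 1, 1), 5), (date(2013, 1, 3), 3)]

w_question = dense_rank_max_per_day(question_rows, week_beg)
w_reversed = dense_rank_max_per_day(reversed_rows, week_beg)
print(w_question)
print(w_reversed)
```

On the question's rows the result matches the expected W column. On the invented rows it gives 2 for 2013-01-01 and 1 for 2013-01-03, whereas the week-to-date distinct counts would be 1 and 2. The rank is taken against every user of the whole week, so the per-day maximum equals the to-date count only when user_ids first appear in ascending order within each period. It seems worth verifying the query against a brute-force count on your own data before relying on it.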

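【Discussion】:

  • Thanks for the answer, @Erwin. It only partly solves the problem, because it does not count distinct user_ids for M and W. I have updated the test data in my question to capture that.
  • Yes, I added the week_beg and month_beg columns from a previous table (computed the same way you mention) to make grouping easier.
  • As the documentation says: "Aggregate window functions, unlike normal aggregate functions, do not allow DISTINCT or ORDER BY to be used within the function argument list."
  • @ishan: Exactly, there is no DISTINCT inside window functions. But you can apply it before the window function runs. I added a solution.
  • The "Distinct users per day" solution has another problem. When I compute the number for M, I need the distinct users from the start of the month up to the date. This solution counts a user_id again on every new date, but over the month to date those user_ids should not be counted twice. Thanks a lot for your help, @Erwin.
【Solution 2】: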

Without correlated subqueries. SQL Fiddle

with u as (
    select
        "date", user_id,
        date_trunc('week', "date" + 1)::date - 1 week_beg,
        date_trunc('month', "date")::date month_beg
    from uniques
)
select
    "date", count(distinct user_id) D,
    max(week_dr) W, max(month_dr) M
from (
    select
        user_id, "date",
        dense_rank() over(partition by week_beg order by user_id) week_dr,
        dense_rank() over(partition by month_beg order by user_id) month_dr
    from u
    ) s
group by "date"
order by "date"

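【Discussion】:

  • +1 Very nice. I was experimenting with dense_rank() but ran out of time.
  • I think I found another improvement and added it to my answer, plus a test case. You might be interested.
【Solution 3】:

Try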

SELECT
  * 
FROM 
(
  SELECT dates, count(user_id), 'D' as timesereis FROM users_data GROUP BY dates
  UNION
  SELECT max(dates), count(user_id), 'W' FROM users_data GROUP BY date_part('year',dates), date_part('week',dates)
  UNION
  SELECT max(dates), count(user_id), 'M' FROM users_data GROUP BY date_part('year',dates), date_part('month',dates)
) tEMP order by dates, timesereis

SQLFIDDLE

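【Discussion】:

  • It does not give me D, W and M values for all dates. I want all of those rows.
  • The problem is that I cannot group without including "date".
【Solution 4】:

Try a query like this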

SELECT count(distinct user_id), date_format(date, '%Y-%m-%d') as date_period
FROM uniques
GROUP By date_period

【Discussion】:

  • I am working with PostgreSQL. But anyway, how does this solve my problem?