【Question title】: Query values by max date in PostgreSQL
【Posted】: 2022-01-03 22:43:26
【Question description】:

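I already asked this question here, but gave too little information about my problem. So I created a new question with more information.

Here is my sample table. Each row contains the data a user filled in at one point in time, so the timestamp column is never null anywhere in the table. If the user did not fill in an item, there may be no recorded value for it. id is an auto-generated column for each record.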

CREATE TABLE tbl (id int, customer_id text, item text, value text, timestamp timestamp);    
INSERT INTO tbl VALUES
(1, '001', 'price', '1000', '2021-11-01 01:00:00'),
(2, '001', 'price', '1500', '2021-11-02 01:00:00'),
(3, '001', 'price', '1400', '2021-11-03 01:00:00'),
(4, '001', 'condition', 'good', '2021-11-01 01:00:00'),
(5, '001', 'condition', 'good', '2021-11-02 01:00:00'),
(6, '001', 'condition', 'ok', '2021-11-03 01:00:00'),
(7, '001', 'feeling', 'sad', '2021-11-01 01:00:00'),
(8, '001', 'feeling', 'angry', '2021-11-02 01:00:00'),
(9, '001', 'feeling', 'fine', '2021-11-03 01:00:00'),
(10, '002', 'price', '1200', '2021-11-01 01:00:00'),
(11, '002', 'price', '1600', '2021-11-02 01:00:00'),
(12, '002', 'price', '2000', '2021-11-03 01:00:00'),
(13, '002', 'weather', 'sunny', '2021-11-01 01:00:00'),
(14, '002', 'weather', 'rain', '2021-11-02 01:00:00'),
(15, '002', 'price', '1900', '2021-11-04 01:00:00'),
(16, '002', 'feeling', 'sad', '2021-11-01 01:00:00'),
(17, '002', 'feeling', 'angry', '2021-11-02 01:00:00'),
(18, '002', 'feeling', 'fine', '2021-11-03 01:00:00'),
(19, '003', 'price', '1000', '2021-11-01 01:00:00'),
(20, '003', 'price', '1500', '2021-11-02 01:00:00'),
(21, '003', 'price', '2000', '2021-11-03 01:00:00'),
(22, '003', 'condition', 'ok', '2021-11-01 01:00:00'),
(23, '003', 'weather', 'rain', '2021-11-02 01:00:00'),
(24, '003', 'condition', 'bad', '2021-11-03 01:00:00'),
(25, '003', 'feeling', 'fine', '2021-11-01 01:00:00'),
(26, '003', 'weather', 'sunny', '2021-11-03 01:00:00'),
(27, '003', 'feeling', 'sad', '2021-11-03 01:00:00')
;

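I sorted the table above by id and timestamp only for readability; the order does not matter.

  • We are using PostgreSQL version 9.5.19
  • The actual table contains more than 4 million rows
  • The item column contains more than 500 distinct items, but don't worry: I will query at most 10 items. In the table above I only used 4 items.
  • We also have another table named Customer_table, which contains each unique Customer_id along with general customer information.

From the table above, I want to query the data to build a table holding the most recently updated value per item, as shown below. I will query at most 10 items, so there may be up to 10 columns.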

customer_id  price  condition  feeling   weather .......(there may be other columns from item column)
   002        1900    null      fine      rain
   001        1400     ok       fine      null
   003        2000    bad       sad       sunny

Here is the query I got from my previous question, where I only asked about two items:

SELECT customer_id, p.value AS price, c.value AS condition
FROM  (
   SELECT DISTINCT ON (customer_id)
          customer_id, value
   FROM   tbl
   WHERE  item = 'condition'
   ORDER  BY customer_id, timestamp DESC
   ) c
FULL JOIN (
   SELECT DISTINCT ON (customer_id)
          customer_id, value
   FROM   tbl
   WHERE  item = 'price'
   ORDER  BY customer_id, timestamp DESC
   ) p USING (customer_id)

So, if there is a better solution, please help me. Thank you.

【Discussion】:

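  • The canonical form of a table definition is a CREATE TABLE statement including all constraints. Show PK, NOT NULL etc., and disclose relevant indexes! No amount of prose can make up for that. Also, Postgres 9.5 reached EOL in February 2021. Upgrade to a current version! There have been major improvements for big tables. There are even new features that might apply to your case, like WITH TIES.
  • Also important: the rough number of rows per customer (min/max/avg), for the relevant items of course, since the rest can be excluded cheaply. And the number of distinct customers: SELECT count(*) FROM Customer_table; And: do you query all customers at once, or a selection? If a selection, which one exactly?
  • There is some more information in customer_table, such as customer_type or location. So we query the customer from customer_table by filtering, and join the two tables (customer_table and the table I asked about) to form the desired table.
  • "we query the customer from customer_table by filtering": the customer? You mean one customer? It makes all the difference whether each query covers one, a few, many, or all customers.
  • Sorry, my mistake. We query customers from customer_table by filtering on customer_type, location, etc., in order to analyze those users. Then we query the columns we want to analyze from the huge table above by the item names in the item column. Then we join or do other operations as needed.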
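
The row statistics requested in the comments above could be gathered with something like the following. This is only a sketch against the sample table tbl from the question; the item list is an assumption and should be replaced with the items actually analyzed.

```sql
-- Rows per customer for the items of interest (min / max / avg),
-- plus the number of customers that have at least one such row.
SELECT count(*)          AS customers_with_rows
     , min(ct)           AS min_rows
     , max(ct)           AS max_rows
     , round(avg(ct), 1) AS avg_rows
FROM  (
   SELECT customer_id, count(*) AS ct
   FROM   tbl
   WHERE  item IN ('price', 'condition', 'feeling', 'weather')  -- assumed item list
   GROUP  BY customer_id
   ) sub;
```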

Tags: sql postgresql greatest-n-per-group


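【Solution 1】:

You can try other approaches that use row_number to generate a value for filtering your data down to the most recent rows. You can then aggregate by customer ID, keeping only rows with row number rn=1 (since we order by timestamp descending), and pick out each value by item name.

These approaches are less verbose and, judging by the online results (planning plus execution time), appear to perform somewhat better. Let me know in the comments how this replicates in your environment.

You can use EXPLAIN ANALYZE to compare these approaches with the current one. Results from the online environment provided:

Current approach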

| Planning time: 0.129 ms                                                                                                      
| Execution time: 0.056 ms      

Suggested approach 1

| Planning time: 0.061 ms                                                                                                 
| Execution time: 0.070 ms   

Suggested approach 2

| Planning time: 0.047 ms                                                                                                 
| Execution time: 0.056 ms 

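Note: you can use EXPLAIN ANALYZE to compare these approaches in your own environment, which we cannot replicate online. Results may also differ from run to run. An index on the item column and an early filter on item are also recommended to improve performance.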
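
A minimal sketch of that recommendation, assuming the sample table tbl from the question. The index name tbl_item_idx is made up for illustration, and whether the planner actually uses the index depends on the data distribution in the real table.

```sql
-- Hypothetical index name; a plain b-tree index on item can support the early filter.
CREATE INDEX tbl_item_idx ON tbl (item);

-- Early filter: restrict to the items of interest before numbering the rows.
SELECT customer_id, item, value
FROM  (
   SELECT customer_id, item, value
        , ROW_NUMBER() OVER (
             PARTITION BY customer_id, item
             ORDER BY tbl.timestamp DESC
          ) AS rn
   FROM   tbl
   WHERE  item IN ('price', 'condition')
   ) t1
WHERE  rn = 1;
```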


Schema (PostgreSQL v9.5)

Suggested approach 1

SELECT
    t1.customer_id,
    MAX(CASE WHEN t1.item='condition' THEN t1.value END) as conditio,
    MAX(CASE WHEN t1.item='price' THEN t1.value END) as price,
    MAX(CASE WHEN t1.item='feeling' THEN t1.value END) as feeling,
    MAX(CASE WHEN t1.item='weather' THEN t1.value END) as weather
FROM (
    SELECT
        * ,
        ROW_NUMBER() OVER (
            PARTITION BY customer_id,item
            ORDER BY tbl.timestamp DESC
        ) as rn
    FROM
        tbl 
    -- ensure that you filter based on your desired items
    -- indexes on item column are recommended to improve performance
) t1
WHERE rn=1
GROUP BY
   1;
customer_id  conditio  price  feeling  weather
001          ok        1400   fine     null
002          null      1900   fine     rain
003          bad       2000   sad      sunny

Suggested approach 2

SELECT
    t1.customer_id,
    MAX(t1.value) FILTER (WHERE  t1.item='condition')  as conditio,
    MAX(t1.value) FILTER (WHERE  t1.item='price')  as price,
    MAX(t1.value) FILTER (WHERE  t1.item='feeling')  as feeling,
    MAX(t1.value) FILTER (WHERE  t1.item='weather')  as weather
    
FROM (
    SELECT
        * ,
        ROW_NUMBER() OVER (
            PARTITION BY customer_id,item
            ORDER BY tbl.timestamp DESC
        ) as rn
    FROM
        tbl 
    -- ensure that you filter based on your desired items
    -- indexes on item column are recommended to improve performance
) t1
WHERE rn=1
GROUP BY
   1;
customer_id  conditio  price  feeling  weather
001          ok        1400   fine     null
002          null      1900   fine     rain
003          bad       2000   sad      sunny

Current approach with EXPLAIN ANALYZE

EXPLAIN(ANALYZE,BUFFERS)
SELECT customer_id, p.value AS price, c.value AS condition
FROM  (
   SELECT DISTINCT ON (customer_id)
          customer_id, value
   FROM   tbl
   WHERE  item = 'condition'
   ORDER  BY customer_id, timestamp DESC
   ) c
FULL JOIN (
   SELECT DISTINCT ON (customer_id)
          customer_id, value
   FROM   tbl
   WHERE  item = 'price'
   ORDER  BY customer_id, timestamp DESC
   ) p USING (customer_id);
QUERY PLAN
Merge Full Join (cost=35.05..35.12 rows=1 width=128) (actual time=0.025..0.030 rows=3 loops=1)
  Merge Cond: (tbl.customer_id = tbl_1.customer_id)
  Buffers: shared hit=2
  -> Unique (cost=17.52..17.54 rows=1 width=72) (actual time=0.013..0.014 rows=2 loops=1)
        Buffers: shared hit=1
        -> Sort (cost=17.52..17.53 rows=3 width=72) (actual time=0.013..0.013 rows=5 loops=1)
              Sort Key: tbl.customer_id, tbl."timestamp" DESC
              Sort Method: quicksort Memory: 25kB
              Buffers: shared hit=1
              -> Seq Scan on tbl (cost=0.00..17.50 rows=3 width=72) (actual time=0.004..0.006 rows=5 loops=1)
                    Filter: (item = 'condition'::text)
                    Rows Removed by Filter: 22
                    Buffers: shared hit=1
  -> Materialize (cost=17.52..17.55 rows=1 width=64) (actual time=0.010..0.013 rows=3 loops=1)
        Buffers: shared hit=1
        -> Unique (cost=17.52..17.54 rows=1 width=72) (actual time=0.010..0.012 rows=3 loops=1)
              Buffers: shared hit=1
              -> Sort (cost=17.52..17.53 rows=3 width=72) (actual time=0.010..0.010 rows=10 loops=1)
                    Sort Key: tbl_1.customer_id, tbl_1."timestamp" DESC
                    Sort Method: quicksort Memory: 25kB
                    Buffers: shared hit=1
                    -> Seq Scan on tbl tbl_1 (cost=0.00..17.50 rows=3 width=72) (actual time=0.001..0.003 rows=10 loops=1)
                          Filter: (item = 'price'::text)
                          Rows Removed by Filter: 17
                          Buffers: shared hit=1
Planning time: 0.129 ms
Execution time: 0.056 ms

Suggested approach 1 with EXPLAIN ANALYZE

EXPLAIN(ANALYZE,BUFFERS)
SELECT
    t1.customer_id,
    MAX(CASE WHEN t1.item='price' THEN t1.value END) as price,
    MAX(CASE WHEN t1.item='condition' THEN t1.value END) as conditio
    
FROM (
    SELECT
        * ,
        ROW_NUMBER() OVER (
            PARTITION BY customer_id,item
            ORDER BY tbl.timestamp DESC
        ) as rn
    FROM
        tbl 
    where item IN ('price','condition')
) t1
WHERE rn=1
GROUP BY
   1;
QUERY PLAN
GroupAggregate (cost=17.58..17.81 rows=1 width=96) (actual time=0.039..0.047 rows=3 loops=1)
  Group Key: t1.customer_id
  Buffers: shared hit=1
  -> Subquery Scan on t1 (cost=17.58..17.79 rows=1 width=96) (actual time=0.030..0.040 rows=5 loops=1)
        Filter: (t1.rn = 1)
        Rows Removed by Filter: 10
        Buffers: shared hit=1
        -> WindowAgg (cost=17.58..17.71 rows=6 width=104) (actual time=0.029..0.038 rows=15 loops=1)
              Buffers: shared hit=1
              -> Sort (cost=17.58..17.59 rows=6 width=104) (actual time=0.028..0.030 rows=15 loops=1)
                    Sort Key: tbl.customer_id, tbl.item, tbl."timestamp" DESC
                    Sort Method: quicksort Memory: 26kB
                    Buffers: shared hit=1
                    -> Seq Scan on tbl (cost=0.00..17.50 rows=6 width=104) (actual time=0.003..0.008 rows=15 loops=1)
                          Filter: (item = ANY ('{price,condition}'::text[]))
                          Rows Removed by Filter: 12
                          Buffers: shared hit=1
Planning time: 0.061 ms
Execution time: 0.070 ms

Suggested approach 2 with EXPLAIN ANALYZE

EXPLAIN(ANALYZE,BUFFERS)
SELECT
    t1.customer_id,
    MAX(t1.value) FILTER (WHERE  t1.item='price')  as price,
    MAX(t1.value) FILTER (WHERE  t1.item='condition')  as conditio
    
FROM (
    SELECT
        * ,
        ROW_NUMBER() OVER (
            PARTITION BY customer_id,item
            ORDER BY tbl.timestamp DESC
        ) as rn
    FROM
        tbl 
    where item IN ('price','condition')
) t1
WHERE rn=1
GROUP BY
   1;
QUERY PLAN
GroupAggregate (cost=17.58..17.81 rows=1 width=96) (actual time=0.029..0.037 rows=3 loops=1)
  Group Key: t1.customer_id
  Buffers: shared hit=1
  -> Subquery Scan on t1 (cost=17.58..17.79 rows=1 width=96) (actual time=0.021..0.032 rows=5 loops=1)
        Filter: (t1.rn = 1)
        Rows Removed by Filter: 10
        Buffers: shared hit=1
        -> WindowAgg (cost=17.58..17.71 rows=6 width=104) (actual time=0.021..0.030 rows=15 loops=1)
              Buffers: shared hit=1
              -> Sort (cost=17.58..17.59 rows=6 width=104) (actual time=0.019..0.021 rows=15 loops=1)
                    Sort Key: tbl.customer_id, tbl.item, tbl."timestamp" DESC
                    Sort Method: quicksort Memory: 26kB
                    Buffers: shared hit=1
                    -> Seq Scan on tbl (cost=0.00..17.50 rows=6 width=104) (actual time=0.003..0.008 rows=15 loops=1)
                          Filter: (item = ANY ('{price,condition}'::text[]))
                          Rows Removed by Filter: 12
                          Buffers: shared hit=1
Planning time: 0.047 ms
Execution time: 0.056 ms

View working demo on DB Fiddle

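【Discussion】:

  • Thank you for helping me. It works and is exactly what I asked for. I will use Suggested approach 1, but I will add where item IN ('price','condition'), as you showed in Suggested approach 1 with EXPLAIN ANALYZE.
  • @Yan: Note that query plans based on a handful of sample rows are hardly relevant. Test with your original big table to get valid results.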
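
If the real table is not available for testing, one option is to generate synthetic data of similar size. This is only a rough sketch: the table name tbl_big, the 50,000 customers, and the uniform random distribution are assumptions for illustration and will not match the real data distribution.

```sql
-- Roughly 4 million rows spread over ~50,000 customers and 500 items (all assumed numbers).
CREATE TABLE tbl_big AS
SELECT g                                                   AS id
     , lpad((floor(random() * 50000)::int)::text, 6, '0')  AS customer_id
     , 'item_' || floor(random() * 500)::int               AS item
     , (floor(random() * 2000)::int)::text                 AS value
     , timestamp '2021-01-01' + g * interval '1 second'    AS timestamp
FROM   generate_series(1, 4000000) g;

-- Refresh planner statistics before comparing plans with EXPLAIN (ANALYZE, BUFFERS).
ANALYZE tbl_big;
```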
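【Solution 2】:

You operate on a big table. You mentioned 4 million rows, obviously growing. While querying for ...

  • all customers
  • all items
  • few rows per (customer_id, item)
  • narrow rows (small row size)

... ggordon's solutions with row_number() are great. And short, too.
The whole table has to be processed in a sequential scan. Indexes will not be used.
But prefer "Approach 2" with the modern aggregate FILTER syntax. It is clearer and faster. See performance tests here: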

Approach 3: Pivot with crosstab()

crosstab() is typically faster, especially for more than a few items. See:

SELECT *
FROM   crosstab(
   $$
   SELECT customer_id, item, value
   FROM  (
      SELECT customer_id, item, value
           , row_number() OVER (PARTITION BY customer_id, item ORDER BY t.timestamp DESC) AS rn
      FROM   tbl t
      WHERE  item = ANY ('{condition,price,feeling,weather}')  -- your items here ...
      ) t1
   WHERE  rn = 1
   ORDER  BY customer_id, item
   $$
 , $$SELECT unnest('{condition,price,feeling,weather}'::text[])$$  -- ... here ...
   ) AS ct (customer_id text, condition text, price text, feeling text, weather text);  -- ... and here ...
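
One prerequisite the query above relies on: crosstab() is not part of core Postgres but comes with the additional module tablefunc, which has to be installed once per database. A minimal sketch, assuming you have the privileges to create extensions:

```sql
-- crosstab() is provided by the additional module "tablefunc".
-- Install it once per database before running the crosstab query.
CREATE EXTENSION IF NOT EXISTS tablefunc;
```

The two-parameter form used above, crosstab(source_sql, category_sql), leaves a column NULL when a customer has no row for that item, which matches the desired output in the question.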

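Approach 4: LATERAL subqueries

If one or more of the conditions listed at the top do not apply, the performance of the queries above deteriorates quickly.

For starters, at most 10 of the "500 distinct items" are involved. That is at most ~2 % of the big table. That alone should make one of the following queries (much) faster in comparison: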

SELECT *
FROM  (SELECT customer_id FROM customer) c
LEFT   JOIN LATERAL (
   SELECT value AS condition
   FROM   tbl t
   WHERE  t.customer_id = c.customer_id
   AND    t.item = 'condition'
   ORDER  BY t.timestamp DESC
   LIMIT  1
   ) AS t1 ON true
LEFT   JOIN LATERAL (
   SELECT value AS price
   FROM   tbl t
   WHERE  t.customer_id = c.customer_id
   AND    t.item = 'price'
   ORDER  BY t.timestamp DESC
   LIMIT  1
   ) AS t2 ON true
LEFT   JOIN LATERAL (
   SELECT value AS feeling
   FROM   tbl t
   WHERE  t.customer_id = c.customer_id
   AND    t.item = 'feeling'
   ORDER  BY t.timestamp DESC
   LIMIT  1
   ) AS t3 ON true
--  ... more?

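About LEFT JOIN LATERAL: each subquery may reference columns of the preceding FROM item (here c.customer_id), and LEFT JOIN ... ON true keeps a customer in the result even when a subquery finds no row, in which case that column is NULL.

The point is to get a query plan with relatively few index scans in place of the expensive sequential scan on the big table.
Obviously, this requires an applicable index: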

CREATE INDEX ON tbl (customer_id, item);

Or better (in Postgres 9.5):

CREATE INDEX ON tbl (customer_id, item, timestamp DESC, value);

In Postgres 11 or later, this would be better still:

CREATE INDEX ON tbl (customer_id, item, timestamp DESC) INCLUDE (value);

See here, here, and here.

If only a few items are of interest, partial indexes on those items would be even better.
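
A sketch of what such partial indexes might look like, assuming 'price' and 'condition' are among the items of interest. The index names are made up for illustration. Since the item is fixed by the WHERE clause of each index, it is left out of the index columns.

```sql
-- One partial index per item of interest (hypothetical index names).
-- Each index only contains the rows for its item, so it stays small.
CREATE INDEX tbl_price_idx ON tbl (customer_id, timestamp DESC, value)
WHERE  item = 'price';

CREATE INDEX tbl_condition_idx ON tbl (customer_id, timestamp DESC, value)
WHERE  item = 'condition';
-- ... repeat for each remaining item of interest
```

A subquery filtering on t.item = 'price' matches the predicate of the first index, so the planner can consider that index for it.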

Approach 5: Correlated subqueries

SELECT c.customer_id
     , (SELECT value FROM tbl t WHERE t.customer_id = c.customer_id AND t.item = 'condition' ORDER BY t.timestamp DESC LIMIT 1) AS condition
     , (SELECT value FROM tbl t WHERE t.customer_id = c.customer_id AND t.item = 'price'     ORDER BY t.timestamp DESC LIMIT 1) AS price
     , (SELECT value FROM tbl t WHERE t.customer_id = c.customer_id AND t.item = 'feeling'   ORDER BY t.timestamp DESC LIMIT 1) AS feeling
     , (SELECT value FROM tbl t WHERE t.customer_id = c.customer_id AND t.item = 'weather'   ORDER BY t.timestamp DESC LIMIT 1) AS weather
FROM   customer c;

Not as versatile as LATERAL, but good enough for the purpose. Same index requirements as approach 4.

Approach 5 will be king of performance in most cases.
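
Since the comments under the question say that customers are first selected by attributes such as customer_type or location, approach 5 can be restricted to that selection. This is a sketch only: the column customer_type is taken from those comments, while the value 'retail' and the table name customer (as used in the queries above) are assumptions.

```sql
SELECT c.customer_id
     , (SELECT value FROM tbl t WHERE t.customer_id = c.customer_id AND t.item = 'price'     ORDER BY t.timestamp DESC LIMIT 1) AS price
     , (SELECT value FROM tbl t WHERE t.customer_id = c.customer_id AND t.item = 'condition' ORDER BY t.timestamp DESC LIMIT 1) AS condition
FROM   customer c
WHERE  c.customer_type = 'retail';  -- hypothetical filter value
```

The fewer customers the outer query selects, the fewer index lookups are needed, which is where this approach should gain most over a sequential scan of the big table.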

db<>fiddle here

Improving your relational design and/or upgrading to a current version of Postgres would go a long way, too.
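
As an illustration of one newer feature mentioned in the comments under the question, WITH TIES (Postgres 13 or later, so not available on 9.5) can replace the subquery plus rn = 1 filter. This is a sketch, not a tested replacement, and it returns one row per (customer_id, item) that still needs to be pivoted:

```sql
-- Postgres 13+ only: all rows with row_number() = 1 tie for first place,
-- so FETCH FIRST 1 ROWS WITH TIES returns the latest row per (customer_id, item).
SELECT customer_id, item, value
FROM   tbl
WHERE  item = ANY ('{condition,price,feeling,weather}')
ORDER  BY row_number() OVER (PARTITION BY customer_id, item ORDER BY timestamp DESC)
FETCH  FIRST 1 ROWS WITH TIES;
```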

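【Discussion】:

  • Thank you very much for the multiple suggestions. You gave me a lot of ideas about my problem. It made me realize that there are several ways to get the data, and that what I have to do is find the one that best fits my problem. These solutions helped me a lot.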