【Question title】: Select latest rows for combination of columns
【Posted】: 2018-06-21 15:16:15
【Question】:

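I have a log table purchase_history that tracks customers' purchase history. For a given customer_id, I want to get the most recent purchase of each product, ordered by date_purchased.

The table has millions of records. My solution is very slow (20+ seconds) for customer_ids that account for a large share of the table (for example, 25% of the records for some customer_ids), but very fast (about 1 second) for customer_ids that have only a few rows.

Table definition: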

create table purchase_history (
  id int PRIMARY KEY,
  product_name VARCHAR(100),
  date_purchased date,
  customer_id int
);

Some dummy data:

INSERT into purchase_history VALUES (
    1, 'A', '2017-10-10', 123)
 , (2, 'A', '2017-10-11', 123)
 , (3, 'B', '2017-10-12', 123)
 , (4, 'C', '2017-10-09', 123)
 , (5, 'B', '2017-11-10', 123);

I have a multi-column index on (customer_id, product_name, date_purchased).

The result I want:

5,B,2017-11-10
2,A,2017-10-11
4,C,2017-10-09

The solution I have come up with so far:

SELECT *
FROM (
       SELECT DISTINCT ON (product_name) *
       FROM purchase_history
       WHERE customer_id = 123
       ORDER BY product_name, date_purchased DESC
     ) t
ORDER BY date_purchased DESC;

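Is there a better or faster solution?


Update: 14 January 2018

Thanks for the comments and answers so far, and sorry for the confusion. I'd like to add some details:

  1. All columns are NOT NULL, including date_purchased.
  2. My index matches the sort order (date_purchased DESC):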

    create index purchase_history_idx on purchase_history(customer_id, product_name, date_purchased DESC)
    
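  3. It would be better to use a product_id that references another table, but unfortunately product_name does not exist in any other table. It is a name specified by the customer. Say I have a UI where customers enter what they want to buy; whatever the customer types is the product_name. So purchase_history tracks all "wish lists" of all customers.

Record counts:

  • The table has 20M records in total.
  • customer_id=123 is our largest customer, with 8573491 records (42%).
  • customer_id=124 is our second-largest customer, with 3062464 records (15%).

Here is the EXPLAIN ANALYZE output for my original DISTINCT ON solution: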

Sort  (cost=2081285.86..2081607.09 rows=128492 width=106) (actual time=11771.444..12012.732 rows=623680 loops=1)
  Sort Key: purchase_history.date_purchased
  Sort Method: external merge  Disk: 69448kB
  ->  Unique  (cost=0.56..2061628.55 rows=128492 width=106) (actual time=0.021..11043.910 rows=623680 loops=1)
        ->  Index Scan using purchase_history_idx on purchase_history  (cost=0.56..2040413.77 rows=8485910 width=106) (actual time=0.019..8506.109 rows=8573491 loops=1)
              Index Cond: (customer_id = 123)
Planning time: 0.098 ms
Execution time: 12133.664 ms

Here is the EXPLAIN ANALYZE output for Erwin's CTE solution:

Sort  (cost=125.62..125.87 rows=101 width=532) (actual time=30924.208..31154.908 rows=623680 loops=1)
  Sort Key: cte.date_purchased
  Sort Method: external merge  Disk: 33880kB
  CTE cte
    ->  Recursive Union  (cost=0.56..120.23 rows=101 width=39) (actual time=0.022..29772.944 rows=623680 loops=1)
          ->  Limit  (cost=0.56..0.80 rows=1 width=39) (actual time=0.020..0.020 rows=1 loops=1)
                ->  Index Scan using purchase_history_idx on purchase_history  (cost=0.56..2040413.77 rows=8485910 width=39) (actual time=0.019..0.019 rows=1 loops=1)
                      Index Cond: (customer_id = 123)
          ->  Nested Loop  (cost=0.56..11.74 rows=10 width=39) (actual time=0.046..0.047 rows=1 loops=623680)
                ->  WorkTable Scan on cte c  (cost=0.00..0.20 rows=10 width=516) (actual time=0.000..0.000 rows=1 loops=623680)
                ->  Limit  (cost=0.56..1.13 rows=1 width=39) (actual time=0.045..0.045 rows=1 loops=623680)
                      ->  Index Scan using purchase_history_idx on purchased_history purchased_history_1  (cost=0.56..1616900.83 rows=2828637 width=39) (actual time=0.044..0.044 rows=1 loops=623680)
                            Index Cond: ((customer_id = 123) AND ((product_name)::text > (c.product_name)::text))
  ->  CTE Scan on cte  (cost=0.00..2.02 rows=101 width=532) (actual time=0.024..30269.107 rows=623680 loops=1)
Planning time: 0.207 ms
Execution time: 31273.462 ms

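Another thing that surprised me is that my query runs much slower for customer_id=124, which has far fewer records than customer_id=123. (Note: it uses a bitmap index scan followed by a sort instead of a plain index scan, and I don't know why.)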

Sort  (cost=1323695.21..1323812.68 rows=46988 width=106) (actual time=85739.561..85778.735 rows=109347 loops=1)
  Sort Key: purchase_history.date_purchased
  Sort Method: external merge  Disk: 14560kB
  ->  Unique  (cost=1301329.65..1316845.56 rows=46988 width=106) (actual time=60443.890..85608.347 rows=109347 loops=1)
        ->  Sort  (cost=1301329.65..1309087.61 rows=3103183 width=106) (actual time=60443.888..84727.062 rows=3062464 loops=1)
              Sort Key: purchase_history.product_name, purchase_history.date_purchased
              Sort Method: external merge  Disk: 427240kB
              ->  Bitmap Heap Scan on purchase_history  (cost=203634.23..606098.02 rows=3103183 width=106) (actual time=8340.662..10584.483 rows=3062464 loops=1)
                    Recheck Cond: (customer_id = 124)
                    Rows Removed by Index Recheck: 4603902
                    Heap Blocks: exact=41158 lossy=132301
                    ->  Bitmap Index Scan on purchase_history_idx  (cost=0.00..202858.43 rows=3103183 width=0) (actual time=8331.711..8331.711 rows=3062464 loops=1)
                          Index Cond: (customer_id = 124)
Planning time: 0.102 ms
Execution time: 85872.871 ms

Update: 15 January 2018

Here is the explain (analyze, buffers) output that riskop asked for:

GroupAggregate  (cost=0.56..683302.46 rows=128492 width=31) (actual time=0.028..5156.113 rows=623680 loops=1)
  Group Key: product_name
  Buffers: shared hit=1242675
  ->  Index Only Scan using purchase_history_idx on purchase_history  (cost=0.56..639587.99 rows=8485910 width=31) (actual time=0.022..2673.661 rows=8573491 loops=1)
        Index Cond: (customer_id = 123)
        Heap Fetches: 0
        Buffers: shared hit=1242675
Planning time: 0.079 ms
Execution time: 5272.877 ms

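Note that even though it is faster, I cannot use this query as is, for two reasons:

  1. The query does not specify an ordering, while my expected result set is ordered by date_purchased DESC.
  2. I need to include several more columns in the result set, so I cannot use a plain GROUP BY.

One way to solve both issues is to use riskop's GROUP BY query as a subquery or CTE, and add the ORDER BY and the extra columns as needed.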
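
A minimal sketch of that idea, assuming at most one row per (customer_id, product_name, date_purchased); if two rows share the same latest date, the join returns both. The CTE name latest and the aliases are only illustrative, and whether the join back to the table keeps the speed advantage would have to be measured.

```sql
-- Aggregate first (this part can use an index-only scan),
-- then join back to fetch the remaining columns.
WITH latest AS (
   SELECT product_name, max(date_purchased) AS date_purchased
   FROM   purchase_history
   WHERE  customer_id = 123
   GROUP  BY product_name
   )
SELECT ph.*
FROM   latest l
JOIN   purchase_history ph
       ON  ph.customer_id    = 123
       AND ph.product_name   = l.product_name
       AND ph.date_purchased = l.date_purchased
ORDER  BY ph.date_purchased DESC;
```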


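Update: 21 January 2018

Taking advantage of a "loose index scan" sounds good, but unfortunately product_name is highly distributed. There are 1810440 unique product_name values and 2565179 unique (customer_id, product_name) combinations: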

select count(distinct product_name) from purchase_history; -- 1810440

select count(distinct (customer_id, product_name)) from purchase_history; -- 2565179

As a result, the query that took 313 ms for riskop took 33 seconds for me:

Sort  (cost=122.42..122.68 rows=101 width=532) (actual time=33509.943..33748.856 rows=623680 loops=1)
  Sort Key: cte.date_purchased
  Sort Method: external merge  Disk: 33880kB
  Buffers: shared hit=3053791 read=69706, temp read=4244 written=8484
  CTE cte
    ->  Recursive Union  (cost=0.56..117.04 rows=101 width=39) (actual time=5.886..32288.212 rows=623680 loops=1)
          Buffers: shared hit=3053788 read=69706
          ->  Limit  (cost=0.56..0.77 rows=1 width=39) (actual time=5.885..5.885 rows=1 loops=1)
                Buffers: shared hit=5 read=3
                ->  Index Scan using purchase_history_idx on purchase_history  (cost=0.56..1809076.40 rows=8543899 width=39) (actual time=5.882..5.882 rows=1 loops=1)
                      Index Cond: (customer_id = 123)
                      Buffers: shared hit=5 read=3
          ->  Nested Loop  (cost=0.56..11.42 rows=10 width=39) (actual time=0.050..0.051 rows=1 loops=623680)
                Buffers: shared hit=3053783 read=69703
                ->  WorkTable Scan on cte c  (cost=0.00..0.20 rows=10 width=516) (actual time=0.000..0.000 rows=1 loops=623680)
                ->  Limit  (cost=0.56..1.10 rows=1 width=39) (actual time=0.049..0.049 rows=1 loops=623680)
                      Buffers: shared hit=3053783 read=69703
                      ->  Index Scan using purchase_history_idx on purchase_history purchase_history_1  (cost=0.56..1537840.29 rows=2847966 width=39) (actual time=0.048..0.048 rows=1 loops=623680)
                            Index Cond: ((customer_id = 123) AND ((product_name)::text > (c.product_name)::text))
                            Buffers: shared hit=3053783 read=69703
  ->  CTE Scan on cte  (cost=0.00..2.02 rows=101 width=532) (actual time=5.889..32826.816 rows=623680 loops=1)
        Buffers: shared hit=3053788 read=69706, temp written=4240
Planning time: 0.278 ms
Execution time: 33873.798 ms

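Note that it did an in-memory sort for riskop (Sort Method: quicksort Memory: 853kB), but an external disk sort for me (Sort Method: external merge Disk: 33880kB).

If this is not a solvable problem for a relational database, I wonder whether there is a non-relational database or a big-data based solution, as long as it meets two requirements:

  1. Reasonable response time (2 seconds, for example).
  2. Real time, without delay.

【Comments】:

  • If there are many rows, it sometimes chooses a sequential scan. Can you post an EXPLAIN ANALYZE? I don't think a GROUP BY, i.e. "select product_name, date_purchased from purchase_history where customer_id = 123 group by product_name, date_purchased", will help, but it is worth a try.
  • {product_name, date_purchased} could be a natural key (if it were unique, which it is not). The same goes for {customer_id, date_purchased}, so you end up with all three of them as the natural key (if date_purchased is unique enough ... -->> it should be a timestamp).
  • So, did you get an answer?
  • You could create a "helper" table with columns (customer_id, product_id, last_purchase_date, id). In that table, customer_id and product_id would form the composite key. According to your 21 January update, that table would hold about 2.5 million records, far fewer than the original. You could also index this table on (customer_id, last_purchase_date). I would expect queries searching on customer_id + last_purchase_date to be fast. The price is that you have to maintain the new table and its index every time a record is inserted into the 20M-row table.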
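
A rough sketch of such a helper table, assuming PostgreSQL 9.5 or later for ON CONFLICT. The names latest_purchase and latest_purchase_idx are hypothetical, and product_name stands in for product_id because the question states that no product table exists. Updates and deletes on purchase_history are not handled here.

```sql
-- Hypothetical summary table: one row per (customer_id, product_name).
CREATE TABLE latest_purchase (
   customer_id        int          NOT NULL,
   product_name       varchar(100) NOT NULL,
   last_purchase_date date         NOT NULL,
   id                 int          NOT NULL,  -- id of the latest purchase_history row
   PRIMARY KEY (customer_id, product_name)
);

CREATE INDEX latest_purchase_idx
   ON latest_purchase (customer_id, last_purchase_date DESC);

-- To be run together with every insert into purchase_history
-- (from application code or a trigger). Keeps only the newest purchase.
INSERT INTO latest_purchase (customer_id, product_name, last_purchase_date, id)
VALUES (123, 'B', '2017-11-10', 5)
ON CONFLICT (customer_id, product_name) DO UPDATE
SET    last_purchase_date = EXCLUDED.last_purchase_date
     , id                 = EXCLUDED.id
WHERE  latest_purchase.last_purchase_date <= EXCLUDED.last_purchase_date;

-- The read side becomes a plain indexed lookup.
SELECT id, product_name, last_purchase_date
FROM   latest_purchase
WHERE  customer_id = 123
ORDER  BY last_purchase_date DESC;
```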

Tags: sql postgresql greatest-n-per-group query-performance postgresql-performance


【Solution 1】:

Try expressing your GROUP BY explicitly:

SELECT *
FROM purchase_history ph
JOIN 
(
       SELECT product_name, MAX(date_purchased) max_date_purchased
       FROM purchase_history
       WHERE customer_id = 123
       GROUP BY product_name
) t ON ph.product_name = t.product_name and
       ph.date_purchased = t.max_date_purchased and
       ph.customer_id = 123
ORDER BY ph.date_purchased DESC;

Another solution is to use a window function:

SELECT *
FROM 
(
       SELECT *,
             dense_rank() over (partition by product_name order by date_purchased desc) rn
       FROM purchase_history
       WHERE customer_id = 123
) t 
WHERE t.rn = 1
ORDER BY t.date_purchased DESC;

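Test both and see which one performs better.

【Comments】:

  • The queries look fine, but I would not expect them to be much faster than the original. There is also a corner-case bug: MAX(date_purchased) is not equivalent to the original if NULL values can be involved (which is the case according to the table definition).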
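
To make the corner case concrete, here is a small sketch on a throwaway table (ph_demo is a hypothetical name, created only for this illustration) whose date column allows NULL:

```sql
-- Hypothetical demo table with a nullable date column.
CREATE TEMP TABLE ph_demo (id int, product_name text, date_purchased date);
INSERT INTO ph_demo VALUES (1, 'A', '2017-10-10'), (2, 'A', NULL);

-- max() ignores NULLs: returns 2017-10-10 for product A.
SELECT product_name, max(date_purchased)
FROM   ph_demo
GROUP  BY product_name;

-- DESC sorts NULLs first by default, so DISTINCT ON picks the row with id = 2.
SELECT DISTINCT ON (product_name) *
FROM   ph_demo
ORDER  BY product_name, date_purchased DESC;

-- With NULLS LAST, DISTINCT ON agrees with max() and picks the row with id = 1.
SELECT DISTINCT ON (product_name) *
FROM   ph_demo
ORDER  BY product_name, date_purchased DESC NULLS LAST;
```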
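【Solution 2】:

Index

Postgres can scan indexes backwards very efficiently, but I would still make the index match perfectly:

(customer_id, product_name, date_purchased DESC)

That is a minor optimization. However, since date_purchased can be NULL according to your table definition, you probably want ORDER BY product_name, date_purchased DESC NULLS LAST, and that should be accompanied by a matching index, which then is a major optimization: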

CREATE INDEX new_idx ON purchase_history
(customer_id, product_name, date_purchased DESC NULLS LAST);

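Query

DISTINCT ON is very efficient for few rows per (customer_id, product_name), but less so for many rows, which is your weak spot.

This recursive CTE should be able to make perfect use of a matching index: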

WITH RECURSIVE cte AS (
   (  -- parentheses required
   SELECT id, product_name, date_purchased
   FROM   purchase_history
   WHERE  customer_id = 123
   ORDER  BY product_name, date_purchased DESC NULLS LAST
   LIMIT  1
   )
   UNION ALL
   SELECT u.*
   FROM   cte c
   ,      LATERAL (
      SELECT id, product_name, date_purchased
      FROM   purchase_history
      WHERE  customer_id = 123               -- repeat condition
      AND    product_name > c.product_name   -- lateral reference
      ORDER  BY product_name, date_purchased DESC NULLS LAST
      LIMIT  1
      ) u
   )
TABLE  cte
ORDER  BY date_purchased DESC NULLS LAST;

You could even fork the logic: run the rCTE for customers with many rows and stick with DISTINCT ON for customers with few rows.
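
One possible way to fork, sketched as a PL/pgSQL function. The table big_customer, the function latest_purchases and its output column names are all hypothetical; how to decide which customers count as "big" is left open and is only assumed to be a maintained list here.

```sql
-- Hypothetical list of customers that own a large share of the table.
CREATE TABLE big_customer (customer_id int PRIMARY KEY);
INSERT INTO big_customer VALUES (123), (124);

-- Output column names differ from the table's column names
-- to avoid naming conflicts inside PL/pgSQL.
CREATE FUNCTION latest_purchases(_customer_id int)
  RETURNS TABLE (purchase_id int, product varchar, purchased date)
  LANGUAGE plpgsql STABLE AS
$func$
BEGIN
   IF EXISTS (SELECT 1 FROM big_customer b WHERE b.customer_id = _customer_id) THEN
      -- Many rows per product: emulate a loose index scan with the rCTE.
      RETURN QUERY
      WITH RECURSIVE cte AS (
         (
         SELECT ph.id, ph.product_name, ph.date_purchased
         FROM   purchase_history ph
         WHERE  ph.customer_id = _customer_id
         ORDER  BY ph.product_name, ph.date_purchased DESC NULLS LAST
         LIMIT  1
         )
         UNION ALL
         SELECT u.*
         FROM   cte c
         ,      LATERAL (
            SELECT ph.id, ph.product_name, ph.date_purchased
            FROM   purchase_history ph
            WHERE  ph.customer_id = _customer_id
            AND    ph.product_name > c.product_name
            ORDER  BY ph.product_name, ph.date_purchased DESC NULLS LAST
            LIMIT  1
            ) u
         )
      SELECT c.id, c.product_name, c.date_purchased
      FROM   cte c
      ORDER  BY c.date_purchased DESC NULLS LAST;
   ELSE
      -- Few rows per product: plain DISTINCT ON.
      RETURN QUERY
      SELECT t.id, t.product_name, t.date_purchased
      FROM  (
         SELECT DISTINCT ON (ph.product_name)
                ph.id, ph.product_name, ph.date_purchased
         FROM   purchase_history ph
         WHERE  ph.customer_id = _customer_id
         ORDER  BY ph.product_name, ph.date_purchased DESC NULLS LAST
         ) t
      ORDER  BY t.date_purchased DESC NULLS LAST;
   END IF;
END
$func$;

-- Usage:
SELECT * FROM latest_purchases(123);
```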

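Schema

It is worth noting that your table purchase_history has product_name VARCHAR(100). In a perfect world (a normalized schema), this would be product_id int (with an FK reference to a product table). That would help performance in several ways: a smaller table and smaller indexes, and much faster operations on integer than on varchar(100).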
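
A sketch of what such a normalized layout could look like. The table names product and purchase_history_norm and the index name are hypothetical, and the free-text names entered by customers would first have to be mapped to product rows:

```sql
-- Hypothetical lookup table: each distinct name is stored once.
CREATE TABLE product (
   product_id   serial PRIMARY KEY,
   product_name varchar(100) NOT NULL UNIQUE
);

-- Hypothetical normalized variant of the log table.
CREATE TABLE purchase_history_norm (
   id             int  PRIMARY KEY,
   product_id     int  NOT NULL REFERENCES product,
   date_purchased date NOT NULL,
   customer_id    int  NOT NULL
);

-- Same index shape as before, but on a 4-byte integer instead of varchar(100).
CREATE INDEX purchase_history_norm_idx
   ON purchase_history_norm (customer_id, product_id, date_purchased DESC);
```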

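【Comments】:

    【Solution 3】:

    I think the most important question is how the product names are distributed in your data.

    You mentioned that users fill in this information with product names, so I assume you have a few thousand distinct product_name values.

    If that is the case, then I think your problem is that PostgreSQL does not use a "loose index scan" (https://wiki.postgresql.org/wiki/Loose_indexscan), even though the number of distinct values is small compared to the total number of records.

    A good article describing a case very similar to yours: http://malisper.me/the-missing-postgres-scan-the-loose-index-scan/

    So I tried to reproduce your large dataset. The following procedure creates test data with 20 million rows. There are 10000 products (product_name is a random integer from 0 to 9999). There are 44 distinct customer_id values: 43% of the rows are "123", 15% are "124", and the remaining 42% are spread randomly between 58 and 99. date_purchased is a random date between 1092-04-05 and 1913-08-19.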

    do '
    begin 
    drop table purchase_history;
    create table purchase_history (
      id int,
      product_name VARCHAR(100) not null,
      date_purchased date not null,
      customer_id int not null
    );
    FOR i IN 0..20000000 - 1 LOOP
    insert into purchase_history values (
    i, 
    (select trunc(random() * 10000)), 
    to_date('''' || (select trunc(random() * 300000 + 2120000)), ''J''), 
    (select trunc(random() * 100))
    );
    end loop;
    update purchase_history set customer_id=123 where customer_id < 43;
    update purchase_history set customer_id=124 where customer_id < 58;
    ALTER TABLE purchase_history ADD PRIMARY KEY (id);
    end;
    '
    

    The index is the same as in your post:

    CREATE INDEX idx ON purchase_history
    (customer_id, product_name, date_purchased desc);
    

    Just to make sure we really have 10000 distinct product_name values:

    SELECT product_name FROM purchase_history GROUP BY product_name;
    

    Now the "reference" query runs in 3200 ms on this dataset:

    explain (analyze,buffers)
    SELECT product_name, max(date_purchased)
    FROM purchase_history 
    WHERE customer_id = 123
    GROUP BY product_name
    order by max(date_purchased) desc;
    

    Execution:

    Sort  (cost=171598.50..171599.00 rows=200 width=222) (actual time=3219.176..3219.737 rows=10000 loops=1)
    Sort Key: (max(date_purchased)) DESC
    Sort Method: quicksort  Memory: 853kB
    Buffers: shared hit=3 read=105201 written=11891
    ->  HashAggregate  (cost=171588.86..171590.86 rows=200 width=222) (actual time=3216.382..3217.361 rows=10000 loops=1)
          Group Key: product_name
          Buffers: shared hit=3 read=105201 written=11891
          ->  Bitmap Heap Scan on purchase_history  (cost=2319.56..171088.86 rows=100000 width=222) (actual time=766.196..1634.934 rows=8599329 loops=1)
                Recheck Cond: (customer_id = 123)
                Rows Removed by Index Recheck: 15263
                Heap Blocks: exact=45627 lossy=26625
                Buffers: shared hit=3 read=105201 written=11891
                ->  Bitmap Index Scan on idx  (cost=0.00..2294.56 rows=100000 width=0) (actual time=759.686..759.686 rows=8599329 loops=1)
                      Index Cond: (customer_id = 123)
                      Buffers: shared hit=3 read=32949 written=11859
    Planning time: 0.192 ms
    Execution time: 3220.096 ms
    

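    The optimized query, which is basically the same as Erwin's, uses the index and performs a "loose index scan" with the help of an iterative CTE (misleadingly named a "recursive" CTE). It runs in only 310 ms, about 10 times faster: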

    explain (analyze,buffers)
    WITH RECURSIVE cte AS (
       (  -- parentheses required
       SELECT id, product_name, date_purchased
       FROM   purchase_history
       WHERE  customer_id = 123
       ORDER  BY product_name, date_purchased DESC
       LIMIT  1
       )
       UNION ALL
       SELECT u.*
       FROM   cte c
       ,      LATERAL (
          SELECT id, product_name, date_purchased
          FROM   purchase_history
          WHERE  customer_id = 123               -- repeat condition
          AND    product_name > c.product_name   -- lateral reference
          ORDER  BY product_name, date_purchased DESC
          LIMIT  1
          ) u
       )
    TABLE  cte
    ORDER  BY date_purchased DESC NULLS LAST;
    

    Execution:

    Sort  (cost=444.02..444.27 rows=101 width=226) (actual time=312.928..313.585 rows=10000 loops=1)
    Sort Key: cte.date_purchased DESC NULLS LAST
    Sort Method: quicksort  Memory: 853kB
    Buffers: shared hit=31432 read=18617 written=14
    CTE cte
      ->  Recursive Union  (cost=0.56..438.64 rows=101 width=226) (actual time=0.054..308.678 rows=10000 loops=1)
            Buffers: shared hit=31432 read=18617 written=14
            ->  Limit  (cost=0.56..3.79 rows=1 width=226) (actual time=0.052..0.053 rows=1 loops=1)
                  Buffers: shared hit=4 read=1
                  ->  Index Scan using idx on purchase_history  (cost=0.56..322826.56 rows=100000 width=226) (actual time=0.050..0.050 rows=1 loops=1)
                        Index Cond: (customer_id = 123)
                        Buffers: shared hit=4 read=1
            ->  Nested Loop  (cost=0.56..43.28 rows=10 width=226) (actual time=0.030..0.030 rows=1 loops=10000)
                  Buffers: shared hit=31428 read=18616 written=14
                  ->  WorkTable Scan on cte c  (cost=0.00..0.20 rows=10 width=218) (actual time=0.000..0.000 rows=1 loops=10000)
                  ->  Limit  (cost=0.56..4.29 rows=1 width=226) (actual time=0.030..0.030 rows=1 loops=10000)
                        Buffers: shared hit=31428 read=18616 written=14
                        ->  Index Scan using idx on purchase_history purchase_history_1  (cost=0.56..124191.22 rows=33333 width=226) (actual time=0.030..0.030 rows=1 loops=10000)
                              Index Cond: ((customer_id = 123) AND ((product_name)::text > (c.product_name)::text))
                              Buffers: shared hit=31428 read=18616 written=14
    ->  CTE Scan on cte  (cost=0.00..2.02 rows=101 width=226) (actual time=0.058..310.821 rows=10000 loops=1)
          Buffers: shared hit=31432 read=18617 written=14
    Planning time: 0.418 ms
    Execution time: 313.988 ms
    

    【Comments】:

      【Solution 4】:

      Could you tell us the result of the following simplified query in your environment?

      explain (analyze,buffers)
      SELECT product_name, max(date_purchased) 
      FROM purchase_history 
      WHERE customer_id = 123
      GROUP BY product_name;
      

      【Comments】:
