In Postgres, is it good to use the split_part function instead of LIKE statements?

【Question title】: In Postgres, is it good to use the split_part function instead of LIKE statements?
【Posted】: 2021-07-26 15:45:57
【Question】:

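Can anyone explain or advise whether we can use split_part instead of LIKE in Postgres?

In my use case, the name column contains an intermediate string that is common to a particular category, for example Vinod.Game1, Vinod.Game2, Vinod.Game3 and so on.

Now I want to get the number of games Vinod has played, along with their details. I have two options: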

select * from games where name like 'Vinod.Game%'

or

select * from games where split_part(name, '.Game', 1) = 'Vinod'

When I checked this against 200 rows of data, I saw the stats below.

For the LIKE query:

Planning time: 120.326 ms
 Execution time: 2.878 ms

For the split_part query:

Planning time: 8.845 ms
 Execution time: 3.681 ms

Could you help me understand how planning time affects a query? If we have a database in the gigabyte range, which is better to use (split_part vs. LIKE)?

                                              Table "public.games"
      Column      |         Type          | Collation | Nullable |        Default         | Storage  | Stats target | Description
------------------+-----------------------+-----------+----------+------------------------+----------+--------------+-------------
 id               | character varying(32) |           | not null |                        | extended |              |
 access           | character varying(50) |           |          |                        | extended |              |
 deleted          | character varying(1)  |           |          | 'N'::character varying | extended |              |
 timePlayed       | character varying(50) |           |          |                        | extended |              |
 description      | character varying     |           |          |                        | extended |              |
 name             | character varying(64) |           |          |                        | extended |              |

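【Comments】:

This looks very unusual. Is it repeatable? How is the table defined?
Try building an index on (name). LIKE can use it when the wildcard is only at the end.
Table schema updated. What do you mean by repeatable?
@VinodKumarChaganti . . . Timings on small tables (e.g. 200 rows) are usually not reproducible or particularly meaningful. Try a million rows.
When I tried this with dbfiddle (dbfiddle.uk/…), I found that like is usually faster, but really not by much. Sometimes split_part() wins. I think that just means split_part() has an efficient implementation. I would go with like for three reasons: (1) it is standard; (2) it usually performs better; (3) it can use an index in some cases.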

Tags: sql postgresql performance sql-like

【Answer 1】:

The correct solution is to fix your data model.

Don't store delimited values in a single column. In the long run, this will hurt you again and again.
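
A minimal sketch of what splitting the delimited value into its own columns could look like. The column names player and game are hypothetical (they are not part of the original schema), and the sketch assumes every name has the form prefix.suffix with exactly one dot:

```sql
-- Hypothetical columns for illustration; not part of the original table.
alter table games
  add column player text,
  add column game   text;

-- One-time backfill from the existing delimited column.
update games
set player = split_part(name, '.', 1),
    game   = split_part(name, '.', 2);

-- A plain B-Tree index is then enough for equality lookups.
create index on games (player);

select *
from games
where player = 'Vinod';
```

With separate columns the query no longer depends on string parsing or on a special operator class. Names that contain more than one dot would need a different backfill rule.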

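But since the usual answer to this kind of advice is "I didn't create it, I have to live with it", you need to test both approaches.

To get a meaningful test, you have to create far more than 200 rows.

I created some dummy data with this approach: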

create table games
(
 id               character varying(32)  not null,
 access           character varying(50),
 deleted          character varying(1) default 'N'::character varying ,
 timePlayed       character varying(50),
 description      text ,
 name             character varying(64)                                   
);

insert into games(id, access, timeplayed, description, name)
select g.id::text, 
       'full',
       'all night long',
       'some description',
       case 
          when random() < 0.1 then 'Vinod.'
          when random() < 0.2 then 'Vivos.'
          when random() < 0.3 then 'Doniv.'
          when random() < 0.5 then 'Novid.'
          when random() < 0.6 then 'Somevid.'
          when random() < 0.7 then 'OtherVid.'
          when random() < 0.8 then 'Fonod.'
          else 'Barnod.'
       end || 'Game' || (random() * 999 + 1)::int
from generate_series(1,1e6) as g(id);

create index on games (name varchar_pattern_ops);
create index on games  ( (split_part(name, '.', 1)) );
vacuum analyze games;

The above generates one million rows, about 10% of which start with Vinod.
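
One way to check how the generated rows are actually distributed is a simple aggregate over the same expression that is indexed above. This is only a sketch; the aliases prefix, row_count and pct are made up for illustration:

```sql
-- Count rows per leading element of name and show each share of the total.
select split_part(name, '.', 1) as prefix,
       count(*)                 as row_count,
       round(100.0 * count(*) / sum(count(*)) over (), 1) as pct
from games
group by prefix
order by row_count desc;
```

Because the data is built with random(), the exact counts differ on every run.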

Note that I split only on . rather than on .Game, which makes more sense to me: pick the first element delimited by a dot.
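
The two predicates are not strictly equivalent, which a few literal values can illustrate. The sample names Vinod.Other1 and Vinod below are invented for this sketch and do not occur in the generated test data:

```sql
-- Compare what each predicate matches for a few sample names.
select v.name,
       split_part(v.name, '.', 1)           as first_element,
       v.name like 'Vinod.Game%'            as matches_like,
       split_part(v.name, '.', 1) = 'Vinod' as matches_split
from (values ('Vinod.Game1'), ('Vinod.Other1'), ('Vinod')) as v(name);
```

All three rows return Vinod as the first element, but only Vinod.Game1 matches the LIKE pattern. In the generated test data every name has the form prefix.GameN, so both queries return the same rows there.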

With the table cached, the LIKE query runs in about 70 ms and the split_part() query in about 25 ms (Windows 10 laptop with Postgres 13.2). So split_part() with an expression-based index seems to be the winner.

explain (analyze, buffers)
select * 
from games 
where name like 'Vinod.Game%';


QUERY PLAN                                                                                                                         
-----------------------------------------------------------------------------------------------------------------------------------
Index Scan using games_name_idx on games  (cost=0.42..12490.69 rows=99080 width=59) (actual time=0.018..67.705 rows=100189 loops=1)
  Index Cond: (((name)::text ~>=~ 'Vinod.Game'::text) AND ((name)::text ~<~ 'Vinod.Gamf'::text))                                   
  Filter: ((name)::text ~~ 'Vinod.Game%'::text)                                                                                    
  Buffers: shared hit=99863                                                                                                        
Planning Time: 0.669 ms                                                                                                            
Execution Time: 70.365 ms

explain (analyze, buffers)
select * 
from games 
where split_part(name, '.', 1) = 'Vinod'

QUERY PLAN                                                                                                                               
-----------------------------------------------------------------------------------------------------------------------------------------
Index Scan using games_split_part_idx on games  (cost=0.42..11657.32 rows=99400 width=59) (actual time=0.025..20.793 rows=100189 loops=1)
  Index Cond: (split_part((name)::text, '.'::text, 1) = 'Vinod'::text)                                                                   
  Buffers: shared hit=11450                                                                                                              
Planning Time: 0.098 ms                                                                                                                  
Execution Time: 23.605 ms

But again: the correct way to solve this is to normalize the data model.

【Comments】:

【Answer 2】:

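The first query, using LIKE, will scan the table (not an index scan, because of the ugly "SELECT *"...).

The second query, using split_part, will build an internal dataset (the split_part function) and scan the resulting dataset to find the qualifying rows.

In both cases, no index at all can speed up your query, because your predicates are not sargable.

In fact, the data you are querying violates first normal form (atomic data). When such a mistake is made, your database is not relational at all, and when you do not have a relational database, an RDBMS cannot handle it with good performance, because an RDBMS is designed specifically to manipulate relations, not "cobol"-style data structures!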

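【Comments】:

A B-Tree index can absolutely support the first query (when using the varchar_pattern_ops operator class). For the second query, one can create an expression-based index: create index on games ( (split_part(name, '.Game', 1)) ). But I agree that this is a bad database design.
@a_horse_with_no_name Yes, but only if the % wildcard character is always at the end of the criterion value. Second, the cost of an index on the split function will be a lack of performance...
An expression-based index is not more expensive than a regular index.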