PostgreSQL vs Python - GROUP BY performance

【Question title】: Postgres SQL vs Python - GROUP BY Performance
【Posted】: 2022-10-04 17:57:58
【Question】:

There is a table "transaction" that has:

id (auto-increment id)
title (text)
description (text)
vendor (text)

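The requirement is to list the 100 most frequently used words found in any of these fields, together with their two-word permutations, ignoring reversed pairs (for example, the permutations of A and B are AA, AB, BB, BA, and we want to exclude the cases A=B and A>B). For example, if a transaction has:

title = PayPal payment
description =

vendor = Sony

we want a distinct list of words [PayPal, payment, Sony]. Note that in some cases a word may carry punctuation, which has to be removed.

So the expected result is: [PayPal, payment, Sony, PayPal payment, PayPal Sony, payment Sony]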
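
As a minimal sketch of this requirement for a single transaction, the following Python assumes whitespace tokenization and strips every ASCII punctuation character. The helper name `words_and_pairs` is hypothetical, and the pair order follows Python's default string comparison (uppercase sorts before lowercase), so it may differ from the order shown in the expected result.

```python
import string
from itertools import combinations

# Translation table that deletes every ASCII punctuation character.
_STRIP_PUNCT = str.maketrans("", "", string.punctuation)


def words_and_pairs(*fields):
    """Return (distinct words, unordered two-word pairs) for one transaction."""
    # Skip empty or missing fields, as the description is in the example.
    text = " ".join(f for f in fields if f)
    # split() with no argument collapses runs of spaces, tabs and newlines.
    words = sorted({w.translate(_STRIP_PUNCT) for w in text.split()} - {""})
    # combinations() on a sorted list yields each pair once with a < b,
    # so neither AA nor the reversed pair BA ever appears.
    pairs = [f"{a} {b}" for a, b in combinations(words, 2)]
    return words, pairs


words, pairs = words_and_pairs("PayPal payment", "", "Sony")
print(words)
print(pairs)
```

Sorting the distinct words first is what makes the A<B rule cheap: each pair is generated exactly once instead of being generated twice and filtered.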

I wrote a SQL query for Postgres that does this, but its performance is terrible:

WITH
    oneWord as (SELECT t.id, a.word, t.gross_amount
                FROM (SELECT * FROM transaction t) t,
                    unnest(string_to_array(regexp_replace(regexp_replace(
                        concat(t.vendor, ' ',
                             t.title, ' ',
                             t.description),
                      '[\s+]', ' ', 'g'), '[[:punct:]]', '', 'g'), ' ',
                '')) as a(word)
                WHERE a.word NOT IN (SELECT word FROM wordcloudexclusion)
    ),
    oneWordDistinct as (SELECT id, word, gross_amount FROM oneWord),
    twoWord as (SELECT a.id,CONCAT(a.word, ' ', b.word) as word, a.gross_amount
                from oneWord a, oneWord b
                where a.id = b.id and a < b),
    allWord as (SELECT oneWordDistinct.id as id, oneWordDistinct.word as word, oneWordDistinct.gross_amount as gross_amount
                from oneWordDistinct
                union all
                SELECT twoWord.id as id, twoWord.word as word, twoWord.gross_amount as gross_amount
                from twoWord)
SELECT a.word, count(a.id) FROM allWord a GROUP BY a.word ORDER BY 2 DESC LIMIT 100;

and did the same thing in Python, as follows:

text_stats = {}
transactions = (SELECT id, title, description, vendor, gross_amount FROM transactions)
for [id, title, description, vendor, amount] in list(transactions):

    text = " ".join(filter(None, [title, description, vendor]))
    text_without_punctuation = re.sub(r"[.!?,]+", "", text)
    text_without_tabs = re.sub(
        r"[\n\t\r]+", " ", text_without_punctuation
    ).strip(" ")
    words = list(set(filter(None, text_without_tabs.split(" "))))
    for a_word in words:
        if a_word not in excluded_words:
            if not text_stats.get(a_word):
                text_stats[a_word] = {
                    "count": 1,
                    "amount": amount,
                    "word": a_word,
                }
            else:
                text_stats[a_word]["count"] += 1
                text_stats[a_word]["amount"] += amount
            for b_word in words:
                if b_word > a_word:
                    sentence = a_word + " " + b_word
                    if not text_stats.get(sentence):
                        text_stats[sentence] = {
                            "count": 1,
                            "amount": amount,
                            "word": sentence,
                        }
                    else:
                        text_stats[sentence]["count"] += 1
                        text_stats[sentence]["amount"] += amount

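My question is: is there a way to improve the SQL's performance so that it is not completely obliterated by Python? Currently, on a transaction table with 20k records, Python takes ~6-8 seconds and the SQL query takes 1 minute 10 seconds.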

Here is the SQL EXPLAIN ANALYZE output:

Limit  (cost=260096.60..260096.85 rows=100 width=40) (actual time=63928.627..63928.639 rows=100 loops=1)
  CTE oneword
    ->  Nested Loop  (cost=16.76..2467.36 rows=44080 width=44) (actual time=1.875..126.778 rows=132851 loops=1)
          ->  Seq Scan on gc_api_transaction t  (cost=0.00..907.80 rows=8816 width=110) (actual time=0.018..4.176 rows=8816 loops=1)
                Filter: (company_id = 2)
                Rows Removed by Filter: 5648
          ->  Function Scan on unnest a_2  (cost=16.76..16.89 rows=5 width=32) (actual time=0.010..0.013 rows=15 loops=8816)
                Filter: (NOT (hashed SubPlan 1))
                Rows Removed by Filter: 2
                SubPlan 1
                  ->  Seq Scan on gc_api_wordcloudexclusion  (cost=0.00..15.40 rows=540 width=118) (actual time=1.498..1.500 rows=7 loops=1)
  ->  Sort  (cost=257629.24..257629.74 rows=200 width=40) (actual time=63911.588..63911.594 rows=100 loops=1)
        Sort Key: (count(oneword.id)) DESC
        Sort Method: top-N heapsort  Memory: 36kB
        ->  HashAggregate  (cost=257619.60..257621.60 rows=200 width=40) (actual time=23000.982..63803.962 rows=1194618 loops=1)
              Group Key: oneword.word
              Batches: 85  Memory Usage: 4265kB  Disk Usage: 113344kB
              ->  Append  (cost=0.00..241207.14 rows=3282491 width=36) (actual time=1.879..5443.143 rows=2868282 loops=1)
                    ->  CTE Scan on oneword  (cost=0.00..881.60 rows=44080 width=36) (actual time=1.878..579.936 rows=132851 loops=1)
                    ->  Subquery Scan on "*SELECT* 2"  (cost=13085.79..223913.09 rows=3238411 width=36) (actual time=2096.116..4698.727 rows=2735431 loops=1)
                          ->  Merge Join  (cost=13085.79..191528.98 rows=3238411 width=44) (actual time=2096.114..4492.451 rows=2735431 loops=1)
                                Merge Cond: (a_1.id = b.id)
                                Join Filter: (a_1.* < b.*)
                                Rows Removed by Join Filter: 2879000
                                ->  Sort  (cost=6542.90..6653.10 rows=44080 width=96) (actual time=1088.083..1202.200 rows=132851 loops=1)
                                      Sort Key: a_1.id
                                      Sort Method: external merge  Disk: 8512kB
                                      ->  CTE Scan on oneword a_1  (cost=0.00..881.60 rows=44080 width=96) (actual time=3.904..101.754 rows=132851 loops=1)
                                ->  Materialize  (cost=6542.90..6763.30 rows=44080 width=96) (actual time=1007.989..1348.317 rows=5614422 loops=1)
                                      ->  Sort  (cost=6542.90..6653.10 rows=44080 width=96) (actual time=1007.984..1116.011 rows=132851 loops=1)
                                            Sort Key: b.id
                                            Sort Method: external merge  Disk: 8712kB
                                            ->  CTE Scan on oneword b  (cost=0.00..881.60 rows=44080 width=96) (actual time=0.014..20.998 rows=132851 loops=1)
Planning Time: 0.537 ms
JIT:
  Functions: 49
  Options: Inlining false, Optimization false, Expressions true, Deforming true
  Timing: Generation 6.119 ms, Inlining 0.000 ms, Optimization 2.416 ms, Emission 17.764 ms, Total 26.299 ms
Execution Time: 63945.718 ms

PostgreSQL version: PostgreSQL 14.5 (Debian 14.5-1.pgdg110+1) on aarch64-unknown-linux-gnu, compiled by gcc (Debian 10.2.1-6) 10.2.1 20210110, 64-bit

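【Discussion】:

Which Postgres version are you using?
It won't make a difference, but: FROM (SELECT * FROM transaction t) can be simplified to FROM transaction t. And the CTE oneWordDistinct is completely useless.
Version: PostgreSQL 14.5
oneWordDistinct is meant to hold the distinct words; that is not implemented yet in this scenario.
The Python code does not compile. What is interesting is those few seconds. How did you time it? With Bash's time: time python myscript.py and time psql -U myuser -d mydb -f myscript.sql?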

Tags: python postgresql performance

【Solution 1】:

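For a database, indexes are everything, but you are using functions such as unnest, string_to_array, regexp_replace, and concat. None of these can make use of an index.

So, for the best performance, you need to create a table such as transaction_words with columns like transaction_id and word, holding the words of each transaction. You also need to create triggers on the transaction table that fire on every insert, delete, and update, and refresh the transaction_words rows related to the affected records.

After that, create an index for performance and join transaction_words to itself.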
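
The table-plus-index idea can be sketched as follows, with Python's built-in sqlite3 module standing in for Postgres purely so that the example runs anywhere. The schema, the index name, and the sample rows are assumptions, and the triggers recommended above are left out: the word rows are inserted directly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- One row per distinct word per transaction (hypothetical schema).
    CREATE TABLE transaction_words (
        transaction_id INTEGER NOT NULL,
        word           TEXT    NOT NULL
    );
    -- Unique index that supports the self-join on transaction_id.
    CREATE UNIQUE INDEX transaction_words_tid_word
        ON transaction_words (transaction_id, word);
    """
)

# Sample data: in the real setup these rows would be maintained by triggers.
rows = [
    (1, "PayPal"), (1, "payment"), (1, "Sony"),
    (2, "PayPal"), (2, "Sony"),
]
conn.executemany("INSERT INTO transaction_words VALUES (?, ?)", rows)

top = conn.execute(
    """
    SELECT word, COUNT(*) AS n FROM (
        SELECT word FROM transaction_words
        UNION ALL
        SELECT a.word || ' ' || b.word
        FROM transaction_words a
        JOIN transaction_words b
          ON a.transaction_id = b.transaction_id
         AND a.word < b.word   -- compare the words, keeping each pair once
    )
    GROUP BY word
    ORDER BY n DESC, word
    LIMIT 100
    """
).fetchall()
print(top)
```

Note that the join condition compares the word columns. The plan in the question shows `a_1.* < b.*`, which is a whole-row comparison rather than a comparison of the two words.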

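My suggestion: instead of computing this on every request, create a materialized view that is refreshed periodically, which works better for large systems. That way your application does not have to wait for the database to run the query. Right now your system has just 20k records, so you do not notice the memory consumed by functions such as string_to_array, but once the data grows to millions or billions of rows, your SQL will fail to complete because of the memory these functions use.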
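
The periodic-refresh idea can be sketched as follows. SQLite has no materialized views, so this example emulates one with an ordinary summary table that a hypothetical `refresh_word_stats` function rebuilds; in Postgres the counterpart would be `CREATE MATERIALIZED VIEW` plus `REFRESH MATERIALIZED VIEW` run on a schedule. All table and column names here are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE transaction_words (transaction_id INTEGER, word TEXT);
    -- Precomputed counts; readers query this instead of the raw words.
    CREATE TABLE word_stats (word TEXT PRIMARY KEY, n INTEGER NOT NULL);
    """
)


def refresh_word_stats(conn):
    """Rebuild word_stats in one transaction (meant to run from a scheduler)."""
    with conn:  # commits on success, rolls back on error
        conn.execute("DELETE FROM word_stats")
        conn.execute(
            """
            INSERT INTO word_stats (word, n)
            SELECT word, COUNT(*) FROM transaction_words GROUP BY word
            """
        )


def top_words(conn, limit=100):
    """Cheap read path: no splitting or joining at request time."""
    return conn.execute(
        "SELECT word, n FROM word_stats ORDER BY n DESC, word LIMIT ?",
        (limit,),
    ).fetchall()


conn.executemany(
    "INSERT INTO transaction_words VALUES (?, ?)",
    [(1, "PayPal"), (1, "Sony"), (2, "PayPal")],
)
print(top_words(conn))  # the summary is empty until the first refresh
refresh_word_stats(conn)
print(top_words(conn))
```

The trade-off is staleness: reads are fast, but they only reflect the data as of the last refresh.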

【Discussion】: