Efficient Popular Words PostgreSQL query: answers

【Question title】: Efficient Popular Words PostgreSQL query
【Posted】: 2022-01-15 22:57:07
【Question description】:

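Update: added the second, poorly performing query.

I'm trying to write a query that produces the popular words for a large, constantly changing dataset. However, I can't get useful results, or can't get the query to finish in a reasonable amount of time.

I have two problems:

When I use a tsvector in the query below, my results are lexemes, which I don't want to show to end users.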

SELECT word, ndoc
FROM
ts_stat($$ SELECT normalized_tsvector FROM activities 
   WHERE
      activities.identifiedat >
      current_timestamp - interval '60 minutes'
   $$)
WHERE word NOT IN ('like', 'to', 'the', 'at', 'in', 'a')
ORDER BY ndoc DESC LIMIT 50

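This returns lexemes such as "peopl" rather than "people", which are not useful.

The query I found here is too slow: even on a sample dataset (700 items), it ran for 20 minutes without returning.

The query I tried is as follows: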

with popular_words as (
    select word
    from ts_stat('select normalized_tsvector from activities')
    where nentry > 1
    and not word in ('to', 'the', 'at', 'in', 'a')
)
select concat_ws(' ', a1.word, a2.word) phrase, count(*)
from popular_words as a1
cross join popular_words as a2
cross join activities
where normalized ilike format('%%%s %s%%', a1.word, a2.word)
group by 1
having count(*) > 1
order by 2 desc;

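My questions are twofold:

Is there a way to convert lexemes back into words, or at least to find out which words match them? I'm wondering whether I could run a separate query to find the most common usage of the words that match a given lexeme.
Is there a way to improve the performance of the second query? Perhaps some kind of index?

My table is as follows: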

CREATE TABLE IF NOT EXISTS activities (
    id SERIAL PRIMARY KEY,
    document JSONB,
    normalized TEXT,
    identifiedat TIMESTAMP with time zone DEFAULT now(),
    instance VARCHAR(1000) NOT NULL
);

ALTER TABLE activities 
  ADD normalized_tsvector tsvector
    GENERATED ALWAYS AS (to_tsvector('english', normalized)) STORED;


CREATE UNIQUE INDEX IF NOT EXISTS  activities_uri_idx ON activities ( (document->>'id') );

CREATE INDEX IF NOT EXISTS activities_published_idx ON activities ( (document->>'published') );
CREATE INDEX IF NOT EXISTS activities_identifiedat_idx ON activities (identifiedat);

CREATE INDEX IF NOT EXISTS normalized_idx ON activities USING gin(normalized_tsvector);

CREATE INDEX IF NOT EXISTS activities_id_idx ON activities (id);

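Thanks!

【Comments on the question】:

The query you linked to is too slow, and it doesn't actually do what you want anyway. You must have modified it, but we don't know how. Please show us what you actually did.
@jjanes Good callout, please see the updated post.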

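Tags: sql postgresql search indexing full-text-search

【Solution 1】:

It is easy to get the unstemmed words by using the 'simple' configuration. You can then stem them with 'english' and find the most common original word for each stemmed word, and join that onto your original query.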

WITH t AS (
    -- The original query: the 50 most common stemmed lexemes from the last hour
    SELECT word, ndoc
    FROM ts_stat($$ SELECT normalized_tsvector FROM activities
       WHERE
          activities.identifiedat >
          current_timestamp - interval '60 minutes'
       $$)
    WHERE word NOT IN ('like', 'to', 'the', 'at', 'in', 'a')
    ORDER BY ndoc DESC LIMIT 50
),
lex AS (
    -- Unstemmed words from the 'simple' configuration, each stemmed with 'english';
    -- DISTINCT ON keeps the most common original word for each lexeme
    SELECT DISTINCT ON (lexeme)
           to_tsvector('english', word) AS lexeme,
           word AS original_word
    FROM ts_stat($$ SELECT to_tsvector('simple', normalized) FROM activities
       WHERE
          activities.identifiedat >
          current_timestamp - interval '60 minutes'
       $$)
    ORDER BY lexeme, ndoc DESC
)
SELECT t.*, original_word
FROM t
LEFT JOIN lex ON (to_tsvector('english', t.word) = lex.lexeme);
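
To try the lexeme-to-word mapping on its own, the following is a minimal sketch that runs against a throwaway table instead of `activities`. The table name `sample_docs` and its rows are made up for illustration; only `ts_stat`, `to_tsvector` and `DISTINCT ON` are taken from the query above.

```sql
-- Hypothetical sample data standing in for activities.normalized
CREATE TEMP TABLE sample_docs (normalized text);

INSERT INTO sample_docs (normalized) VALUES
    ('People like other people'),
    ('the people were running'),
    ('she runs while he is running');

-- For each stemmed lexeme, keep the unstemmed word that occurs in the most documents
SELECT DISTINCT ON (lexeme)
       to_tsvector('english', word) AS lexeme,   -- stemmed form, used as the grouping key
       word                         AS original_word,
       ndoc
FROM ts_stat($$ SELECT to_tsvector('simple', normalized) FROM sample_docs $$)
ORDER BY lexeme, ndoc DESC;
```

Words that the 'english' configuration treats as stop words all stem to an empty tsvector, so they collapse into a single row here. That row never matches anything in the final join, because stop words do not appear in `normalized_tsvector` in the first place.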

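You are defining your own stop-word list, but that should be unnecessary, since 'english' comes with its own list. I did not remove that part of your query, though.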
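
One way to check how much of the hand-written list is already covered is to pass each word through the stemmer dictionary directly. This is a sketch that assumes the default `english_stem` dictionary shipped with PostgreSQL is the one behind the 'english' configuration; `ts_lexize` returns an empty array for a word the dictionary treats as a stop word.

```sql
-- Check each manually listed word against the built-in English stop-word handling
SELECT w AS word,
       ts_lexize('english_stem', w)                AS lexemes,
       ts_lexize('english_stem', w) = '{}'::text[] AS is_stop_word
FROM unnest(ARRAY['like', 'to', 'the', 'at', 'in', 'a']) AS w;
```

Any word reported as a stop word never reaches `normalized_tsvector`, so excluding it again with `NOT IN` has no effect. Words that are not reported as stop words still need the explicit filter.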

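This takes about 3.5 seconds to process 52,670 documents containing 2,457,189 individual tokens and 75,732 distinct tokens. (Most of the distinct tokens are misspellings, email addresses, URLs, or hex digest values. The data comes from the bodies of git commit messages with the headers stripped off, but they still contain some cruft.)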

【Comments】: