Performant group by filtered by difference: answers

【Question title】: Performant group by filtered by difference
【Posted】: 2018-10-28 20:30:20
【Question】:

I have a table like this:

CREATE TABLE `items` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `id_ur` varchar(255) NOT NULL,
  `window_key` varchar(255) DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `idx_window_key` (`window_key`) USING BTREE,
  KEY `idx_id_ur` (`id_ur`) USING BTREE
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=latin1;

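This table contains 19 000 00 rows.

I need to select every id_ur value that appears with more than one distinct window_key. For example, if I have records like these: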

id,id_ur,window_key
1,"123","ABC"
2,"124","DEF"
3,"123","ABD"
4,"124","DEF"

I need "123" to be returned, but not "124".

I'm looking for a performant way to do this in MySQL Community Server version 5.7.22.

I tried the following:

select c1.id_ur
from items c1
inner join items c2
on c1.id_ur = c2.id_ur
where c1.window_key <> c2.window_key;

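But this is not really performant. I tried to express it with a group by clause, but I can't see how to express a grouping that keeps only the groups whose rows differ on a specific column.

I have separate indexes on the id_ur and window_key fields. I'm not sure whether adding a single index on both fields would be useful.

I'm looking for the right query to fetch these records.

Thanks to the help I received, I was able to find a more performant solution.

Here are the benchmark results: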

select distinct c1.id_ur
from items c1, items c2
where c1.id_ur = c2.id_ur
and c1.window_key <> c2.window_key
-- 1483 secs

select c1.id_ur
from items c1
inner join items c2
on c1.id_ur = c2.id_ur
where c1.window_key <> c2.window_key;
 -- 675 secs

select distinct c1.id_ur
from items c1
group by c1.id_ur
having count(distinct c1.window_key) > 1
-- 170 secs

SELECT dt.id_ur 
FROM 
(
  SELECT DISTINCT c1.id_ur, c1.window_key 
  FROM items AS c1
) AS dt 
GROUP BY dt.id_ur 
HAVING COUNT(*) > 1
-- 376 secs

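So the fastest solution is the group by with count(distinct ...).

【Question comments】:

When you use GROUP BY on a field, you don't need a DISTINCT clause on that same field.

Tags: mysql performance group-by

【Solution 1】:

Use group by together with having: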

select id_ur
from items
group by id_ur
having count(distinct window_key) > 1
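
To check this against the sample rows from the question, here is a minimal sketch. The scratch table name items_demo is an assumption made for illustration, and the column is written as id_ur to match the question's schema.

```sql
-- Hypothetical scratch table mirroring the question's schema
-- (the name items_demo is an assumption, not part of the original post).
CREATE TABLE items_demo (
  id int NOT NULL AUTO_INCREMENT,
  id_ur varchar(255) NOT NULL,
  window_key varchar(255) DEFAULT NULL,
  PRIMARY KEY (id)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

-- The four sample rows from the question.
INSERT INTO items_demo (id, id_ur, window_key) VALUES
  (1, '123', 'ABC'),
  (2, '124', 'DEF'),
  (3, '123', 'ABD'),
  (4, '124', 'DEF');

-- '123' has two distinct window_key values (ABC, ABD), so it passes the filter.
-- '124' has only one (DEF), so it is dropped.
SELECT id_ur
FROM items_demo
GROUP BY id_ur
HAVING COUNT(DISTINCT window_key) > 1;
-- returns one row: 123

DROP TABLE items_demo;
```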

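【Comments】:

To make it faster, change INDEX(id_ur) to INDEX(id_ur, window_key). That will be a "covering index": the query can be answered from the index alone, without reading the table rows.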
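
A sketch of what that index change could look like, using the names from the question's schema (items, idx_id_ur, id_ur). The new index name idx_id_ur_window_key is an assumption. On a table this size the ALTER may take a while, so it is worth trying on a copy first.

```sql
-- Replace the single-column index with a composite one.
-- idx_id_ur_window_key is a hypothetical name chosen for this example.
ALTER TABLE items
  DROP INDEX idx_id_ur,
  ADD INDEX idx_id_ur_window_key (id_ur, window_key);

-- Inspect the plan. If the composite index covers the query,
-- the Extra column of the EXPLAIN output typically includes "Using index".
EXPLAIN
SELECT id_ur
FROM items
GROUP BY id_ur
HAVING COUNT(DISTINCT window_key) > 1;
```

Because id_ur is the leading column, the composite index can still serve lookups that filter on id_ur alone, which is why the old single-column index can be dropped here.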

【Solution 2】:

@FatemehNB's answer is good. Besides that, you can also try the following query and compare the performance:

SELECT dt.id_ur 
FROM 
(
  SELECT DISTINCT c1.id_ur, c1.window_key 
  FROM items AS c1
) AS dt 
GROUP BY dt.id_ur 
HAVING COUNT(*) > 1
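
One thing to check when comparing this with the COUNT(DISTINCT window_key) form: window_key is declared DEFAULT NULL in the question, and the two queries treat NULL differently. COUNT(DISTINCT ...) ignores NULL values, while SELECT DISTINCT keeps a NULL row as its own combination. A minimal sketch, using a hypothetical scratch table items_null_demo and made-up rows:

```sql
-- Hypothetical scratch table and data, for illustration only.
CREATE TABLE items_null_demo (
  id int NOT NULL AUTO_INCREMENT,
  id_ur varchar(255) NOT NULL,
  window_key varchar(255) DEFAULT NULL,
  PRIMARY KEY (id)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

INSERT INTO items_null_demo (id, id_ur, window_key) VALUES
  (1, '125', 'XYZ'),
  (2, '125', NULL);

-- COUNT(DISTINCT window_key) skips the NULL, so the count is 1
-- and this query returns no rows.
SELECT id_ur
FROM items_null_demo
GROUP BY id_ur
HAVING COUNT(DISTINCT window_key) > 1;

-- The derived table keeps ('125', 'XYZ') and ('125', NULL) as two rows,
-- so COUNT(*) is 2 and this query returns 125.
SELECT dt.id_ur
FROM
(
  SELECT DISTINCT c1.id_ur, c1.window_key
  FROM items_null_demo AS c1
) AS dt
GROUP BY dt.id_ur
HAVING COUNT(*) > 1;

DROP TABLE items_null_demo;
```

If the real data never has a NULL window_key, the two queries should return the same set of id_ur values.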

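【Comments】:

F's query does the job in a single pass (plus the GROUP BY overhead). Your subquery version takes about 2 passes, plus 2 GROUP BYs (one of them disguised as a DISTINCT).
@RickJames I know; that's why I said F's answer is good. This is just another possible approach that can sometimes help.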