Postgres select duplicate on one column but different on another

【Question title】: Postgres select duplicate on one column but different on another
【Posted】: 2017-11-19 02:30:03
【Question】:

I have a table with two columns:

author_id, message

And entries like:

123, "message!"
123, "message!"
123, "different message"
124, "message!"

I'd like to write a query that lets me select:

123, "message!"

or

124, "message!"

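Basically, entries where message is the same but author_id is different.

Then I want to delete one of those entries. (It doesn't matter which one, as long as I can select just one of them.)

This question got me close, but it is for duplicates across both columns.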

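【Question discussion】:

What if different authors have several messages in common (for example, author_id 123 and 124 both also have "message2")? What is the desired result then?
@OtoShavadze Same thing, selecting any one of them is fine. If one author has two duplicates and a second author has one, any of the three works.
Does this table have a primary key? -- And if the solution happens to pick the 123, 'message!' row for deletion, should all such rows be deleted?
@pozs It does have a primary key. It should delete all of them except one.

Tags: ruby-on-rails postgresql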

【Solution 1】:

Here is an alternative example:

-- Test table
CREATE TABLE dummy_data (
    author_id   int,
    message     text
);

-- Test data
INSERT INTO dummy_data ( author_id, message )
VALUES
( 123, '"message!"' ),
( 123, '"message!"' ),
( 123, '"different message"' ),
( 124, '"message!"' ),
( 124, '"message!"' ),
( 125, '"message!"' );

-- Delete query
DELETE FROM dummy_data
WHERE   ctid NOT IN (
            SELECT  max( ctid )
            FROM    dummy_data
            GROUP BY message     -- keeps one row (the max ctid) per distinct message
        )
 -- RETURNING is only here to show the deleted rows while testing;
 -- omit it if you don't need them
RETURNING *;

-- Confirming result:
SELECT * FROM dummy_data ;
 author_id |       message
-----------+---------------------
       123 | "different message"
       125 | "message!"
(2 rows)

See more about system columns: https://www.postgresql.org/docs/current/static/ddl-system-columns.html
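
Since the asker confirmed in the comments that the table has a primary key, the same cleanup can also be sketched without ctid. The following is a minimal sketch, not part of the original answer. It assumes the key is an integer column named id (the real column name is not given in the question) and uses a hypothetical table dummy_data_pk. It keeps the row with the highest id for each message and deletes the rest.

```sql
-- Hypothetical test table: same data as above plus an assumed primary key "id"
CREATE TABLE dummy_data_pk (
    id          int PRIMARY KEY,
    author_id   int,
    message     text
);

INSERT INTO dummy_data_pk ( id, author_id, message )
VALUES
( 1, 123, '"message!"' ),
( 2, 123, '"message!"' ),
( 3, 123, '"different message"' ),
( 4, 124, '"message!"' ),
( 5, 124, '"message!"' ),
( 6, 125, '"message!"' );

-- Rank the rows within each message, highest id first,
-- then delete every row that is not ranked first
DELETE FROM dummy_data_pk d
USING (
    SELECT  id,
            row_number() OVER ( PARTITION BY message ORDER BY id DESC ) AS rn
    FROM    dummy_data_pk
) ranked
WHERE   d.id = ranked.id
AND     ranked.rn > 1
RETURNING d.*;
```

With this sample data the rows with id 3 and id 6 remain, which matches the result of the ctid-based query.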

Edit:
An additional example, as requested, that limits the scope by ID (author_id).

Plain query:

DELETE FROM dummy_data
USING   ( SELECT ARRAY[ 123, 124] ) v(id)
WHERE   author_id = ANY ( v.id )
AND     ctid NOT IN (
            SELECT  max( ctid )
            FROM    dummy_data
            WHERE   author_id = ANY ( v.id )
            GROUP BY message
        );

The same query with comments:

DELETE FROM dummy_data
-- Add your 'author_id' values to the array here.
-- The array is listed once in the USING clause because
-- it has to be compared in two places, and a long list
-- would be annoying to write twice :)
USING   ( SELECT ARRAY[ 123, 124] ) v(id)
-- First, restrict the delete to the authors in this batch
WHERE   author_id = ANY ( v.id )
-- Second, keep the row with the max ctid per message,
-- computed only over the same batch of authors
AND     ctid NOT IN (
            SELECT  max( ctid )
            FROM    dummy_data
            WHERE   author_id = ANY ( v.id )
            GROUP BY message
        );

-- This will delete the following rows:
 author_id |  message
-----------+------------
       123 | "message!"
       123 | "message!"
       124 | "message!"
(3 rows)

-- Leaving the table in this state:
 author_id |       message
-----------+---------------------
       123 | "different message"
       124 | "message!"
       125 | "message!"
(3 rows)

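【Discussion】:

This is nice, but also a bit slow. I have about 100 million rows in the database I'm running this on, so it would be good to be able to scope it as well, e.g. remove duplicates but only within a specific subset, such as the author_ids in the array [123,124]. How could you modify this query to handle that?
But if you do it by author_ids, then for the [123,124] case the row for 124 will stay. If you then pass [125,126], that is a new look that doesn't know anything about the last "batch". Meaning 124, "message!" will remain even though "message!" is duplicated with 125. Are you OK with that? If yes, I can easily edit the example :)
Yes, that's fine. Essentially, I'm batching groups of authors together, and I want to make sure there are no duplicate messages within those groups. Does that make sense?
Added another example. See the Edit section :)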

【Solution 2】:

You can use array_agg() for this, for example:

select author_id, message
from (
    select message, array_agg(distinct author_id) ids
    from my_table
    group by message
    ) s
cross join unnest(ids) author_id
where cardinality(ids) > 1
order by author_id;

 author_id | message  
-----------+----------
       123 | message!
       124 | message!
(2 rows)

If you want a single row per message, the query can be simpler:

select min(author_id) as author_id, message
from my_table
group by message
having count(distinct author_id) > 1;

 author_id | message  
-----------+----------
       123 | message!
(1 row)

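【Discussion】:

I really like the second option, it's simple. Is it possible to select the id column as well? If I add it to the select, I also have to group by it, and then the query no longer works properly.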
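
One possible way to return the id as well, without adding it to GROUP BY, is to filter with EXISTS and pick a single row per message with DISTINCT ON. This is only a sketch and not part of the original answer. It assumes my_table has a primary key column named id (the name is an assumption; the question only says that a primary key exists). The CREATE TABLE and INSERT statements are illustrative test data.

```sql
-- Illustrative test data; "id" is an assumed primary key column name
CREATE TABLE my_table (
    id          int PRIMARY KEY,
    author_id   int,
    message     text
);

INSERT INTO my_table ( id, author_id, message )
VALUES
( 1, 123, 'message!' ),
( 2, 123, 'message!' ),
( 3, 123, 'different message' ),
( 4, 124, 'message!' );

-- One row per message that is used by more than one author.
-- DISTINCT ON keeps the first row of each message according to ORDER BY,
-- here the lowest author_id and, within it, the lowest id.
SELECT DISTINCT ON ( t.message )
        t.id, t.author_id, t.message
FROM    my_table t
WHERE   EXISTS (
            SELECT  1
            FROM    my_table o
            WHERE   o.message = t.message
            AND     o.author_id <> t.author_id
        )
ORDER BY t.message, t.author_id, t.id;
```

For the sample rows this returns the single row with id 1 (author_id 123, message!). Changing the ORDER BY columns after t.message changes which row of each group is kept.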

【Solution 3】:

If I understand correctly, you need something like this:

with the_table (author_id, message) as (
    select 123, '"message!"' union all
    select 123, '"message!"' union all
    select 123, '"aaa!"' union all
    select 123, '"different message"' union all
    select 124, '"aaa!"' union all
    select 124, '"message!"'  union all
    select 125, '"aaa!"' union all
    select 125, '"rrrr!"'  
)


select the_table.* from  the_table 
join ( 
    select message from the_table
    group by message
    having count(distinct author_id) = (select count(distinct author_id) from the_table)
) t
on the_table.message = t.message
order by random() limit 1

This returns one random row whose message is common to all author_ids.

【Discussion】: