Find rows with multiple duplicate fields with Active Record, Rails & Postgres: answers

【Question title】: Find rows with multiple duplicate fields with Active Record, Rails & Postgres
【Posted】: 2014-03-07 07:28:58
【Question】:

What is the best way to find records with duplicate values across multiple columns using Postgres and Active Record?

I found this solution here:

User.find(:all, :group => [:first, :email], :having => "count(*) > 1" )

But it doesn't seem to work with Postgres. I get this error:

PG::GroupingError: ERROR: column "parts.id" must appear in the GROUP BY clause or be used in an aggregate function

【Comments】:

In plain SQL I would use a self-join, e.g. select a.id, b.id, name, email FROM user a INNER JOIN user b USING (name, email) WHERE a.id > b.id. I don't know how to express that in ActiveRecord-speak.

Tags: ruby-on-rails postgresql activerecord

【Solution 1】:

Tested and working version:

User.select(:first,:email).group(:first,:email).having("count(*) > 1")

Also, this is slightly off-topic but handy: if you want to see how many times each combination occurs, append .size:

User.select(:first,:email).group(:first,:email).having("count(*) > 1").size

You'll get a result that looks like this:

{[nil, nil]=>512,
 ["Joe", "test@test.com"]=>23,
 ["Jim", "email2@gmail.com"]=>36,
 ["John", "email3@gmail.com"]=>21}

I thought that was pretty cool and hadn't seen it before.

Credit to Taryn; this is just a tweaked version of her answer.
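
To make the shape of that .size result concrete, here is a minimal plain-Ruby sketch that mimics GROUP BY first, email HAVING count(*) > 1 on an in-memory array. The sample rows and the duplicate_counts helper are hypothetical stand-ins for the users table and are not part of Active Record; no database is involved.

```ruby
# Hypothetical sample rows standing in for the users table.
users = [
  { id: 1, first: "Joe", email: "test@test.com" },
  { id: 2, first: "Joe", email: "test@test.com" },
  { id: 3, first: "Jim", email: "email2@gmail.com" },
  { id: 4, first: "Joe", email: "other@test.com" },
  { id: 5, first: "Jim", email: "email2@gmail.com" },
  { id: 6, first: "Jim", email: "email2@gmail.com" }
]

# Hypothetical helper: groups rows by the given columns and keeps only the
# combinations that occur more than once, as { [values...] => count }.
def duplicate_counts(rows, *columns)
  rows
    .group_by { |row| row.values_at(*columns) }
    .transform_values(&:size)
    .select { |_key, count| count > 1 }
end

p duplicate_counts(users, :first, :email)
```

The keys are arrays because two columns are grouped together, which matches the [first, email] keys in the result shown above.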

【Comments】:

I had to pass an explicit array to select(), as in User.select([:first,:email]).group(:first,:email).having("count(*) > 1").count, to get it to work.
Adding .count gives PG::UndefinedFunction: ERROR: function count
You could try User.select([:first,:email]).group(:first,:email).having("count(*) > 1").map.count
I'm trying the same approach but also want to get User.id; adding it to both select and group returns an empty array. How can I return the whole User model, or at least include :id?
Use .size instead of .count

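【Solution 2】:

That error occurs because Postgres requires every column in the SELECT list to either appear in the GROUP BY clause or be used in an aggregate function. The original query selects all columns, including id, so you need to select only the grouping columns.

Try: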

User.select(:first,:email).group(:first,:email).having("count(*) > 1").all

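(Note: untested, you may need to tweak it.)

Edited to remove the id column.

【Comments】:

That won't work; the id column isn't part of the group, so you can't reference it unless you aggregate it (e.g. array_agg(id) or json_agg(id)).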

【Solution 3】:

If you need the full models, try the following (based on @newUserNameHere's answer).

User.where(email: User.select(:email).group(:email).having("count(*) > 1").select(:email))

This returns every row whose email address is not unique.

I'm not aware of a way to do this over multiple attributes.
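
As a rough illustration of what this query selects, the plain-Ruby sketch below returns the full rows whose key occurs more than once. It runs on an in-memory array, so it is not an Active Record query; the sample rows and the rows_with_duplicates helper are hypothetical. In memory the same idea extends to several columns by using the tuple of values as the key.

```ruby
# Hypothetical sample rows standing in for the users table.
users = [
  { id: 1, first: "Joe",  email: "test@test.com" },
  { id: 2, first: "Joe",  email: "test@test.com" },
  { id: 3, first: "Jim",  email: "email2@gmail.com" },
  { id: 4, first: "Joe",  email: "other@test.com" },
  { id: 5, first: "Jim",  email: "email2@gmail.com" },
  { id: 6, first: "Jane", email: "test@test.com" }
]

# Hypothetical helper: returns complete rows whose values in the given
# columns are shared with at least one other row.
def rows_with_duplicates(rows, *columns)
  counts = Hash.new(0)
  rows.each { |row| counts[row.values_at(*columns)] += 1 }
  rows.select { |row| counts[row.values_at(*columns)] > 1 }
end

# Duplicates by email only, then by the (first, email) pair.
p rows_with_duplicates(users, :email).map { |row| row[:id] }
p rows_with_duplicates(users, :first, :email).map { |row| row[:id] }
```

Row 6 shares an email with rows 1 and 2 but not a first name, so it counts as a duplicate only in the single-column case.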

【Comments】:

Thank you, this works great :) The final .select(:email) seems redundant. I think this is a bit cleaner, but I might be wrong: User.where(email: User.select(:email).group(:email).having("count(*) > 1"))

【Solution 4】:

If you use PostgreSQL, you can fetch the duplicates with a single query:

def duplicated_users
  duplicated_ids = User
    .group(:first, :email)
    .having("COUNT(*) > 1")
    .select('unnest((array_agg("id"))[2:])')

  User.where(id: duplicated_ids)
end

irb> duplicated_users
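
The subquery above collects the ids of each duplicate group with array_agg, slices off the first element with [2:], and flattens the rest with unnest, so the outer query returns every row in a group except one. The plain-Ruby sketch below mirrors that logic on an in-memory array; the sample rows and the surplus_ids helper are hypothetical. Note that without an ORDER BY inside array_agg, Postgres does not guarantee which id comes first, whereas this sketch always keeps the first row in array order.

```ruby
# Hypothetical sample rows standing in for the users table.
users = [
  { id: 1, first: "Joe", email: "test@test.com" },
  { id: 2, first: "Joe", email: "test@test.com" },
  { id: 3, first: "Jim", email: "email2@gmail.com" },
  { id: 4, first: "Joe", email: "other@test.com" },
  { id: 5, first: "Jim", email: "email2@gmail.com" },
  { id: 6, first: "Jim", email: "email2@gmail.com" }
]

# Hypothetical helper: for each group with more than one row, collect the
# ids (like array_agg), drop the first (like [2:]), and flatten (like unnest).
def surplus_ids(rows, *columns)
  rows
    .group_by { |row| row.values_at(*columns) }
    .values
    .select { |group| group.size > 1 }
    .flat_map { |group| group.map { |row| row[:id] }.drop(1) }
end

p surplus_ids(users, :first, :email)
```

The returned ids are the rows you would delete to leave exactly one row per (first, email) combination.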

【Comments】:

【Solution 5】:

This works well in raw SQL:

# select array_agg(id) from attendances group by event_id, user_id having count(*) > 1;
   array_agg   
---------------
 {3712,3711}
 {8762,8763}
 {7421,7420}
 {13478,13477}
 {15494,15493}

【Comments】:

【Solution 6】:

Building on @newUserNameHere's answer above, I believe the right way to show the count for each combination is:

res = User.select('first, email, count(1)').group(:first,:email).having('count(1) > 1')

res.each {|r| puts r.attributes } ; nil

【Comments】: