Distinct in count function used along with group by in Hive

【Question title】: Distinct in count function used along with group by in Hive
【Posted】: 2015-02-18 23:05:07
【Question】:

I want to find duplicates in a Hive table like the one below.

ID      name    phone
1       John    602-230-4040
2       Brian   602-230-3030
3       John    602-230-4040
4       Brian   602-230-3030
5       Jeff    602-230-4040

In a relational database, the simplest way to do this is the count function combined with group by and a having clause. When I use the following query,

select count(name, phone) cnt, name, phone from mytest group by name, phone having cnt>1;

the following exception is thrown:

FAILED: UDFArgumentException DISTINCT keyword must be specified

I then added the distinct keyword to the query.

select count(distinct name, phone) cnt, name, phone from mytest group by name, phone having cnt>1;

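Obviously this query returns no rows: with the distinct keyword, each (name, phone) group contains exactly one distinct pair, so cnt is never greater than 1.

I am not sure why Hive forces the distinct keyword on the count function when it is used along with a group by clause.

Can anyone tell me how to find duplicates in a Hive table?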

【Comments】:

Tags: hive

【Solution 1】:

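If I understand your use case correctly, you actually want COUNT(*), because you are interested in the plain number of rows in each group.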

SELECT name, phone, COUNT(*) AS cnt FROM mytest GROUP BY name, phone HAVING cnt > 1;

When I run this query against your test data (loaded here into a table named foo):

hive> SELECT id, name, phone FROM foo;
OK
1   John    602-230-4040
2   Brian   602-230-3030
3   John    602-230-4040
4   Brian   602-230-3030
5   Jeff    602-230-4040
Time taken: 0.32 seconds, Fetched: 5 row(s)
hive> SELECT name, phone, COUNT(*) AS cnt
    > FROM foo GROUP BY name, phone HAVING cnt > 1;
...
... Lots of MapReduce spam
...
Brian       602-230-3030    2
John        602-230-4040    2
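
To see why the DISTINCT version from the question comes back empty, it may help to put both counts side by side for each group. The sketch below assumes the question's table name mytest and its five sample rows; the aliases distinct_cnt, row_cnt and t are only illustrative. The second query is an optional variant, assuming a Hive version with windowing functions (0.11 or later), that lists the ids of the duplicated rows rather than just the duplicated values.

```sql
-- Compare the two counts per (name, phone) group.
-- COUNT(DISTINCT name, phone) is 1 in every group, because all rows in a
-- group share the same name and phone; COUNT(*) is the number of rows.
SELECT name,
       phone,
       COUNT(DISTINCT name, phone) AS distinct_cnt,
       COUNT(*)                    AS row_cnt
FROM mytest
GROUP BY name, phone;
-- With the sample data (row order is not guaranteed):
--   Brian   602-230-3030   1   2
--   Jeff    602-230-4040   1   1
--   John    602-230-4040   1   2

-- Optional variant: return every row that belongs to a duplicated
-- (name, phone) pair, including its id.
SELECT id, name, phone
FROM (
  SELECT id,
         name,
         phone,
         COUNT(*) OVER (PARTITION BY name, phone) AS cnt
  FROM mytest
) t
WHERE cnt > 1;
```

Since distinct_cnt is always 1, a HAVING cnt > 1 filter on it can never match, while the same filter on row_cnt keeps the Brian and John groups.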

【Comments】: