Hadoop Cascading: how to get the top N tuples

【Question title】: Hadoop Cascading: how to get the top N tuples
【Posted】: 2014-03-19 16:20:45
【Question】:

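I'm new to Cascading and am trying to find a way to get the top N tuples based on a sort order. For example, I want to find the 100 most common first names.

I can do something like this in Teradata SQL: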

select top 100 first_name, num_records   
from
    (select first_name, count(1) as num_records   
     from table_1  
     group by first_name) a  
order by num_records DESC

Hadoop Pig has an equivalent:

a = load 'table_1' as (first_name:chararray, last_name:chararray);
b = foreach (group a by first_name) generate group as first_name, COUNT(a) as num_records;
c = order b by num_records DESC;
d = limit c 100;

This seems easy to do in SQL or Pig, but I'm having a hard time finding a way to do it in Cascading. Please advise!

【Comments】:

Tags: hadoop mapreduce sql-order-by cascading

【Answer 1】:

Assuming you only need to see how to set up the Pipe for this:

In Cascading 2.1.6,

// Group the input tuples by first_name.
Pipe firstNamePipe = new GroupBy("topFirstNames", InPipe,
                                 new Fields("first_name"));

// Count the tuples in each group into a new "num_records" field.
firstNamePipe = new Every(firstNamePipe, new Fields("first_name"),
                          new Count(new Fields("num_records")), Fields.ALL);

// Group again, sorting on num_records within each group.
firstNamePipe = new GroupBy(firstNamePipe,
                            new Fields("first_name"),
                            new Fields("num_records"),
                            true); // true = descending order

// Keep the first 100 tuples of each group.
firstNamePipe = new Every(firstNamePipe, new Fields("first_name", "num_records"),
                          new First(Fields.ARGS, 100), Fields.ALL);

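InPipe is formed from the Tap you pass in, which holds the tuple data you referenced above, i.e. "first_name". "num_records" is created by the call to new Count().

If the "num_records" and "first_name" data live in separate taps (tables or files), you can set up two pipes pointing at those two Tap sources and join them with CoGroup.

The definitions I used come from Cascading 2.1.6: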

GroupBy(String groupName, Pipe pipe, Fields groupFields, Fields sortFields, boolean reverseOrder)

Count(Fields fieldDeclaration)

First(Fields fieldDeclaration, int firstN)

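【Comments】:

Hi Engineiro, I think you are grouping on the "first_name" field and sorting num_records within each group, i.e. sorting only among tuples that share the same first name. What I want here is the top first names overall: something like a single group, then taking the top rows.
The best I have come up with so far is to add a constant field to the {first_name, num_records} scheme and group on that constant field, which yields one single group. Then sort on num_records and take the first N.
You are right. I made some edits. Keep in mind that this is all local sorting. Hadoop and Cascading in general are not too keen on total sorts. For a total sort you need a Cascading reducer.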
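
The constant-field idea from the comments can be modeled without Cascading at all. The following is a minimal plain-Java sketch of the intended semantics only: count per first name, treat all (first_name, num_records) pairs as one group, sort that group by count descending, and keep the first N. The class name TopNames and the method topN are hypothetical and are not part of the Cascading API; the tie-break on name is an added assumption to make the result deterministic.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java model of the pipeline; TopNames/topN are hypothetical names, not Cascading API.
class TopNames {
    // Returns up to n (first_name, num_records) entries with the highest counts.
    static List<Map.Entry<String, Long>> topN(List<String> firstNames, int n) {
        // Step 1: group by first_name and count, like GroupBy followed by Count.
        Map<String, Long> counts = new HashMap<>();
        for (String name : firstNames) {
            counts.merge(name, 1L, Long::sum);
        }
        // Step 2: collect every pair into ONE group, which is what grouping
        // on a constant field achieves.
        List<Map.Entry<String, Long>> singleGroup = new ArrayList<>(counts.entrySet());
        // Step 3: sort the single group by count descending
        // (ties broken by name; this tie-break is an assumption).
        singleGroup.sort(
            Map.Entry.<String, Long>comparingByValue(Comparator.reverseOrder())
                .thenComparing(Map.Entry.<String, Long>comparingByKey()));
        // Step 4: keep only the first n entries.
        return new ArrayList<>(singleGroup.subList(0, Math.min(n, singleGroup.size())));
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("ann", "bob", "ann", "cid", "bob", "ann");
        System.out.println(topN(names, 2));
    }
}
```

Because step 2 funnels everything into a single group, the sort is a total sort handled in one place, which matches the caveat above that per-group sorting is only local.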

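【Answer 2】:

Method 1: Use GroupBy and group on the required columns. You can take advantage of the secondary sort that Cascading provides; by default it sorts in ascending order, and if we want descending order we can get it through reverseorder()

To get the top N tuples or rows:

It is quite simple: keep a static counter variable in a Filter, increment it by 1 for each tuple, and check whether it is greater than N.

Return true when the count is greater than N, otherwise return false.

This gives the first N tuples in the output.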
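
The counting rule above can be sketched in plain Java. This is only a stand-in for a Cascading Filter, under the assumption that returning true means "remove this tuple"; the names FirstNFilter, isRemove, and keepFirst are chosen for illustration. It uses an instance counter rather than a static one, since a static variable is shared per JVM and each task running in its own JVM would keep its own count.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a Cascading Filter; "true" means "remove this tuple".
class FirstNFilter {
    private final int limit; // N: how many tuples to let through
    private int seen = 0;    // tuples inspected so far

    FirstNFilter(int limit) {
        this.limit = limit;
    }

    // Counts each tuple and returns true once the count exceeds N.
    boolean isRemove(String tuple) {
        seen++;
        return seen > limit;
    }

    // Runs the filter over an already sorted list and keeps what is not removed.
    static List<String> keepFirst(List<String> sorted, int n) {
        FirstNFilter filter = new FirstNFilter(n);
        List<String> kept = new ArrayList<>();
        for (String tuple : sorted) {
            if (!filter.isRemove(tuple)) {
                kept.add(tuple);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        System.out.println(keepFirst(Arrays.asList("ann", "bob", "cid", "dee"), 2));
    }
}
```

The result is only a true top N if the tuples reach the filter already in globally sorted order and a single counter sees all of them.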

Method 2

Cascading provides a built-in Unique function, which returns a FirstNBuffer.

See the link below: http://docs.cascading.org/cascading/2.2/javadoc/cascading/pipe/assembly/Unique.html

【Comments】: