Apache Pig: fetch the max from a dataset that has groups (answers)

【Question title】: Apache Pig fetch max from a data-set that has Groups
【Posted】: 2017-11-24 01:15:10
【Question】:

I have a dataset stored in HDFS in a file named temp.txt, as follows:

US,Arizona,51.7
US,California,56.7
US,Bullhead City,51.1
India,Jaisalmer,42.4
Libya,Aziziya,57.8
Iran,Lut Desert,70.7
India,Banda,42.4

Now I load it into Pig with the following command:

temp_input = LOAD '/WC/temp.txt' USING PigStorage(',') as 
(country:chararray,city:chararray,temp:double);

After that, I group all the data in temp_input by country:

 group_country = GROUP temp_input BY country;

When I dump group_country, the following output is shown on screen:

(US,{(US,Bullhead City,51.1),(US,California,56.7),(US,Arizona,51.7)})
(Iran,{(Iran,Lut Desert,70.7)})
(India,{(India,Banda,42.4),(India,Jaisalmer,42.4)})
(Libya,{(Libya,Aziziya,57.8)})

After grouping the dataset, I tried to get the country name and the maximum temperature for each country in group_country with the following query:

max_temp = foreach group_country generate group,max(temp);

This produces an error that looks like this:

017-06-21 13:20:34,708 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
1070: Could not resolve max using imports: [, java.lang., 
org.apache.pig.builtin., org.apache.pig.impl.builtin.]
Details at logfile: /opt/ecosystems/pig-0.16.0/pig_1498026994684.log

What should the next step be to resolve this error and get the desired result? All help is appreciated.

【Comments】:

MAX works; max does not.
Thanks for the heads-up!

Tags: hadoop mapreduce hdfs apache-pig

【Answer 1】:

When transforming relations in Pig, run describe relationname; it shows the schema and helps you see how to iterate over it. So in your case:

describe group_country;

should give you output like this:

group_country: {group: chararray,temp_input: {(country: chararray,city: chararray,temp: double)}}

Then the query:

max_temp = foreach group_country GENERATE group,MAX(temp_input.temp);

Output:

(US,56.7)
(Iran,70.7)
(India,42.4)
(Libya,57.8)
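
Putting the steps together, a minimal end-to-end sketch might look like the following. It assumes the same /WC/temp.txt path and schema as in the question; the field alias hottest is only an illustrative name. Two details matter here: built-in function names in Pig are case-sensitive (hence MAX, not max), and MAX expects a bag holding a single column, which is why the argument is temp_input.temp rather than temp.

```pig
-- Load the comma-separated file with an explicit schema (path taken from the question).
temp_input = LOAD '/WC/temp.txt' USING PigStorage(',')
             AS (country:chararray, city:chararray, temp:double);

-- One tuple per country; the inner bag is named after the grouped relation (temp_input).
group_country = GROUP temp_input BY country;

-- Project the temp column out of the bag and take its maximum per group.
-- "hottest" is a hypothetical alias chosen for this sketch.
max_temp = FOREACH group_country GENERATE group AS country, MAX(temp_input.temp) AS hottest;

DUMP max_temp;
```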

Update based on the comments (to also show the corresponding city):

finaldata = foreach group_country {
    orderedset = order temp_input by temp DESC;
    maxtemps = limit orderedset 1;
    generate flatten(maxtemps);
}
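
One caveat with limit 1: when several cities tie for the maximum (India has two rows at 42.4 in the sample data), only one of them is returned. A possible variant that keeps every tied row is sketched below. It assumes the same file and schema as in the question; the names with_max, group_max, hottest_all, and finaldata_all are made up for illustration.

```pig
-- Same load and grouping as in the question.
temp_input = LOAD '/WC/temp.txt' USING PigStorage(',')
             AS (country:chararray, city:chararray, temp:double);
group_country = GROUP temp_input BY country;

-- Attach each group's maximum to every row of that group.
with_max = FOREACH group_country GENERATE
               FLATTEN(temp_input) AS (country:chararray, city:chararray, temp:double),
               MAX(temp_input.temp) AS group_max;

-- Keep only the rows whose temperature equals the group maximum (ties included).
hottest_all = FILTER with_max BY temp == group_max;

finaldata_all = FOREACH hottest_all GENERATE country, city, temp;
DUMP finaldata_all;
```

The trade-off is that this form can return more than one row per country, whereas the order/limit form above always returns exactly one.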

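【Comments】:

@TKHN sweet! That hit the nail on the head. I was stuck on this for a long time. You, sir, are a saviour.
@TKHN What if I want to show the corresponding city as well?
Thank you very much. I learned a lot today. I must say, that is a great approach. @TKHN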