【Question title】: Find average by joining two datasets
【Posted】: 2015-01-06 03:35:17
【Question】:

I have two datasets:

EmployeeDetail(data set 1):- 
   id  
   name
   gender
   location 

SalaryDetail(data set 2):-
   id
   salary

I need to join the two and find the average salary of males and females for each location. So I tried the following code.

EmpDetail = load '/Users/bmohanty6/EmployeeDetails/EmpDetail.txt' as 
(id:int, name:chararray, gender:chararray, location:chararray);
SalaryDetail = load '/Users/bmohanty6/EmployeeDetails/EmpSalary.txt' as 
(id:int, salary:float);                                     
JoinedEmpDetail = join EmpDetail by id, SalaryDetail by
id;                                                                         
GroupedByLocation = group JoinedEmpDetail by location;
AverageSalary = foreach GroupedByLocation { 
genderGrp = group JoinedEmpDetail by JoinedEmpDetail.EmpDetail::gender;
avgSalary = foreach genderGrp generate group, 
AVG(JoinedEmpDetail.SalaryDetail::salary);
generate group as location, JoinedEmpDetail.EmpDetail::gender, avgSalary;
};

But it throws an error:

<line 6, column 22>  Syntax error, unexpected symbol at or near 
'JoinedEmpDetail'

Can anyone help me see where I went wrong, or how to do this correctly?

To make my requirement clearer, here are some sample datasets.

EmpDetail.txt

1   Biswa   Male    Bangalore
12  Bratati Mahapatra   Female  Chennai
2   Bibhu kalyan    Male    Bangalore
3   Chinta  Male    Mumbai
10  Amrit Anand Male    Bangalore
11  Sateesh panda   Male    Bangalore
4   Kirti Kumar Male    Mumbai
6   Shruthi Female  Chennai
7   Vijay   Male    Chennai
5   Bibhu   Male    Chennai
9   Bratati  Mohanty    Female  Bangalore
8   Rupa Mahapatra  Female  Bangalore
13  Salini  Female  Mumbai
14  Priyanka Chopra Female  Mumbai

EmpSalary.txt

1   10000
12  12000
2   15900
3   9000
10  8000
11  13400
4   7600
6   22000
7   17000
5   16800
9   9800
8   10000
13  11000
14  12500

The final result I need is:

Mumbai male <avgsalary amount>
Mumbai female <avgsalary amount>
Bangalore male <avgsalary amount>
Bangalore female <avgsalary amount>
Chennai male <avgsalary amount>
Chennai female <avgsalary amount>

【Discussion】:

    Tags: apache-pig inner-join


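    【Solution 1】:

    You can solve this with a simple FOREACH statement, so there is no need for a nested FOREACH.

    The GROUP operator does not work inside a nested FOREACH; Pig restricts this. Only a few operators are allowed inside a nested FOREACH (CROSS, DISTINCT, FILTER, FOREACH, LIMIT, and ORDER BY).

    Could you change your script like this?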

    EmpDetail = load '/Users/bmohanty6/EmployeeDetails/EmpDetail.txt' as (id:int, name:chararray, gender:chararray, location:chararray);
    SalaryDetail = load '/Users/bmohanty6/EmployeeDetails/EmpSalary.txt' as (id:int, salary:float);                                     
    JoinedEmpDetail = join EmpDetail by id, SalaryDetail by id;
    GroupedByLocation = group JoinedEmpDetail by (location,gender);
    AverageSalary = FOREACH GroupedByLocation GENERATE FLATTEN(group),AVG(JoinedEmpDetail.SalaryDetail::salary);
    DUMP AverageSalary;
    

    Output:

    (Mumbai,Male,8300.0)
    (Mumbai,Female,11750.0)
    (Chennai,Male,16900.0)
    (Chennai,Female,17000.0)
    (Bangalore,Male,11825.0)
    (Bangalore,Female,9900.0)
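
    If a nested FOREACH is still preferred, one possible alternative is to lean on FILTER, which is among the permitted nested operators, and split each location's bag by gender inside the block. The following is an untested sketch: it assumes the gender column holds exactly the literals 'Male' and 'Female' as in the sample data, and the aliases GroupedByLoc, males, females, avg_male, and avg_female are made up for illustration. It would produce one row per location with two average columns, rather than one row per (location, gender) pair as in the output above.

    ```pig
    -- Reuses JoinedEmpDetail from the script above.
    GroupedByLoc = group JoinedEmpDetail by location;
    AvgByLocation = foreach GroupedByLoc {
        -- FILTER is allowed inside a nested FOREACH, unlike GROUP.
        males   = filter JoinedEmpDetail by EmpDetail::gender == 'Male';
        females = filter JoinedEmpDetail by EmpDetail::gender == 'Female';
        generate group as location,
                 AVG(males.SalaryDetail::salary)   as avg_male,
                 AVG(females.SalaryDetail::salary) as avg_female;
    };
    DUMP AvgByLocation;
    ```

    The flat GROUP BY (location,gender) version is likely the simpler choice when the set of gender values is not known in advance, since this sketch hard-codes them.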
    

    【Comments】:

    • Works great. Thanks a lot for clarifying my mistake.