【Question title】: How do I get matching values in Pig without using a UDF?
【Posted】: 2015-06-20 07:53:47
【Question】:

Consider these as my input files:

 Input 1: (File 1)
 12,23,14,15,9
 1,2,3,4,5
 34,17,8
 .
 .

 Input 2: (File 2)
 12 Twelve
 23 TwentyThree
 34 ThirtyFour
 .
 .

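My Pig script reads each line of the "Input 1" file, and I want to replace every value on that line with its matching name from the "Input 2" file, producing the result below.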

 Output:
 Twelve,TwentyThree,Fourteen,Fifteen,Nine
 One,Two,Three,Four,Five
 .
 .

Is it possible to achieve this without a UDF? Please let me know your suggestions.

Thanks in advance!

【Comments】:

    Tags: hadoop apache-pig


    【Answer 1】:

    This breaks your "no UDF" criterion, but the UDFs it uses (TOKENIZE and BagToTuple) are built-ins, so I suspect it is sufficient.

    Query:

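    data1 = LOAD 'file1' AS (val:chararray);
    /* file2 is shown space-delimited in the question; PigStorage defaults to tab,
       so the delimiter is set explicitly (drop it if your file is tab-delimited) */
    data2 = LOAD 'file2' USING PigStorage(' ') AS (num:chararray, desc:chararray);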
    
    A = RANK data1;  /* creates row number*/
    B = FOREACH A GENERATE rank_data1, FLATTEN(TOKENIZE(val, ',')) AS num;
    C = RANK B;  /* used to keep tuple elements sorted in bag*/
    D = JOIN C BY num, data2 BY num;
    E = FOREACH D GENERATE C::rank_data1 AS rank_1:long
                         , C::rank_B AS rank_2:long
                         , data2::desc AS description;
    
    grpd = GROUP E BY rank_1;
    F = FOREACH grpd {
          sorted = ORDER E BY rank_2;
          GENERATE sorted;
        };
    
    X = FOREACH F GENERATE FLATTEN(BagToTuple(sorted.description));
    DUMP X;
    

    Output:

    (Twelve,TwentyThree,Fourteen,Fifteen,Nine)
    (One,Two,Three,Four,Five)
    (ThirtyFour,Seventeen,Eight)
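
For readers less familiar with Pig, the dataflow of the query above can be sketched in plain Python. This is only an illustration under assumptions, not part of the original answer: the function name `translate_lines` and the in-memory `lookup` dict are hypothetical stand-ins for the Pig relations and for `file2`, and the comments map each step to the relation it loosely mirrors.

```python
from collections import defaultdict


def translate_lines(lines, lookup):
    """Replace each comma-separated value with its name, keeping token order.

    Hypothetical helper: `lines` plays the role of file1 and `lookup`
    (number -> name) plays the role of file2.
    """
    # A, B, C: number every line, split it on commas, and give every
    # token a global position so the original order can be restored later.
    tokens = []
    for line_no, line in enumerate(lines, start=1):
        for tok in line.split(","):
            tokens.append((line_no, len(tokens) + 1, tok.strip()))

    # D, E: inner join against the lookup. As with Pig's JOIN,
    # tokens without a match are silently dropped.
    joined = [(line_no, pos, lookup[tok])
              for line_no, pos, tok in tokens if tok in lookup]

    # grpd, F: group by line number, then order each group by position.
    grouped = defaultdict(list)
    for line_no, pos, name in joined:
        grouped[line_no].append((pos, name))

    # X: flatten each ordered group back into one comma-separated line.
    return [",".join(name for _, name in sorted(grouped[line_no]))
            for line_no in sorted(grouped)]


if __name__ == "__main__":
    sample_lookup = {"12": "Twelve", "23": "TwentyThree", "34": "ThirtyFour"}
    for row in translate_lines(["12,23", "34"], sample_lookup):
        print(row)
```

The second rank is the important design choice: a join followed by a group does not preserve the order of tokens within a line, so each token carries its position along and is re-sorted inside its group. Note also that the join is an inner join, so a value with no entry in the lookup disappears from the output rather than being passed through.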
    

    【Comments】:

      【Answer 2】:

      Here is a Hive solution:

      --Load the data into Hive
      CREATE TABLE file1 (
        line array<string>
      )
      ROW FORMAT DELIMITED 
      COLLECTION ITEMS TERMINATED BY ',';
      LOAD DATA INPATH '/tmp/test2/file1' OVERWRITE INTO TABLE  file1;
      
      CREATE TABLE file2 (
        name string,
        value string
      )
      ROW FORMAT DELIMITED FIELDS TERMINATED BY ' ';
      LOAD DATA INPATH '/tmp/test2/file2' OVERWRITE INTO TABLE  file2;
      
      --explode the rows from the first table and create a newid to use for correlation
      CREATE TABLE file1_exploded 
      AS
      WITH tmp 
      AS
      (SELECT RAND() newid, line from file1)
      SELECT newid, item FROM tmp 
      LATERAL VIEW EXPLODE (line) a AS item;
      
      --apply substitutions using the second table, then join lines back together
      --(note: COLLECT_LIST does not guarantee that items keep their original order)
      SELECT CONCAT_WS(',', COLLECT_LIST(value))
      FROM 
      file1_exploded
      JOIN file2 ON item = name
      GROUP BY newid;
      

      【Comments】:
