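【Question title】: How to transpose/pivot data in Hive?
【Posted】: 2014-04-12 02:25:13
【Question】:

I know there is no direct way to transpose data in Hive. I followed this question: Is there a way to transpose data in Hive?, but since there is no final answer there, I could not get all the way.

Here is my table: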

 | ID   |   Code   |  Proc1   |   Proc2 | 
 | 1    |    A     |   p      |   e     | 
 | 2    |    B     |   q      |   f     |
 | 3    |    B     |   p      |   f     |
 | 3    |    B     |   q      |   h     |
 | 3    |    B     |   r      |   j     |
 | 3    |    C     |   t      |   k     |

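Here Proc1 can have any number of values. ID, Code, and Proc1 together form the unique key of this table. I want to pivot/transpose the table so that each unique value in Proc1 becomes a new column, and the corresponding Proc2 value becomes the value in that column for the matching row. Essentially, I'm trying to get something like: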

 | ID   |   Code   |  p   |   q |  r  |   t |
 | 1    |    A     |   e  |     |     |     |
 | 2    |    B     |      |   f |     |     |
 | 3    |    B     |   f  |   h |  j  |     |
 | 3    |    C     |      |     |     |  k  |

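In the new transformed table, ID and Code together form the unique primary key. Following the question I mentioned above, I could get this far using the to_map UDAF. (Disclaimer: this may not be a step in the right direction, but I mention it here in case it is.)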

 | ID   |   Code   |  Map_Aggregation   | 
 | 1    |    A     |   {p:e}            |
 | 2    |    B     |   {q:f}            |
 | 3    |    B     |   {p:f, q:h, r:j } |  
 | 3    |    C     |   {t:k}            |

But I don't know how to get from this step to the pivoted/transposed table I want. Any help on how to proceed would be great! Thanks.

    Tags: hadoop hive


    【Solution 1】:

    Here is the approach I used to solve this problem with Hive's built-in UDF "map":

    select
        b.id,
        b.code,
        concat_ws('',b.p) as p,
        concat_ws('',b.q) as q,
        concat_ws('',b.r) as r,
        concat_ws('',b.t) as t
    from 
        (
            select id, code,
            collect_list(a.group_map['p']) as p,
            collect_list(a.group_map['q']) as q,
            collect_list(a.group_map['r']) as r,
            collect_list(a.group_map['t']) as t
            from (
                select
                  id,
                  code,
                  map(proc1,proc2) as group_map 
                from 
                  test_sample
            ) a
            group by
                a.id,
                a.code
        ) b;
    

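    "concat_ws" and "map" are Hive UDFs; "collect_list" is a Hive UDAF. The inner query builds a one-entry map per row, collect_list gathers the non-NULL lookups for each (id, code) group, and concat_ws turns each collected array into a plain string.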
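To see why the query above produces the desired table, its three steps can be emulated outside Hive. This is only a minimal Python sketch of the semantics, not Hive code; the helper name `pivot_rows` and the in-memory `ROWS` list are assumptions for illustration, filled with the sample table from the question.

```python
from collections import defaultdict

# Sample rows from the question: (id, code, proc1, proc2).
ROWS = [
    (1, "A", "p", "e"),
    (2, "B", "q", "f"),
    (3, "B", "p", "f"),
    (3, "B", "q", "h"),
    (3, "B", "r", "j"),
    (3, "C", "t", "k"),
]

def pivot_rows(rows, keys):
    """Emulate map() + collect_list() + concat_ws('') for the given keys."""
    # collected[(id, code)][key] plays the role of collect_list(group_map[key]).
    collected = defaultdict(lambda: {k: [] for k in keys})
    for id_, code, proc1, proc2 in rows:
        group_map = {proc1: proc2}        # map(proc1, proc2): one entry per row
        cols = collected[(id_, code)]     # every (id, code) group appears in the output
        for k in keys:
            value = group_map.get(k)      # a missing key yields NULL in Hive
            if value is not None:         # collect_list skips NULLs
                cols[k].append(value)
    # concat_ws('', list) joins the collected values; an empty list gives ''.
    return {
        group: {k: "".join(vals) for k, vals in cols.items()}
        for group, cols in collected.items()
    }

result = pivot_rows(ROWS, ["p", "q", "r", "t"])
for (id_, code), cols in sorted(result.items()):
    print(id_, code, cols)
```

Because (ID, Code, Proc1) is unique, each collected list holds at most one value, so the join never glues two values together.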

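    【Comments】:

    • Can you explain what you would do when the values are numeric? I saw your blog, but the code there doesn't match the table. hadoopmania.blogspot.com/2015/12/…
    • Does this example work when you have multiple columns to pivot?
    • Can this be generalized to cases where you may have values other than 'p', 'q', 'r', 't'?
    【Solution 2】:

    Here is the solution I ended up using: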

    add jar brickhouse-0.7.0-SNAPSHOT.jar;
    CREATE TEMPORARY FUNCTION collect AS 'brickhouse.udf.collect.CollectUDAF';
    
    select 
        id, 
        code,
        group_map['p'] as p,
        group_map['q'] as q,
        group_map['r'] as r,
        group_map['t'] as t
        from ( select
            id, code,
            collect(proc1,proc2) as group_map 
            from test_sample 
            group by id, code
        ) gm;
    

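    The collect UDAF (used here in place of to_map) comes from the Brickhouse repo: https://github.com/klout/brickhouse

    【Comments】:

    • I can't see a to_map UDF in the Brickhouse repo. Could you give more details about this?
    • You can use the "collect" UDAF, which is similar to to_map. Link: github.com/klout/brickhouse/blob/master/src/main/java/… You should replace "to_map" with "collect"; I have updated the solution accordingly.
    • Hi! I'm trying something similar. In your answer you have group_map['p'] etc., which implies you know the values in advance. How do you handle not knowing what the values in Proc1 are? Please share. Thanks!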
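On the question of not knowing the Proc1 values in advance: as far as I know, Hive needs the output columns to be fixed when the query is compiled, so one possible workaround is to fetch the distinct Proc1 values first and then generate the pivot query text in a driver script. The sketch below is an assumption-laden illustration, not something from the answers above: the function name `build_pivot_query` is hypothetical, it uses built-in conditional aggregation (`max(case when ...)`) instead of the Brickhouse UDAF, and it only accepts values that are simple identifiers.

```python
import re

def build_pivot_query(keys, table="test_sample"):
    """Build a HiveQL pivot query with one output column per value in keys."""
    columns = []
    for key in keys:
        # Only accept simple identifiers so the value can be used both as a
        # string literal and as a column alias without any escaping.
        if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", key):
            raise ValueError(f"unsupported Proc1 value: {key!r}")
        columns.append(
            f"    max(case when proc1 = '{key}' then proc2 end) as `{key}`"
        )
    select_list = ",\n".join(["    id", "    code"] + columns)
    return f"select\n{select_list}\nfrom {table}\ngroup by id, code"

# In practice the keys would come from a first query such as
#   select distinct proc1 from test_sample;
print(build_pivot_query(["p", "q", "r", "t"]))
```

The generated text would then be submitted as a second query. Since (ID, Code, Proc1) is unique, `max` only ever sees one non-NULL value per group and column.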
    【Solution 3】:

    Another solution.

    Pivot using Hivemall's to_map function.

    SELECT
      uid,
      kv['c1'] AS c1,
      kv['c2'] AS c2,
      kv['c3'] AS c3
    FROM (
      SELECT uid, to_map(key, value) kv
      FROM vtable
      GROUP BY uid
    ) t
    

     | uid | c1 | c2 | c3 |
     | 101 | 11 | 12 | 13 |
     | 102 | 21 | 22 | 23 |

    Unpivot

    SELECT t1.uid, t2.key, t2.value
    FROM htable t1
    LATERAL VIEW explode (map(
      'c1', c1,
      'c2', c2,
      'c3', c3
    )) t2 as key, value
    

     | uid | key | value |
     | 101 | c1  | 11    |
     | 101 | c2  | 12    |
     | 101 | c3  | 13    |
     | 102 | c1  | 21    |
     | 102 | c2  | 22    |
     | 102 | c3  | 23    |
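The unpivot step can be pictured as: for every row, build a map from column name to value, then emit one output row per map entry. Here is a small Python sketch of that idea, using the pivoted values shown above; the helper name `unpivot` and the `HTABLE` list are assumptions for illustration, and this emulates the behaviour rather than running Hive.

```python
# Pivoted rows, as in the htable example above.
HTABLE = [
    {"uid": 101, "c1": 11, "c2": 12, "c3": 13},
    {"uid": 102, "c1": 21, "c2": 22, "c3": 23},
]

def unpivot(rows, id_col, value_cols):
    """Emulate LATERAL VIEW explode(map('c1', c1, ...)) for the given columns."""
    out = []
    for row in rows:
        # map('c1', c1, 'c2', c2, 'c3', c3) for this row.
        kv = {col: row[col] for col in value_cols}
        # explode() emits one (key, value) row per map entry.
        for key, value in kv.items():
            out.append((row[id_col], key, value))
    return out

long_rows = unpivot(HTABLE, "uid", ["c1", "c2", "c3"])
for r in long_rows:
    print(r)
```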


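      【Solution 4】:

      I didn't write this code, but I think you can use some of the UDFs provided by Klout's Brickhouse: https://github.com/klout/brickhouse

      Specifically, you can use their collect, as described here: http://brickhouseconfessions.wordpress.com/2013/03/05/use-collect-to-avoid-the-self-join/

      Then explode the arrays (they will have different lengths) using the approach detailed in this post: http://brickhouseconfessions.wordpress.com/2013/03/07/exploding-multiple-arrays-at-the-same-time-with-numeric_ra

      【Comments】:

      • Thanks for writing. I don't need the collect UDAF, since it does the same thing as the map-aggregation UDAF I'm already using here. I can get the same result by using the key names in my map aggregation as the new columns. The real problem is that I want this to be dynamic, i.e. I don't know how many distinct 'Proc1' values I may end up with, and I want more columns to be created dynamically for each new 'Proc1'.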
      【Solution 5】:
      1. I created a dummy table named hive using the following query:

      create table hive (id Int,Code String, Proc1 String, Proc2 String);

      2. Load all the data into the table:
      insert into hive values('1','A','p','e');
      insert into hive values('2','B','q','f'); 
      insert into hive values('3','B','p','f');
      insert into hive values('3','B','q','h');
      insert into hive values('3','B','r','j');
      insert into hive values('3','C','t','k');
      
      3. Now use the query below to get the output.
      select id,code,
           case when collect_list(p)[0] is null then '' else collect_list(p)[0] end as p,
           case when collect_list(q)[0] is null then '' else collect_list(q)[0] end as q,
           case when collect_list(r)[0] is null then '' else collect_list(r)[0] end as r,
           case when collect_list(t)[0] is null then '' else collect_list(t)[0] end as t
           from(
                  select id, code,
                  case when proc1 ='p' then proc2 end as p,
                  case when proc1 ='q' then proc2 end as q,
                  case when proc1 ='r' then proc2 end as r,
                  case when proc1 ='t' then proc2 end as t
                  from hive
              ) dummy group by id,code;
      


        【Solution 6】:

        For numeric values, you can use the following Hive query:

        Sample data

        ID  cust_freq   Var1    Var2    frequency
        220444  1   16443   87128   72.10140547
        312554  6   984 7339    0.342452643
        220444  3   6201    87128   9.258396518
        220444  6   47779   87128   2.831972441
        312554  1   6055    7339    82.15209213
        312554  3   12868   7339    4.478333954
        220444  2   6705    87128   15.80822558
        312554  2   37432   7339    13.02712127
        
        select id, sum(a.group_map[1]) as One, sum(a.group_map[2]) as Two, sum(a.group_map[3]) as Three, sum(a.group_map[6]) as Six from
        ( select id, 
         map(cust_freq,frequency) as group_map 
         from table
         ) a group by a.id having id in 
        ( '220444',
        '312554');
        
        ID  one two three   six
        220444  72.10140547 15.80822558 9.258396518 2.831972441
        312554  82.15209213 13.02712127 4.478333954 0.342452643
        
        In the above example I haven't used any custom UDF; it uses only built-in Hive functions.
        Note: if the map keys are strings, write the key as a string literal, e.g. sum(a.group_map['1']) as One.
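The numeric variant works because each row contributes a one-entry map, lookups for other keys return NULL, and sum ignores NULLs within each id group. A minimal Python emulation of that logic on the sample data above is sketched here; the helper name `pivot_sum` and the `SAMPLE` list are assumptions for illustration, and only the id, cust_freq and frequency columns are used.

```python
from collections import defaultdict

# (id, cust_freq, frequency) taken from the sample data above.
SAMPLE = [
    (220444, 1, 72.10140547),
    (312554, 6, 0.342452643),
    (220444, 3, 9.258396518),
    (220444, 6, 2.831972441),
    (312554, 1, 82.15209213),
    (312554, 3, 4.478333954),
    (220444, 2, 15.80822558),
    (312554, 2, 13.02712127),
]

def pivot_sum(rows, keys):
    """Emulate sum(group_map[k]) grouped by id for each k in keys."""
    totals = defaultdict(lambda: {k: None for k in keys})
    for id_, cust_freq, frequency in rows:
        group_map = {cust_freq: frequency}   # map(cust_freq, frequency)
        cols = totals[id_]
        for k in keys:
            value = group_map.get(k)         # NULL when the key is absent
            if value is not None:            # sum() ignores NULLs
                cols[k] = value if cols[k] is None else cols[k] + value
    return dict(totals)

pivoted = pivot_sum(SAMPLE, [1, 2, 3, 6])
for id_ in sorted(pivoted):
    print(id_, pivoted[id_])
```

A column stays NULL (None here) when no row in the group carries that key, which matches how sum behaves over an all-NULL input.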
        


          【Solution 7】:

          To turn the 'Cost' and 'Sales' rows into separate columns, we can simply use the following self-join logic.

          SELECT Cost.Code, Cost.Product, Cost.Size
          , Cost.State_code, Cost.Promo_date, Cost.Cost, Sales.Price
          FROM
          (Select Code, Product, Size, State_code, Promo_date, Price as Cost
          FROM Product
          Where Description = 'Cost') Cost
          JOIN
          (Select Code, Product, Size, State_code, Promo_date, Price as Price
          FROM Product
          Where Description = 'Sales') Sales
          on (Cost.Code = Sales.Code
          and Cost.Promo_date = Sales.Promo_date);
          


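            【Solution 8】:

            The following is another way to reshape the data: it builds a map of the monthly columns and explodes it with LATERAL VIEW, turning the month columns into (Promo_date, Price) rows.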

            SELECT TM1_Code, Product, Size, State_code, Description
              , Promo_date
              , Price
            FROM (
            SELECT TM1_Code, Product, Size, State_code, Description
               , MAP('FY2018Jan', FY2018Jan, 'FY2018Feb', FY2018Feb, 'FY2018Mar', FY2018Mar, 'FY2018Apr', FY2018Apr
                    ,'FY2018May', FY2018May, 'FY2018Jun', FY2018Jun, 'FY2018Jul', FY2018Jul, 'FY2018Aug', FY2018Aug
                    ,'FY2018Sep', FY2018Sep, 'FY2018Oct', FY2018Oct, 'FY2018Nov', FY2018Nov, 'FY2018Dec', FY2018Dec) AS tmp_column
            FROM CS_ME_Spirits_30012018) TmpTbl
            LATERAL VIEW EXPLODE(tmp_column) exptbl AS Promo_date, Price;
            

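            【Comments】:

            • It would be helpful if you could explain your answer.
            【Solution 9】:

            You can achieve this with a CASE statement and some help from collect_set. A detailed answer is available at http://www.analyticshut.com/big-data/hive/pivot-rows-to-columns-in-hive/

            Here is the query for reference: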

            SELECT resource_id,
            CASE WHEN COLLECT_SET(quarter_1)[0] IS NULL THEN 0 ELSE COLLECT_SET(quarter_1)[0] END AS quarter_1_spends,
            CASE WHEN COLLECT_SET(quarter_2)[0] IS NULL THEN 0 ELSE COLLECT_SET(quarter_2)[0] END AS quarter_2_spends,
            CASE WHEN COLLECT_SET(quarter_3)[0] IS NULL THEN 0 ELSE COLLECT_SET(quarter_3)[0] END AS quarter_3_spends,
            CASE WHEN COLLECT_SET(quarter_4)[0] IS NULL THEN 0 ELSE COLLECT_SET(quarter_4)[0] END AS quarter_4_spends
            FROM (
            SELECT resource_id,
            CASE WHEN quarter='Q1' THEN amount END AS quarter_1,
            CASE WHEN quarter='Q2' THEN amount END AS quarter_2,
            CASE WHEN quarter='Q3' THEN amount END AS quarter_3,
            CASE WHEN quarter='Q4' THEN amount END AS quarter_4
            FROM billing_info)tbl1
            GROUP BY resource_id;
            
