【Question title】:Hive explode with sequence number
【Posted】:2014-06-10 05:31:36
【Question】:

The data looks like this:

col1  col2  pathstr
3   5   some_string_a> some_string_b>some_string_c
8   6   some_string_d> some_string_e>some_string_f

The third column, pathstr, holds path data whose order matters. I use the explode function as follows:

SELECT col1, col2, path
FROM table_paths
LATERAL VIEW explode(split(pathstr,'>')) subView as path;

This gives the following result:

3 5 some_string_a
3 5 some_string_b
3 5 some_string_c
8 6 some_string_d
8 6 some_string_e
8 6 some_string_f

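However, the exploded rows lose the order information of the path string. Can I generate an extra "sequence" column, as shown below? Or is there a better way to do this?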

3 5 some_string_a, 1
3 5 some_string_b, 2
3 5 some_string_c, 3
8 6 some_string_d, 1
8 6 some_string_e, 2
8 6 some_string_f, 3

【Comments】:

Tags: hive explode


【Solution 1】:

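You can use row_number(), rank(), or dense_rank() over the exploded rows. Note that row_number() takes no arguments, and that without an ORDER BY in the window the numbering is not guaranteed to follow the original order within pathstr.

SELECT col1, col2, t.path,
       row_number() OVER (PARTITION BY col1, col2) AS seq
FROM
(SELECT col1, col2, path
FROM table_paths
LATERAL VIEW explode(split(pathstr,'>')) subView AS path) t;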

【Comments】:

    【Solution 2】:

    You can use posexplode. It expands an array into two columns: the position in the array and the value.

    Example Hive query:

    hive> SELECT a.col1, a.col2, b.path, b.pos
    > FROM (
    >     SELECT 3 col1, 5 col2, 
    >         "some_string_a> some_string_b>some_string_c" pathstr
    >     UNION ALL
    >     SELECT 8 col1, 6 col2, 
    >         "some_string_d> some_string_e>some_string_f" pathstr
    > ) a
    > LATERAL VIEW POSEXPLODE(split(pathstr,'>')) b as pos, path
    > ;
    Total jobs = 1
    Launching Job 1 out of 1
    Number of reduce tasks is set to 0 since there's no reduce operator
    Starting Job = job_201708181020_16679, Tracking URL = /jobdetails.jsp?jobid=job_201708181020_16679
    Kill Command = /opt/mapr/hadoop/hadoop-0.20.2/bin/../bin/hadoop job  -kill job_201708181020_16679
    Hadoop job information for Stage-1: number of mappers: 0; number of reducers: 0
    2017-08-19 07:48:31,023 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 1.5 sec
    MapReduce Total cumulative CPU time: 1 seconds 500 msec
    Ended Job = job_201708181020_16679
    MapReduce Jobs Launched: 
    Job 0:  Cumulative CPU: 1.5 sec   MAPRFS Read: 264 MAPRFS Write: 80 SUCCESS
    Total MapReduce CPU Time Spent: 1 seconds 500 msec
    OK
    3   5   some_string_a   0
    3   5    some_string_b  1
    3   5   some_string_c   2
    8   6   some_string_d   0
    8   6    some_string_e  1
    8   6   some_string_f   2
    Time taken: 327.33 seconds, Fetched: 6 row(s)
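
The positions in the output above are 0-based, and the values keep the space that follows each '>' in pathstr. A minimal sketch, assuming Hive 0.13 or later (where posexplode is available) and the table_paths table from the question, that returns the 1-based sequence asked for and trims the stray whitespace. The seq alias is a name chosen here purely for illustration:

```sql
-- Sketch only: table_paths, col1, col2 and pathstr come from the question;
-- the seq alias is a name chosen here for illustration.
SELECT t.col1,
       t.col2,
       trim(b.path) AS path,   -- drop the space left after each '>'
       b.pos + 1    AS seq     -- posexplode positions start at 0
FROM table_paths t
LATERAL VIEW posexplode(split(t.pathstr, '>')) b AS pos, path;
```

Because pos comes from the array itself, seq reflects the order of the elements in pathstr without needing a window function.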
    

    【Comments】:

    • Why was this downvoted? It seems to be exactly what the OP needs (and certainly what I need...)