【Question Title】: Database Query for Getting Max, Min of a column and corresponding values from other columns and Total Record Count from a Single Table in Hive
【Posted】: 2020-05-24 05:00:25
【Question】:

I have the following dataset in a Hive table named PUBLISH.

Note that PUBLISH can contain duplicate records.

DATE    |HOUR|SOURCE|COL_TIMESTAMP              |ID
20200101|14  |A     |2020-01-01 14:18:53.016 GMT|ID_111
20200101|14  |A     |2020-01-01 14:18:53.012 GMT|ID_222
20200101|14  |A     |2020-01-01 14:18:53.016 GMT|ID_111
20200101|14  |A     |2020-01-01 14:18:53.019 GMT|ID_333
20200101|15  |C     |2020-01-01 15:18:53.016 GMT|ID_444
20200102|00  |A     |2020-01-01 15:18:53.016 GMT|ID_444

I want to generate the following output for a given date, hour, and source. For example, for (DATE=20200101 & HOUR=14 & SOURCE=A), the output should be:

DATE    |HOUR|SOURCE|MIN_TIMESTAMP              |START_ID|MAX_TIMESTAMP              |END_ID|RECORD_CNT
20200101|14  |A     |2020-01-01 14:18:53.012 GMT|ID_222  |2020-01-01 14:18:53.019 GMT|ID_333|3

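Note that the timestamps end with "GMT". I am also trying to run the query from Spark Java code. Please suggest a Hive query that performs well when the data volume is large.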

Tags: mysql hadoop hive hiveql groupwise-maximum


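【Solution 1】:

You should be able to use a subquery to determine the MIN and MAX timestamps for the given hour, along with the count of distinct rows, and then join it back to the main table to get the ID values at those timestamps: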

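-- DATE is a reserved keyword in recent Hive versions, hence the backticks.
SELECT DISTINCT P.`DATE`, P.HOUR, P.SOURCE,
       P.MIN_TIMESTAMP, P1.ID AS START_ID,
       P.MAX_TIMESTAMP, P2.ID AS END_ID,
       P.RECORD_CNT
FROM (
    -- One row for the requested slice: earliest and latest timestamp, plus the number of distinct records
    SELECT `DATE`, HOUR, SOURCE,
           MIN(COL_TIMESTAMP) AS MIN_TIMESTAMP,
           MAX(COL_TIMESTAMP) AS MAX_TIMESTAMP,
           COUNT(DISTINCT `DATE`, HOUR, SOURCE, COL_TIMESTAMP, ID) AS RECORD_CNT
    FROM PUBLISH
    WHERE `DATE` = '20200101'
      AND HOUR = 14
      AND SOURCE = 'A'
    GROUP BY `DATE`, HOUR, SOURCE
) P
-- Join back to pick up the ID recorded at the earliest timestamp
JOIN PUBLISH P1 ON P1.`DATE` = P.`DATE` AND P1.HOUR = P.HOUR AND P1.SOURCE = P.SOURCE AND P1.COL_TIMESTAMP = P.MIN_TIMESTAMP
-- Join back again to pick up the ID recorded at the latest timestamp
JOIN PUBLISH P2 ON P2.`DATE` = P.`DATE` AND P2.HOUR = P.HOUR AND P2.SOURCE = P.SOURCE AND P2.COL_TIMESTAMP = P.MAX_TIMESTAMP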

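This should perform well as long as you have an index on (DATE, HOUR, SOURCE). In Hive, where indexes were removed in version 3.0, partitioning the table by these columns plays a comparable role.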
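The join-back idea can be tried locally without a Hive cluster. The following is a minimal sketch, assuming SQLite (through Python's `sqlite3` module) as a stand-in for Hive and the column name `DT` in place of `DATE`, as in Solution 2. It is a variant rather than the query above: it deduplicates the slice once in a CTE (named `D` here, a hypothetical name), so neither the outer `DISTINCT` nor the multi-column `COUNT(DISTINCT ...)` is needed.

```python
import sqlite3

# SQLite stands in for Hive; the schema mirrors the question, with DATE renamed to DT.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE PUBLISH (DT TEXT, HOUR TEXT, SOURCE TEXT, COL_TIMESTAMP TEXT, ID TEXT)"
)
conn.executemany(
    "INSERT INTO PUBLISH VALUES (?, ?, ?, ?, ?)",
    [
        ("20200101", "14", "A", "2020-01-01 14:18:53.016 GMT", "ID_111"),
        ("20200101", "14", "A", "2020-01-01 14:18:53.012 GMT", "ID_222"),
        ("20200101", "14", "A", "2020-01-01 14:18:53.016 GMT", "ID_111"),
        ("20200101", "14", "A", "2020-01-01 14:18:53.019 GMT", "ID_333"),
        ("20200101", "15", "C", "2020-01-01 15:18:53.016 GMT", "ID_444"),
        ("20200102", "00", "A", "2020-01-01 15:18:53.016 GMT", "ID_444"),
    ],
)

query = """
WITH D AS (
    -- Deduplicate once, restricted to the requested slice
    SELECT DISTINCT DT, HOUR, SOURCE, COL_TIMESTAMP, ID
    FROM PUBLISH
    WHERE DT = '20200101' AND HOUR = '14' AND SOURCE = 'A'
),
AGG AS (
    -- One row per (DT, HOUR, SOURCE); timestamps compare as fixed-width strings
    SELECT DT, HOUR, SOURCE,
           MIN(COL_TIMESTAMP) AS MIN_TIMESTAMP,
           MAX(COL_TIMESTAMP) AS MAX_TIMESTAMP,
           COUNT(*)           AS RECORD_CNT
    FROM D
    GROUP BY DT, HOUR, SOURCE
)
SELECT AGG.DT, AGG.HOUR, AGG.SOURCE,
       AGG.MIN_TIMESTAMP, S.ID AS START_ID,
       AGG.MAX_TIMESTAMP, E.ID AS END_ID,
       AGG.RECORD_CNT
FROM AGG
JOIN D S ON S.DT = AGG.DT AND S.HOUR = AGG.HOUR AND S.SOURCE = AGG.SOURCE
        AND S.COL_TIMESTAMP = AGG.MIN_TIMESTAMP
JOIN D E ON E.DT = AGG.DT AND E.HOUR = AGG.HOUR AND E.SOURCE = AGG.SOURCE
        AND E.COL_TIMESTAMP = AGG.MAX_TIMESTAMP
"""

rows = conn.execute(query).fetchall()
for row in rows:
    print("|".join(str(value) for value in row))
```

On the sample data this prints the single row requested in the question. If two different IDs share the minimum or maximum timestamp, the joins return one row per matching ID, so a tie-breaking rule would be needed in that case.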

【Solution 2】:

Use analytic functions to get START_ID and END_ID, then aggregate:

with PUBLISH as ( --Use your_table instead of this CTE
select stack(6,
'20200101','14','A','2020-01-01 14:18:53.016 GMT','ID_111',
'20200101','14','A','2020-01-01 14:18:53.012 GMT','ID_222',
'20200101','14','A','2020-01-01 14:18:53.016 GMT','ID_111',
'20200101','14','A','2020-01-01 14:18:53.019 GMT','ID_333',
'20200101','15','C','2020-01-01 15:18:53.016 GMT','ID_444',
'20200102','00','A','2020-01-01 15:18:53.016 GMT','ID_444'
) as (DT, HOUR, SOURCE, COL_TIMESTAMP, ID)
)

select DT, HOUR, SOURCE,
       min(COL_TIMESTAMP) as MIN_TIMESTAMP,
       START_ID,
       max(COL_TIMESTAMP) as MAX_TIMESTAMP,
       END_ID,
       sum(case when rn=1 then 1 else 0 end) as RECORD_CNT --unique records have rn=1
 from
     (
      select DT, HOUR, SOURCE, COL_TIMESTAMP, ID,
             first_value(ID) over(partition by DT, HOUR, SOURCE order by COL_TIMESTAMP)      as START_ID, 
             first_value(ID) over(partition by DT, HOUR, SOURCE order by COL_TIMESTAMP desc) as END_ID,
             row_number() over(partition by DT, HOUR, SOURCE, COL_TIMESTAMP, ID order by ID) as rn --order by added because Spark SQL requires an ordered window for row_number()
        from PUBLISH p
     ) s
 group by DT, HOUR, SOURCE, START_ID, END_ID;

Result:

dt      |hour|source|min_timestamp              |start_id|max_timestamp              |end_id|record_cnt
20200101|14  |A     |2020-01-01 14:18:53.012 GMT|ID_222  |2020-01-01 14:18:53.019 GMT|ID_333|3
20200101|15  |C     |2020-01-01 15:18:53.016 GMT|ID_444  |2020-01-01 15:18:53.016 GMT|ID_444|1
20200102|00  |A     |2020-01-01 15:18:53.016 GMT|ID_444  |2020-01-01 15:18:53.016 GMT|ID_444|1
