【Question Title】: Hive select query return top 100 syntax error?
【Posted】: 2019-08-06 15:12:34
【Question Description】:

Here is my Hive query, taken directly from the TPC-DS toolkit:

WITH customer_total_return 
     AS (SELECT sr_customer_sk AS ctr_customer_sk, 
                sr_store_sk    AS ctr_store_sk, 
                Sum(sr_fee)    AS ctr_total_return 
         FROM   store_returns, 
                date_dim 
         WHERE  sr_returned_date_sk = d_date_sk 
                AND d_year = 2000 
         GROUP  BY sr_customer_sk, 
                   sr_store_sk) 
SELECT TOP 100 c_customer_id 
FROM   customer_total_return ctr1, 
       store, 
       customer 
WHERE  ctr1.ctr_total_return > (SELECT Avg(ctr_total_return) * 1.2 
                                FROM   customer_total_return ctr2 
                                WHERE  ctr1.ctr_store_sk = ctr2.ctr_store_sk) 
       AND s_store_sk = ctr1.ctr_store_sk 
       AND s_state = 'TN' 
       AND ctr1.ctr_customer_sk = c_customer_sk 
ORDER  BY c_customer_id; 

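However, I get the following error when I try to run it:

FAILED: ParseException line 11:11 cannot recognize input near 'TOP' '100' 'c_customer_id' in select target

My understanding is that TOP 100 is not valid syntax in HiveQL. How can I rewrite it correctly?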

【Comments】:

  • Use LIMIT. And proper JOIN syntax.

Tags: sql hive hiveql tpc


【Solution 1】:

Use LIMIT instead of TOP, and put it at the end of the query, after ORDER BY:

WITH customer_total_return 
     AS (SELECT sr_customer_sk AS ctr_customer_sk, 
                sr_store_sk    AS ctr_store_sk, 
                Sum(sr_fee)    AS ctr_total_return 
         FROM   store_returns, 
                date_dim 
         WHERE  sr_returned_date_sk = d_date_sk 
                AND d_year = 2000 
         GROUP  BY sr_customer_sk, 
                   sr_store_sk) 
SELECT c_customer_id 
FROM   customer_total_return ctr1, 
       store, 
       customer 
WHERE  ctr1.ctr_total_return > (SELECT Avg(ctr_total_return) * 1.2 
                                FROM   customer_total_return ctr2 
                                WHERE  ctr1.ctr_store_sk = ctr2.ctr_store_sk) 
       AND s_store_sk = ctr1.ctr_store_sk 
       AND s_state = 'TN' 
       AND ctr1.ctr_customer_sk = c_customer_sk 
ORDER  BY c_customer_id
LIMIT 100; 
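
Stripped of the TPC-DS details, the change is only a matter of where the row cap is written. The following is a minimal sketch that reuses the customer table and c_customer_id column from the question; the first statement is shown as a comment because TOP is the SQL Server spelling and Hive's parser rejects it.

```sql
-- SQL Server style, rejected by Hive with a ParseException near 'TOP':
--   SELECT TOP 100 c_customer_id FROM customer ORDER BY c_customer_id;

-- HiveQL: the row cap is a trailing LIMIT clause, written after ORDER BY.
SELECT c_customer_id
FROM   customer
ORDER  BY c_customer_id
LIMIT  100;
```

Because the LIMIT follows the ORDER BY, the 100 rows returned are the first 100 in sorted order, which is what TOP 100 with the same ORDER BY would select.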

【Comments】:

    【Solution 2】:

    This is a bad example of a multi-level query. I would suggest:

    WITH customer_total_return AS (
          SELECT sr.sr_customer_sk AS ctr_customer_sk, 
                 sr.sr_store_sk  AS ctr_store_sk, 
                 SUM(sr.sr_fee) AS ctr_total_return,
                 AVG(SUM(sr.sr_fee)) OVER (PARTITION BY sr.sr_store_sk) as avg_store_sr_fee
          FROM store_returns sr JOIN
               date_dim d
               ON sr.sr_returned_date_sk = d.d_date_sk 
          WHERE d.d_year = 2000 
          GROUP  BY sr.sr_customer_sk, sr.sr_store_sk
         ) 
    SELECT c.c_customer_id 
    FROM customer_total_return ctr JOIN
         store s
         ON s.s_store_sk = ctr.ctr_store_sk JOIN
         customer c
         ON ctr.ctr_customer_sk = c.c_customer_sk
    WHERE ctr.ctr_total_return > 1.2 * ctr.avg_store_sr_fee AND
          s.s_state = 'TN'  
    ORDER  BY c.c_customer_id
    LIMIT 100;
    

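    Notes:

    • Never use commas in the FROM clause. Always use proper, explicit, standard JOIN syntax.
    • Qualify all column references, especially when the query references more than one table.
    • No subquery is needed to compute the average.
    • Hive uses LIMIT, not TOP.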
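
The per-store average can also be computed in a separate step rather than by nesting AVG(SUM(...)) OVER inside the grouped query. The version below is a sketch of that variant, not a tested replacement: it assumes the same TPC-DS tables and columns as above, and the CTE name customer_total_return_avg and the column name avg_store_return are hypothetical names introduced only for illustration.

```sql
-- Step 1: total return fee per (customer, store) for returns dated in 2000.
WITH customer_total_return AS (
      SELECT sr.sr_customer_sk AS ctr_customer_sk,
             sr.sr_store_sk    AS ctr_store_sk,
             SUM(sr.sr_fee)    AS ctr_total_return
      FROM store_returns sr
      JOIN date_dim d
        ON sr.sr_returned_date_sk = d.d_date_sk
      WHERE d.d_year = 2000
      GROUP BY sr.sr_customer_sk, sr.sr_store_sk
     ),
-- Step 2: attach each store's average of those totals to every row.
-- (customer_total_return_avg and avg_store_return are hypothetical names.)
     customer_total_return_avg AS (
      SELECT ctr.ctr_customer_sk,
             ctr.ctr_store_sk,
             ctr.ctr_total_return,
             AVG(ctr.ctr_total_return)
               OVER (PARTITION BY ctr.ctr_store_sk) AS avg_store_return
      FROM customer_total_return ctr
     )
-- Step 3: keep customers whose total exceeds 1.2x their store's average.
SELECT c.c_customer_id
FROM customer_total_return_avg cta
JOIN store s
  ON s.s_store_sk = cta.ctr_store_sk
JOIN customer c
  ON cta.ctr_customer_sk = c.c_customer_sk
WHERE cta.ctr_total_return > 1.2 * cta.avg_store_return
  AND s.s_state = 'TN'
ORDER BY c.c_customer_id
LIMIT 100;
```

Splitting the aggregation and the window function into two CTEs keeps each step to a single kind of operation, at the cost of a slightly longer query. The filter is the same as in the original correlated subquery: each customer's total is compared with the average total across customers of the same store.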

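    【Comments】:

    • This is a standardized query designed to stress-test hardware :)
    • @crystyxn . . . It should still be written correctly.
    • @crystyxn . . . Early versions of Hive didn't even support commas in the FROM clause.
    • When I try to run your version of the query, I get ParseException line 6:27 cannot recognize input near 'sr' '.' 'JOIN' in table source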