【问题标题】:Hive on spark, partition pruning, better undertandingHive on spark,分区修剪,更好的理解
【发布时间】:2016-11-08 17:14:20
【问题描述】:

我有一个使用 SQL/HQL 语言的 spark 1.6.2 代码。 我真的很想了解我的工作是否在进行分区修剪。 数据按日期分区(cdate 字段) 解释计划是:

== Physical Plan ==
Project [coalesce(cdate#74,cdate#38) AS cdate#29,coalesce(account_key#75,account_key#34) AS account_key#30,coalesce(product#76,product#35) AS product#31,(coalesce(amount#77,0.0) + coalesce(amount#36,0.0)) AS amount#32,(coalesce(volume#78L,0) + cast(coalesce(volume#37,0) as bigint)) AS volume#33L]
+- SortMergeOuterJoin [account_key#34,cdate#38,product#35], [account_key#75,cdate#74,product#76], FullOuter, None
   :- Sort [account_key#34 ASC,cdate#38 ASC,product#35 ASC], false, 0
   :  +- TungstenExchange hashpartitioning(account_key#34,cdate#38,product#35,200), None
   :     +- Project [volume#37,product#35,cdate#38,account_key#34,amount#36]
   :        +- BroadcastHashJoin [cdate#38], [cdate#24], BuildLeft
   :           :- Scan ParquetRelation[account_key#34,product#35,amount#36,volume#37,cdate#38] InputPaths: hdfs://hdp1.voicelab.local:8020/apps/hive/warehouse/my.db/daily_profiles
   :           +- TungstenAggregate(key=[cdate#24], functions=[], output=[cdate#24])
   :              +- TungstenExchange hashpartitioning(cdate#24,200), None
   :                 +- TungstenAggregate(key=[cdate#24], functions=[], output=[cdate#24])
   :                    +- Project [cdate#24]
   :                       +- TungstenAggregate(key=[cdate#20,accountKey#21,product#22], functions=[], output=[cdate#24])
   :                          +- TungstenExchange hashpartitioning(cdate#20,accountKey#21,product#22,200), None
   :                             +- TungstenAggregate(key=[cdate#20,accountKey#21,product#22], functions=[], output=[cdate#20,accountKey#21,product#22])
   :                                +- Project [cdate#20,accountKey#21,product#22]
   :                                   +- Scan ExistingRDD[cdate#20,accountKey#21,product#22,amount#23]
   +- Sort [account_key#75 ASC,cdate#74 ASC,product#76 ASC], false, 0
      +- TungstenExchange hashpartitioning(account_key#75,cdate#74,product#76,200), None
         +- TungstenAggregate(key=[cdate#20,accountKey#21,product#22], functions=[(sum(amount#23),mode=Final,isDistinct=false),(count(1),mode=Final,isDistinct=false)], output=[cdate#74,account_key#75,product#76,amount#77,volume#78L])
            +- TungstenExchange hashpartitioning(cdate#20,accountKey#21,product#22,200), None
               +- TungstenAggregate(key=[cdate#20,accountKey#21,product#22], functions=[(sum(amount#23),mode=Partial,isDistinct=false),(count(1),mode=Partial,isDistinct=false)], output=[cdate#20,accountKey#21,product#22,sum#54,count#55L])
                  +- Scan ExistingRDD[cdate#20,accountKey#21,product#22,amount#23]

如何确定我的工作是否使用元存储来进行分区修剪。

您能否详细介绍一下 Scan ParquetRelation?我怎么知道使用分区修剪/发现的扫描? 字段#SOME_NUMBER 即 account_key#34 的含义是什么

用例是按日期、帐户、产品聚合数据

【问题讨论】:

    标签: apache-spark hive


    【解决方案1】:

    在物理计划中查找 PartitionFilters: [... ]。如果数组有一个非空值,它使用,否则没有。我在你的计划中找不到,除非我错过了或找不到。

    【讨论】:

      猜你喜欢
      • 1970-01-01
      • 1970-01-01
      • 1970-01-01
      • 1970-01-01
      • 1970-01-01
      • 2020-03-07
      • 2019-10-27
      • 2020-03-09
      • 1970-01-01
      相关资源
      最近更新 更多