Why is a reducer required in a Hive insert?

[Question title]: Why is a reducer required in a Hive insert?
[Posted]: 2021-07-21 22:20:05
[Question description]:

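My question is about how the MapReduce job works when an insert into statement is run from the Hive command line. Inserting a record into an internal (managed) Hive table involves no aggregation, so why is a reducer invoked? I would expect a map-only job. What is the reducer doing here?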

insert into table values (1,1);

INFO  : Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
INFO  : 2021-04-28 10:30:26,487 Stage-1 map = 0%,  reduce = 0%
INFO  : 2021-04-28 10:30:30,604 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.96 sec
INFO  : 2021-04-28 10:30:36,774 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 3.35 sec
INFO  : MapReduce Total cumulative CPU time: 3 seconds 350 msec

hive> set hive.merge.mapfiles;
hive.merge.mapfiles=true
hive> set hive.merge.mapredfiles;
hive.merge.mapredfiles=false
hive> set mapreduce.job.reduces;
mapreduce.job.reduces=-1

explain insert into test values (10,14);

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-7 depends on stages: Stage-1 , consists of Stage-4, Stage-3, Stage-5
  Stage-4
  Stage-0 depends on stages: Stage-4, Stage-3, Stage-6
  Stage-2 depends on stages: Stage-0
  Stage-3
  Stage-5
  Stage-6 depends on stages: Stage-5

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: _dummy_table
            Row Limit Per Split: 1
            Statistics: Num rows: 1 Data size: 10 Basic stats: COMPLETE Column stats: COMPLETE
            Select Operator
              expressions: array(const struct(10,14)) (type: array<struct<col1:int,col2:int>>)
              outputColumnNames: _col0
              Statistics: Num rows: 1 Data size: 48 Basic stats: COMPLETE Column stats: COMPLETE
              UDTF Operator
                Statistics: Num rows: 1 Data size: 48 Basic stats: COMPLETE Column stats: COMPLETE
                function name: inline
                Select Operator
                  expressions: col1 (type: int), col2 (type: int)
                  outputColumnNames: _col0, _col1
                  Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE
                  File Output Operator
                    compressed: false
                    Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE
                    table:
                        input format: org.apache.hadoop.mapred.TextInputFormat
                        output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                        serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                  Select Operator
                    expressions: _col0 (type: int), _col1 (type: int)
                    outputColumnNames: i, j
                    Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE
                    Group By Operator
                      aggregations: compute_stats(i, 'hll'), compute_stats(j, 'hll')
                      mode: hash
                      outputColumnNames: _col0, _col1
                      Statistics: Num rows: 1 Data size: 848 Basic stats: COMPLETE Column stats: COMPLETE
                      Reduce Output Operator
                        sort order: 
                        Statistics: Num rows: 1 Data size: 848 Basic stats: COMPLETE Column stats: COMPLETE
                        value expressions: _col0 (type: struct<columntype:string,min:bigint,max:bigint,countnulls:bigint,bitvector:binary>), _col1 (type: struct<columntype:string,min:bigint,max:bigint,countnulls:bigint,bitvector:binary>)
      Reduce Operator Tree:
        Group By Operator
          aggregations: compute_stats(VALUE._col0), compute_stats(VALUE._col1)
          mode: mergepartial
          outputColumnNames: _col0, _col1
          Statistics: Num rows: 1 Data size: 880 Basic stats: COMPLETE Column stats: COMPLETE
          File Output Operator
            compressed: false
            Statistics: Num rows: 1 Data size: 880 Basic stats: COMPLETE Column stats: COMPLETE
            table:
                input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-7
    Conditional Operator

  Stage: Stage-4
    Move Operator
      files:
          hdfs directory: true
          destination:<path>
  Stage: Stage-0
    Move Operator
      tables:
          replace: false
          table:
              input format: org.apache.hadoop.mapred.TextInputFormat
              output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
              serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-2
    Stats Work
      Basic Stats Work:
      Column Stats Desc:
          Columns: i, j
          Column Types: int, int
          Table: db.test.test

  Stage: Stage-3
    Map Reduce
      Map Operator Tree:
          TableScan
            File Output Operator
              compressed: false
              table:
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                  name: db.test

  Stage: Stage-5
    Map Reduce
      Map Operator Tree:
          TableScan
            File Output Operator
              compressed: false
              table:
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                  name: db.test

  Stage: Stage-6
    Move Operator
      files:
          hdfs directory: true
          destination: <PATH>
          
Time taken: 5.123 seconds, Fetched: 121 row(s)

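[Comments on the question]:

Please provide the explain output. Also check the hive.merge.mapfiles and hive.merge.mapredfiles properties, as well as mapreduce.job.reduces.
@leftjoin I have updated the question with the explain plan. I just want to know why a reducer is needed here.
@leftjoin There is also another table. When I run a normal insert into that one, the log says there is no reduce operator, so the number of reducers is 0.
My guess is that this is automatic statistics gathering, but I am not sure what exactly the Conditional Operator does... Perhaps it checks whether statistics exist and triggers a statistics-gathering job for the final table.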

Tags: hive mapreduce bigdata hiveql hadoop2

[Solution 1]:

It looks like you have automatic statistics gathering enabled:

SET hive.stats.autogather=true;

The reducer is computing the statistics:

Reduce Operator Tree:
        Group By Operator
          aggregations: compute_stats(VALUE._col0), compute_stats(VALUE._col1)
          mode: mergepartial
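
One way to test this diagnosis is to switch statistics gathering off for the current session and compare the plan. The sketch below reuses the test table from the question and assumes a Hive version that has hive.stats.column.autogather, the property that governs automatic column statistics; older releases may not have it. If the diagnosis holds, the Group By Operator with compute_stats and the Reduce Operator Tree should no longer appear in Stage-1.

```sql
-- Show the current values first, so they can be restored afterwards.
SET hive.stats.autogather;
SET hive.stats.column.autogather;  -- assumption: available in this Hive version

-- Disable automatic statistics gathering for this session only.
SET hive.stats.autogather=false;
SET hive.stats.column.autogather=false;

-- Re-check the plan and look for compute_stats and a Reduce Operator Tree.
EXPLAIN INSERT INTO test VALUES (10,14);

-- Column statistics can still be collected later, on demand.
ANALYZE TABLE test COMPUTE STATISTICS FOR COLUMNS;
```

Turning the properties off per session keeps the cluster defaults untouched. The trade-off is that the optimizer has no fresh column statistics for the table until ANALYZE TABLE is run.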

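[Comments]:

Yes, hive.stats.autogather=true. But then it should also gather statistics when inserting into the other table, right? For the other table it does not.
For the other table: "Number of reduce tasks is set to 0 since there's no reduce operator" Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0 2021-04-29 11:05:03,750 Stage-1 map = 0%, reduce = 0% 2021-04-29 11:05:08,043 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.76 sec
@shobhit Yes, that is strange. Is there any difference in the table DDL? Did both tables contain some data before the insert, or were they empty? ... Although hive.merge.mapredfiles=false, so in neither case should there be a merge with existing files... Interesting.
@shobhit Is there any difference in the table DDL?
Yes, I can see one difference: COLUMN_STATS_ACCURATE {\"BASIC_STATS\":\"true\"} is present on one of the tables. I checked the table parameters with describe formatted.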
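
The comparison described in the last comment could be run roughly as follows. db.test is the table named in the explain plan; db.other_table is a hypothetical name for the second table, which the thread never identifies.

```sql
-- The "Table Parameters" section of the output lists COLUMN_STATS_ACCURATE
-- along with the other table-level properties.
DESCRIBE FORMATTED db.test;
DESCRIBE FORMATTED db.other_table;  -- hypothetical table name

-- Compare the DDL of the two tables, as asked in the comments.
SHOW CREATE TABLE db.test;
SHOW CREATE TABLE db.other_table;   -- hypothetical table name
```

Comparing both outputs side by side shows whether the tables differ only in their statistics parameters or also in storage format, partitioning, or other DDL details.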