Analytic function problem: array index out of bounds during vectorized execution

The error message is as follows:

Diagnostics report from attempt_1479210500211_159364_m_000003_0: Error: java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row 
	at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:172)
	at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
	at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:453)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
	at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
	at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row 
	at org.apache.hadoop.hive.ql.exec.vector.VectorMapOperator.process(VectorMapOperator.java:52)
	at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:163)
	... 8 more
Caused by: java.lang.ArrayIndexOutOfBoundsException: 19
	at org.apache.hadoop.hive.ql.exec.vector.VectorExtractRow.setBatch(VectorExtractRow.java:706)
	at org.apache.hadoop.hive.ql.exec.vector.VectorExtractRowDynBatch.setBatchOnEntry(VectorExtractRowDynBatch.java:34)
	at org.apache.hadoop.hive.ql.exec.vector.VectorReduceSinkOperator.process(VectorReduceSinkOperator.java:89)
	at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:838)
	at org.apache.hadoop.hive.ql.exec.vector.VectorFilterOperator.process(VectorFilterOperator.java:117)
	at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:838)
	at org.apache.hadoop.hive.ql.exec.TableScanOperator.process(TableScanOperator.java:97)
	at org.apache.hadoop.hive.ql.exec.MapOperator$MapOpCtx.forward(MapOperator.java:164)
	at org.apache.hadoop.hive.ql.exec.vector.VectorMapOperator.process(VectorMapOperator.java:45)
	... 9 more

The SQL statement:

select device_id,
       row_number() over(partition by device_id order by action_timestamp) cn
  from edw_log.dwd_esf_edw_service_log_di
 where dt = '20160519'
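For reference, execution plans like the ones below can be printed without running the query by prefixing the statement with Hive's EXPLAIN keyword:

```sql
explain
select device_id,
       row_number() over(partition by device_id order by action_timestamp) cn
  from edw_log.dwd_esf_edw_service_log_di
 where dt = '20160519';
```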

The SQL execution plan:

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: dwd_esf_edw_service_log_di
            filterExpr: (dt = '20160519') (type: boolean)
            Statistics: Num rows: 5578978 Data size: 7566973806 Basic stats: COMPLETE Column stats: NONE
            Reduce Output Operator
              key expressions: device_id (type: string), action_timestamp (type: string)
              sort order: ++
              Map-reduce partition columns: device_id (type: string)
              Statistics: Num rows: 5578978 Data size: 7566973806 Basic stats: COMPLETE Column stats: NONE
      Execution mode: vectorized
      Reduce Operator Tree:
        Select Operator
          expressions: KEY.reducesinkkey0 (type: string), KEY.reducesinkkey1 (type: string)
          outputColumnNames: _col0, _col8
          Statistics: Num rows: 5578978 Data size: 7566973806 Basic stats: COMPLETE Column stats: NONE
          PTF Operator
            Function definitions:
                Input definition
                  input alias: ptf_0
                  output shape: _col0: string, _col8: string
                  type: WINDOWING
                Windowing table definition
                  input alias: ptf_1
                  name: windowingtablefunction
                  order by: _col8
                  partition by: _col0
                  raw input shape:
                  window functions:
                      window function definition
                        alias: row_number_window_0
                        name: row_number
                        window function: GenericUDAFRowNumberEvaluator
                        window frame: PRECEDING(MAX)~FOLLOWING(MAX)
                        isPivotResult: true
            Statistics: Num rows: 5578978 Data size: 7566973806 Basic stats: COMPLETE Column stats: NONE
            Select Operator
              expressions: _col0 (type: string), row_number_window_0 (type: int)
              outputColumnNames: _col0, _col1
              Statistics: Num rows: 5578978 Data size: 7566973806 Basic stats: COMPLETE Column stats: NONE
              File Output Operator
                compressed: false
                Statistics: Num rows: 5578978 Data size: 7566973806 Basic stats: COMPLETE Column stats: NONE
                table:
                    input format: org.apache.hadoop.mapred.TextInputFormat
                    output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                    serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

Analyzing the execution plan shows that vectorized query mode is enabled:

Execution mode: vectorized

Everything else in the plan looks normal, so the problem likely lies here; the stack trace also contains many frames related to vectorized query execution.

As a workaround, disable vectorized query execution via a session parameter:

set hive.vectorized.execution.enabled=false;
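This setting is session-scoped, so it can be turned off just for the affected query and restored afterwards. A sketch of that pattern (the property name and default are from the Hive documentation):

```sql
-- Disable vectorization for this session only, run the affected query...
set hive.vectorized.execution.enabled=false;

select device_id,
       row_number() over(partition by device_id order by action_timestamp) cn
  from edw_log.dwd_esf_edw_service_log_di
 where dt = '20160519';

-- ...then restore the setting for subsequent queries in the same session.
set hive.vectorized.execution.enabled=true;
```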

Checking the execution plan again, it has changed (the "Execution mode: vectorized" line is gone), and the SQL now runs successfully:

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: dwd_esf_edw_service_log_di
            filterExpr: (dt = '20160519') (type: boolean)
            Statistics: Num rows: 5578978 Data size: 7566973806 Basic stats: COMPLETE Column stats: NONE
            Reduce Output Operator
              key expressions: device_id (type: string), action_timestamp (type: string)
              sort order: ++
              Map-reduce partition columns: device_id (type: string)
              Statistics: Num rows: 5578978 Data size: 7566973806 Basic stats: COMPLETE Column stats: NONE
      Reduce Operator Tree:
        Select Operator
          expressions: KEY.reducesinkkey0 (type: string), KEY.reducesinkkey1 (type: string)
          outputColumnNames: _col0, _col8
          Statistics: Num rows: 5578978 Data size: 7566973806 Basic stats: COMPLETE Column stats: NONE
          PTF Operator
            Function definitions:
                Input definition
                  input alias: ptf_0
                  output shape: _col0: string, _col8: string
                  type: WINDOWING
                Windowing table definition
                  input alias: ptf_1
                  name: windowingtablefunction
                  order by: _col8
                  partition by: _col0
                  raw input shape:
                  window functions:
                      window function definition
                        alias: row_number_window_0
                        name: row_number
                        window function: GenericUDAFRowNumberEvaluator
                        window frame: PRECEDING(MAX)~FOLLOWING(MAX)
                        isPivotResult: true
            Statistics: Num rows: 5578978 Data size: 7566973806 Basic stats: COMPLETE Column stats: NONE
            Select Operator
              expressions: _col0 (type: string), row_number_window_0 (type: int)
              outputColumnNames: _col0, _col1
              Statistics: Num rows: 5578978 Data size: 7566973806 Basic stats: COMPLETE Column stats: NONE
              File Output Operator
                compressed: false
                Statistics: Num rows: 5578978 Data size: 7566973806 Basic stats: COMPLETE Column stats: NONE
                table:
                    input format: org.apache.hadoop.mapred.TextInputFormat
                    output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                    serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

The suspected root cause is that a NULL value causes an out-of-bounds array access during vectorized row extraction; this still needs to be verified.
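One way to test that hypothesis (a sketch, not verified here): re-enable vectorization but exclude rows whose partition or order keys are NULL. If the query then succeeds, NULL handling in the vectorized path is the likely culprit.

```sql
-- Hypothetical verification: with vectorization back on, exclude NULL keys.
set hive.vectorized.execution.enabled=true;

select device_id,
       row_number() over(partition by device_id order by action_timestamp) cn
  from edw_log.dwd_esf_edw_service_log_di
 where dt = '20160519'
   and device_id is not null
   and action_timestamp is not null;
```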

Reference:

https://cwiki.apache.org/confluence/display/Hive/Vectorized+Query+Execution
