Using the SequenceFile File Format with Impala Tables (Translation)

Date: 2019-11-10

Cloudera Impala supports using SequenceFile data files.

See the following sections for details about using SequenceFile data files with Impala tables:

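Creating SequenceFile Tables and Loading Data

If you do not have an existing data file to use, begin by creating one in the appropriate format.

To create a SequenceFile table:

In impala-shell, issue a command similar to: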

create table sequencefile_table (column_specs) stored as sequencefile;

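Because Impala can query some kinds of tables that it cannot currently write to, after creating tables of certain file formats you might need to load the data through the Hive shell. See How Impala Works with Hadoop File Formats for details. After loading data into a table through Hive or another mechanism outside of Impala, issue a REFRESH table_name statement the next time you connect to the Impala node, before querying the table, so that Impala recognizes the newly added data.

For example, here is how you might create SequenceFile tables in Impala (by specifying the columns explicitly, or by cloning the structure of another table), load data through Hive, and then query them through Impala: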

$ impala-shell -i localhost
[localhost:21000] > create table seqfile_table (x int) stored as sequencefile;
[localhost:21000] > create table seqfile_clone like some_other_table stored as sequencefile;
[localhost:21000] > quit;

$ hive
hive> insert into table seqfile_table select x from some_other_table;
3 Rows loaded to seqfile_table
Time taken: 19.047 seconds
hive> quit;

$ impala-shell -i localhost
[localhost:21000] > select * from seqfile_table;
Returned 0 row(s) in 0.23s
[localhost:21000] > -- Make Impala recognize the data loaded through Hive;
[localhost:21000] > refresh seqfile_table;
[localhost:21000] > select * from seqfile_table;
+---+
| x |
+---+
| 1 |
| 2 |
| 3 |
+---+
Returned 3 row(s) in 0.23s
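
The same load-then-refresh sequence can also be scripted instead of typed interactively. The following is a minimal sketch, not a definitive implementation: it assumes the `hive -e` and `impala-shell -i ... -q` command-line options are available in your installation, and `IMPALA_HOST`, `DRY_RUN`, `run`, and `load_and_refresh` are hypothetical names introduced only for this example. By default it prints the commands rather than running them.

```shell
#!/bin/sh
# Sketch: load through Hive, then refresh and query through Impala.
# Table names come from the session above; IMPALA_HOST and DRY_RUN are
# hypothetical settings introduced for this example.
IMPALA_HOST="${IMPALA_HOST:-localhost}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = run them

run() {
  if [ "$DRY_RUN" = "1" ]; then
    # Print the command (without shell quoting) instead of executing it.
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

load_and_refresh() {
  # Step 1: Hive writes the SequenceFile data.
  run hive -e "insert into table seqfile_table select x from some_other_table;"
  # Step 2: Impala reloads the table metadata so that it sees the new files.
  run impala-shell -i "$IMPALA_HOST" -q "refresh seqfile_table"
  # Step 3: the query now returns the rows loaded through Hive.
  run impala-shell -i "$IMPALA_HOST" -q "select * from seqfile_table"
}

load_and_refresh
```

Setting `DRY_RUN=0` in the environment makes the script execute the commands against a real cluster. Keeping the refresh as its own step mirrors the order used in the interactive session.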

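Enabling Compression for SequenceFile Tables

You may want to enable compression on existing tables. Enabling compression improves performance in most cases and is supported for SequenceFile tables. For example, to enable Snappy compression, you would specify the following additional settings when loading data through the Hive shell: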

hive> SET hive.exec.compress.output=true;
hive> SET mapred.max.split.size=256000000;
hive> SET mapred.output.compression.type=BLOCK;
hive> SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
hive> insert overwrite table new_table select * from old_table;

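If you are converting partitioned tables, you must complete additional steps. In that case, specify additional settings similar to the following:

hive> create table new_table (your_cols) partitioned by (partition_cols) stored as new_format;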
hive> SET hive.exec.dynamic.partition.mode=nonstrict;
hive> SET hive.exec.dynamic.partition=true;
hive> insert overwrite table new_table partition(comma_separated_partition_cols) select * from old_table;

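Remember that Hive does not require you to specify a source format. Consider converting a table with two partition columns, year and month, to a Snappy-compressed SequenceFile. Combining the components outlined previously to complete this table conversion, you would specify settings similar to the following: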

hive> create table TBL_SEQ (int_col int, string_col string) STORED AS SEQUENCEFILE;
hive> SET hive.exec.compress.output=true;
hive> SET mapred.max.split.size=256000000;
hive> SET mapred.output.compression.type=BLOCK;
hive> SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
hive> SET hive.exec.dynamic.partition.mode=nonstrict;
hive> SET hive.exec.dynamic.partition=true;
hive> INSERT OVERWRITE TABLE tbl_seq SELECT * FROM tbl;

To complete a similar process for a table that includes partitions, you would specify settings similar to the following:

hive> CREATE TABLE tbl_seq (int_col INT, string_col STRING) PARTITIONED BY (year INT) STORED AS SEQUENCEFILE;
hive> SET hive.exec.compress.output=true;
hive> SET mapred.max.split.size=256000000;
hive> SET mapred.output.compression.type=BLOCK;
hive> SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
hive> SET hive.exec.dynamic.partition.mode=nonstrict;
hive> SET hive.exec.dynamic.partition=true;
hive> INSERT OVERWRITE TABLE tbl_seq PARTITION(year) SELECT * FROM tbl;
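
The text mentions a table with two partition columns, year and month, while the example above partitions by year only. The sketch below shows what the two-column case might look like. It assumes the source table `tbl` has the columns `int_col`, `string_col`, `year`, and `month` (the last two as INT), which the original does not state, and `partitioned_conversion_hql` is a hypothetical helper name. With dynamic partitioning, Hive expects the partition columns to come last in the SELECT list, in the same order as in the PARTITION clause, so the sketch names the columns explicitly rather than relying on `SELECT *`.

```shell
#!/bin/sh
# Sketch: print HiveQL that converts a table partitioned by year and month
# to a Snappy-compressed SequenceFile table.
# Assumed (not from the original): tbl has columns
# int_col, string_col, year, month.
partitioned_conversion_hql() {
  cat <<'EOF'
CREATE TABLE tbl_seq (int_col INT, string_col STRING)
  PARTITIONED BY (year INT, month INT) STORED AS SEQUENCEFILE;
SET hive.exec.compress.output=true;
SET mapred.max.split.size=256000000;
SET mapred.output.compression.type=BLOCK;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.exec.dynamic.partition=true;
-- Partition columns go last, in the same order as the PARTITION clause.
INSERT OVERWRITE TABLE tbl_seq PARTITION(year, month)
  SELECT int_col, string_col, year, month FROM tbl;
EOF
}

partitioned_conversion_hql
```

One way to use this is to redirect the output into a file and pass that file to `hive -f`. After the Hive job finishes, a REFRESH of tbl_seq in Impala is still needed before querying the table.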

Note:

The compression codec is specified with the following command:

SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;

You can specify an alternative codec here, such as GzipCodec.
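
As an illustration of swapping the codec, the sketch below prints the same Hive settings used earlier with the codec class name filled in from an argument. `compression_settings` is a hypothetical helper, not part of Hive or Impala. It assumes the codec class lives in the `org.apache.hadoop.io.compress` package, as SnappyCodec and GzipCodec do.

```shell
#!/bin/sh
# Sketch: print the Hive compression settings for a chosen codec.
# compression_settings is a hypothetical helper introduced for this example.
compression_settings() {
  codec="${1:-SnappyCodec}"   # class name, e.g. SnappyCodec or GzipCodec
  cat <<EOF
SET hive.exec.compress.output=true;
SET mapred.max.split.size=256000000;
SET mapred.output.compression.type=BLOCK;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.${codec};
EOF
}

# Print the settings for gzip followed by the conversion statement.
compression_settings GzipCodec
echo "insert overwrite table new_table select * from old_table;"
```

The printed statements can be pasted at the hive> prompt or saved to a file and run with `hive -f`. Only the last SET line differs from the Snappy examples.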