Learning Hadoop from Scratch (16): Hive Data Import and Export, Cluster Data Migration (Part 1)

Date: 2019-11-20

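This article is jointly copyrighted by mephisto and 博客园 (cnblogs). Reposting is welcome, provided that this notice is retained and a link to the original article is given. Thank you for your cooperation.

This article was written by mephisto. SourceLink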

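Preface

In the previous article we gave a brief description and walkthrough of Hive table operations. In practice you will often need to import and export data. Tools such as Sqoop can move relational data in and out, but sometimes a much simpler approach is all you need.

Below we cover importing data into Hive, exporting it, and migrating data between clusters.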

Importing Files into Hive

1: Syntax
LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)]
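2: Importing from the Local File System

Use the LOCAL keyword to load a file from the local file system.

3: Importing from the Cluster

Omit LOCAL from the statement. The path then refers to a file on the cluster's file system (HDFS), and Hive moves that file into the table's storage location.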

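4: OVERWRITE

With OVERWRITE, any existing contents of the target table (or target partition) are deleted and replaced by the loaded file. Without it, the loaded file is added alongside the existing data.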
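How the optional clauses in the syntax combine can be sketched with a small Python helper that assembles a LOAD DATA statement as a string. The helper name `build_load_statement` and its parameters are hypothetical, introduced only for illustration. It does not connect to Hive and performs no escaping of its inputs.

```python
def build_load_statement(filepath, table, partition=None, local=False, overwrite=False):
    """Assemble a Hive LOAD DATA statement as a string (illustrative helper only)."""
    parts = ["LOAD DATA"]
    if local:
        # LOCAL: filepath refers to the local file system instead of HDFS
        parts.append("LOCAL")
    parts.append("INPATH '{}'".format(filepath))
    if overwrite:
        # OVERWRITE: existing contents of the target are replaced
        parts.append("OVERWRITE")
    parts.append("INTO TABLE {}".format(table))
    if partition:
        # partition is a dict such as {"openingtime": "201507"};
        # values are inserted verbatim, without quoting or escaping
        spec = ", ".join("{}={}".format(col, val) for col, val in partition.items())
        parts.append("PARTITION ({})".format(spec))
    return " ".join(parts) + ";"


if __name__ == "__main__":
    # Local file, replacing the contents of one partition
    print(build_load_statement("/data/tmp/score_7.txt", "score",
                               {"openingtime": "201507"}, local=True, overwrite=True))
    # File already on the cluster, added to an unpartitioned target
    print(build_load_statement("/tmp/input/score_8.txt", "score"))
```

The generated strings can be pasted into the Hive CLI. The paths and partition values are the ones used in the hands-on section of this article.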

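5: Hands-On

Building on the partitioned score table created in the previous article, we first prepare two text files, score_7 and score_8, holding the scores for July and August. The files can be downloaded from the link at the end of this article.

Because no delimiter was specified when the table was created, the fields in these two text files are separated by Hive's default field delimiter, Ctrl-A (\001).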
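Hive's default field delimiter for text tables is the Ctrl-A character (\001, written `\x01` in Python), which is awkward to type in an editor. The following is a minimal sketch of producing such a file programmatically. The function names and the sample rows are assumptions made for illustration; the real files are the ones linked at the end of the article.

```python
import os
import tempfile

FIELD_DELIMITER = "\x01"  # Ctrl-A, Hive's default field delimiter for text tables


def write_score_file(path, rows):
    """Write rows as Ctrl-A-delimited lines, one record per line."""
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        for row in rows:
            f.write(FIELD_DELIMITER.join(str(value) for value in row) + "\n")


def read_score_file(path):
    """Read the file back and split each line on the delimiter."""
    with open(path, encoding="utf-8", newline="\n") as f:
        return [line.rstrip("\n").split(FIELD_DELIMITER) for line in f]


if __name__ == "__main__":
    # Hypothetical sample rows in the column order (id, studentid, score)
    sample = [("001", 1, 80.5), ("002", 2, 92.0)]
    target = os.path.join(tempfile.mkdtemp(), "score_7.txt")
    write_score_file(target, sample)
    print(read_score_file(target))
```

Every value comes back as a string here. Converting each field to the column's declared type is done by Hive when the table is queried.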

First, place the files on the Linux host under /data/tmp.

Load the local data:
load data local inpath '/data/tmp/score_7.txt' overwrite into table score PARTITION (openingtime=201507);
Notice that 001 has become 1. This is because the column is of type int, so the value was converted to an integer.

Put score_8.txt onto the cluster:
su hdfs
hadoop fs -put score_8.txt /tmp/input
Load the data from the cluster:
load data inpath '/tmp/input/score_8.txt' overwrite into table score partition(openingtime=201508);

Importing Query Results from Other Tables

1: Syntax

Standard syntax:

INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...) [IF NOT EXISTS]] select_statement1 FROM from_statement;

INSERT INTO TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] select_statement1 FROM from_statement;

 

Hive extension (multiple inserts):

FROM from_statement

INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...) [IF NOT EXISTS]] select_statement1

[INSERT OVERWRITE TABLE tablename2 [PARTITION ... [IF NOT EXISTS]] select_statement2] 

[INSERT INTO TABLE tablename2 [PARTITION ...] select_statement2] ...;

FROM from_statement

INSERT INTO TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] select_statement1

[INSERT INTO TABLE tablename2 [PARTITION ...] select_statement2] 

[INSERT OVERWRITE TABLE tablename2 [PARTITION ... [IF NOT EXISTS]] select_statement2] ...;

 

Hive extension (dynamic partition inserts):

INSERT OVERWRITE TABLE tablename PARTITION (partcol1[=val1], partcol2[=val2] ...) select_statement FROM from_statement;

INSERT INTO TABLE tablename PARTITION (partcol1[=val1], partcol2[=val2] ...) select_statement FROM from_statement;

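2: OVERWRITE

With OVERWRITE, any existing data in the target table or partition is replaced by the query result.

3: INSERT INTO

This syntax is supported starting with Hive 0.8. It appends to the target table or partition and leaves the existing data intact.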
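The difference between the two forms can be illustrated with a toy Python model in which a table is a dictionary from partition value to a list of rows. This is only a sketch of the replace-versus-append semantics, not of how Hive stores data; the function names and the sample rows are made up for illustration.

```python
def insert_overwrite(table, partition, rows):
    """Model of INSERT OVERWRITE: the target partition's old rows are replaced."""
    table[partition] = list(rows)


def insert_into(table, partition, rows):
    """Model of INSERT INTO: new rows are appended, old rows stay intact."""
    table.setdefault(partition, []).extend(rows)


if __name__ == "__main__":
    # A table is modeled as {partition value: list of rows}; the data is made up.
    score = {"201509": [(1, 1, 60.0)]}

    insert_into(score, "201509", [(21, 1, 76.0)])
    print(score["201509"])  # the old row and the new row are both present

    insert_overwrite(score, "201509", [(22, 2, 45.0)])
    print(score["201509"])  # only the rows from the last statement remain
```

Other partitions of the table are untouched by either call, which mirrors the fact that both statements act only on the table or partition named in the statement.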

4: Hands-On

We create a table score1 with the same structure as score:

create table score1 (
  id        int,
  studentid int,
  score     double
)
partitioned by (openingtime string);

Insert data:

insert into table score1 partition (openingtime=201509) values (21,1,'76'),(22,2,'45');

We import the query result from score1 into score, targeting the 201509 partition:

insert overwrite table score partition (openingtime=201509) select id,studentid,score from score1;

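Dynamic Partition Inserts

1: Overview

Dynamic partition insert is really part of inserting query results from other tables, but it is useful enough to deserve its own section. It is supported starting with Hive 0.6.

2: Parameters

The dynamic partition parameters stay in effect for the lifetime of the session in which they are set, so the commands that change them are normally run just before the import.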

Property	Default	Note
hive.error.on.empty.partition	false	Whether to throw an exception if dynamic partition insert generates empty results
hive.exec.dynamic.partition	false	Needs to be set to true to enable dynamic partition inserts
hive.exec.dynamic.partition.mode	strict	In strict mode, the user must specify at least one static partition in case the user accidentally overwrites all partitions; in nonstrict mode all partitions are allowed to be dynamic
hive.exec.max.created.files	100000	Maximum number of HDFS files created by all mappers/reducers in a MapReduce job
hive.exec.max.dynamic.partitions	1000	Maximum number of dynamic partitions allowed to be created in total
hive.exec.max.dynamic.partitions.pernode	100	Maximum number of dynamic partitions allowed to be created in each mapper/reducer node
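The interaction of hive.exec.dynamic.partition and hive.exec.dynamic.partition.mode can be sketched as a small check in Python. This models only the two rules stated in the table, under the assumption that a partition spec is a dictionary in which a dynamic column has the value None. The function name and the error messages are hypothetical and are not Hive's own.

```python
def check_partition_spec(spec, mode="strict", dynamic_enabled=True):
    """Return the dynamic partition columns of spec, or raise ValueError.

    spec maps each partition column to its static value, or to None when
    the column is to be filled dynamically from the query result.
    """
    dynamic = [col for col, val in spec.items() if val is None]
    if dynamic and not dynamic_enabled:
        # models hive.exec.dynamic.partition=false
        raise ValueError("dynamic partition inserts are not enabled")
    if dynamic and mode == "strict" and len(dynamic) == len(spec):
        # models hive.exec.dynamic.partition.mode=strict
        raise ValueError("strict mode requires at least one static partition")
    return dynamic


if __name__ == "__main__":
    # One static and one dynamic column: accepted in strict mode
    print(check_partition_spec({"dt": "2008-06-08", "country": None}))
    # A single, fully dynamic column: accepted only in nonstrict mode
    print(check_partition_spec({"openingtime": None}, mode="nonstrict"))
```

A table with a single partition column, such as score in this article, can therefore only be loaded dynamically in nonstrict mode.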

3: Example from the Official Documentation

Let's look at the example from the Hive documentation:
FROM page_view_stg pvs
INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country)
       SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip, pvs.cnt
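Here the country partition is created dynamically from the value of the last column in the SELECT clause (pvs.cnt). Note that the column's name is not used: dynamic partition columns are matched by position. In nonstrict mode, the dt partition could also be created dynamically.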
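The positional matching used by dynamic partition inserts can be modeled in a few lines of Python: the trailing values of each result row are taken as partition values, and the row's remaining values are the data. This is a sketch of the idea only. The function name `route_rows` and the sample rows are hypothetical, and the `col=value` keys merely imitate the way partitions are usually named.

```python
from collections import defaultdict


def route_rows(rows, static_spec, dynamic_cols):
    """Group SELECT output rows by their target partition.

    The last len(dynamic_cols) values of each row are read, by position,
    as the dynamic partition values; the values before them are the data.
    """
    partitions = defaultdict(list)
    for row in rows:
        cut = len(row) - len(dynamic_cols)
        data, dynamic_values = row[:cut], row[cut:]
        spec = dict(static_spec)                      # static partitions first
        spec.update(zip(dynamic_cols, dynamic_values))
        key = "/".join("{}={}".format(col, val) for col, val in spec.items())
        partitions[key].append(tuple(data))
    return dict(partitions)


if __name__ == "__main__":
    # Made-up rows shaped like: select id, studentid, score, openingtime from score1
    rows = [(1, 1, 80.0, "201507"), (2, 2, 90.0, "201508"), (3, 3, 70.0, "201507")]
    for partition, data in route_rows(rows, {}, ["openingtime"]).items():
        print(partition, data)
```

Because only the position matters, reordering the columns in the SELECT list would silently send a different column's values into the partition key, which is why the partition column is always listed last.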

4: Hands-On

First we clear the data in the score table (3 partitions):
insert overwrite table score partition(openingtime=201507,openingtime=201508,openingtime=201509) select id,studentid,score from score where 1==0;
Load the July and August data into score1:
load data local inpath '/data/tmp/score_7.txt' overwrite into table score1 partition(openingtime=201507);
load data local inpath '/data/tmp/score_8.txt' overwrite into table score1 partition(openingtime=201508);
　　

Set the dynamic partition parameters:
set  hive.exec.dynamic.partition=true;   
set  hive.exec.dynamic.partition.mode=nonstrict;   
set  hive.exec.max.dynamic.partitions.pernode=10000; 
Import the data from score1 into score, letting Hive assign the partitions dynamically:
insert overwrite table score partition(openingtime) select id,studentid,score,openingtime from score1;

Inserting Values into Tables from SQL

1: Overview

This statement inserts values directly into a table.

2: Syntax

Standard Syntax:
INSERT INTO TABLE tablename [PARTITION (partcol1[=val1], partcol2[=val2] ...)] VALUES values_row [, values_row ...]
 
Where values_row is:
( value [, value ...] )
where a value is either null or any valid SQL literal

3: Example from the Official Documentation

CREATE TABLE students (name VARCHAR(64), age INT, gpa DECIMAL(3, 2))
  CLUSTERED BY (age) INTO 2 BUCKETS STORED AS ORC;
 
INSERT INTO TABLE students
  VALUES ('fred flintstone', 35, 1.28), ('barney rubble', 32, 2.32);
 
 
CREATE TABLE pageviews (userid VARCHAR(64), link STRING, came_from STRING)
  PARTITIONED BY (datestamp STRING) CLUSTERED BY (userid) INTO 256 BUCKETS STORED AS ORC;
 
INSERT INTO TABLE pageviews PARTITION (datestamp = '2014-09-23')
  VALUES ('jsmith', 'mail.com', 'sports.com'), ('jdoe', 'mail.com', null);
 
INSERT INTO TABLE pageviews PARTITION (datestamp)
  VALUES ('tjohnson', 'sports.com', 'finance.com', '2014-09-23'), ('tlee', 'finance.com', null, '2014-09-21');

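4: Hands-On

In the earlier example of importing data from another table, we created the table score1 and inserted data into it with a SQL statement. Here we simply repeat that step.

Insert data: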

insert into table score1 partition (openingtime=201509) values (21,1,'76'),(22,2,'45');

--------------------------------------------------------------------

This concludes the chapter.

Sample Data File Download

Github https://github.com/sinodzh/HadoopExample/tree/master/2016/hive%20test%20file

Series Index

[Source] Learning Hadoop from Scratch: Series Index

