Importing data into Hive with Sqoop

Date: 2019-11-18

Tags: sqoop, data import, hive. Category: Hadoop

1.1 The hive-import parameter

Passing --hive-import imports the data into Hive. However, the command below fails, with the following error:

sqoop import --connect jdbc:mysql://localhost:3306/test --username root --password 123456 --table person -m 1 --hive-import

16/07/22 02:22:58 ERROR tool.ImportTool: Encountered IOException running import job: org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory hdfs://192.168.223.129:9000/user/root/person already exists
    at org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:146)
    at org.apache.hadoop.mapreduce.JobSubmitter.checkSpecs(JobSubmitter.java:562)
    at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:432)

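The error occurs because a person directory already exists under the user's HDFS home directory (/user/root in the message above).

The underlying reason is that when importing into Hive, Sqoop first imports the data into a directory on HDFS, then loads it from there into Hive, and finally deletes that directory. If the directory already exists when the job starts, the job fails.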

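1.2 Using the target-dir parameter to specify a temporary directory

To solve the problem above, you can either delete the person directory or use --target-dir to specify a temporary directory:

sqoop import --connect jdbc:mysql://localhost:3306/test --username root --password 123456 --table person -m 1 --hive-import --target-dir temp

Once the import completes, the table is visible in Hive: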

hive> select * from person;
OK
1    zhangsan
2    LISI
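
The other option, deleting the leftover person directory, can be sketched as follows. This is only a minimal sketch: it assumes the hdfs command is on the PATH and that the relative path person resolves to the current user's HDFS home directory; the function name remove_if_exists is made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: remove a leftover HDFS directory before re-running the import.
# Assumption: `hdfs` is on PATH. `remove_if_exists` is a hypothetical helper name.

remove_if_exists() {
    # `hdfs dfs -test -d` exits 0 only if the path exists and is a directory.
    if hdfs dfs -test -d "$1"; then
        # Recursively delete the directory and its contents.
        hdfs dfs -rm -r "$1"
    fi
}

# Usage (relative paths resolve against the HDFS home directory):
# remove_if_exists person
```

After the directory is gone, the original import command without --target-dir should no longer hit FileAlreadyExistsException.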

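1.3 The hive-overwrite parameter

If the command above is run several times, the table ends up holding several copies of the data.

After running it three times, the data in Hive is: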

hive> select * from person;
OK
1    zhangsan
2    LISI
1    zhangsan
2    LISI
1    zhangsan
2    LISI
Time taken: 2.079 seconds, Fetched: 6 row(s)

In HDFS, this shows up as one extra file per run:

hive> dfs -ls /user/hive/warehouse/person;
Found 3 items
-rwxrwxrwt   3 18232184201 supergroup         18 2016-07-22 17:48 /user/hive/warehouse/person/part-m-00000
-rwxrwxrwt   3 18232184201 supergroup         18 2016-07-22 17:51 /user/hive/warehouse/person/part-m-00000_copy_1
-rwxrwxrwt   3 18232184201 supergroup         18 2016-07-22 17:52 /user/hive/warehouse/person/part-m-00000_copy_2
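
As a quick way to see how many copies have accumulated, the listing can be piped through a small counter. This is a rough sketch, assuming the part-m-* file naming shown above; the function name count_part_files is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: count the part files in an `hdfs dfs -ls` listing read from stdin.
# `count_part_files` is a hypothetical helper name, not part of Sqoop or Hadoop.

count_part_files() {
    # Count lines whose path contains a map-output file name (part-m-...),
    # which also matches the part-m-00000_copy_N files.
    grep -c '/part-m-'
}

# Usage (the warehouse path is the one from the listing above):
# hdfs dfs -ls /user/hive/warehouse/person | count_part_files
```

With -m 1 each import writes a single part file, so in this setup the count matches the number of times the import was run.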

To overwrite the table's existing data instead of appending to it, use the --hive-overwrite parameter:

sqoop import --connect jdbc:mysql://localhost:3306/test --username root --password 123456 --table person --hive-import --target-dir temp -m 1 --hive-overwrite

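1.4 The fields-terminated-by parameter

When data is imported from MySQL into HDFS, the default field delimiter is a comma.

When data is imported into Hive, the default is the Hive table's default field delimiter, \u0001 (Ctrl-A):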

Storage Desc Params:          
    field.delim             \u0001              
    line.delim              \n                  
    serialization.format    \u0001
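
Because \u0001 is a non-printing character, the raw files look as if the fields were run together. A minimal sketch for making the separator visible when inspecting a file, assuming a POSIX tr; the function name show_fields is made up for illustration:

```shell
#!/usr/bin/env bash
# Sketch: make Hive's default \001 (Ctrl-A) field separator visible.
# `show_fields` is a hypothetical helper name.

show_fields() {
    # Translate every \001 byte on stdin into a tab.
    tr '\001' '\t'
}

# Demo with a hand-made row in the default format:
# prints the two fields separated by a tab.
printf '1\001zhangsan\n' | show_fields

# Usage against a real file (path taken from the listing in section 1.3):
# hdfs dfs -cat /user/hive/warehouse/person/part-m-00000 | show_fields
```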

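To change the default delimiter, use the --fields-terminated-by parameter.

This parameter determines the table's delimiter when the table is first imported into Hive.

Now drop the table in Hive and run the import again:

sqoop import --connect jdbc:mysql://localhost:3306/test --username root --password 123456 --table person -m 1 --hive-import --fields-terminated-by "|"

Check the Hive table's delimiter again: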

Storage Desc Params:          
    field.delim             |                   
    line.delim              \n                  
    serialization.format    |