1. Official documentation for Sqoop incremental import
2. Testing Sqoop incremental import
In production, incremental imports usually need to run on a regular schedule, e.g. once a week. Since the same import is executed over and over, retyping the full command for every run is tedious. Sqoop offers a convenient solution for this: saved jobs (sqoop job).
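Once such a saved job exists (see step (1) further down), scheduling it is a one-line crontab entry. A sketch; the job name `jobname` and the log path are placeholders:

```shell
# Run the saved sqoop job every Monday at 02:00 (crontab entry;
# "jobname" and the log path are placeholders for your environment)
0 2 * * 1 sqoop job --exec jobname >> /var/log/sqoop-incr.log 2>&1
```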
The test exercises lastmodified mode, selected with --incremental. --check-column names the column to check (LASTMODIFIED): a row is imported only when that column shows it has been updated or newly inserted. --last-value sets the initial value, '2014/8/27 13:00:00', which serves as the lower bound of the first import; from the second run onward, sqoop automatically updates it to the upper bound of the previous import.
Starting the test: we use a sqoop saved job to implement the routine incremental import. First, create a test table oracletablename in the Oracle database and insert two rows:
select * from oracletablename;
id  name  lastmodified
1   张三  2015-10-10 17:52:20.0
2   李四  2015-10-10 17:52:20.0
(1) Create the sqoop job
sqoop job --create jobname -- import --connect jdbc:oracle:thin:@192.168.27.235:1521/orcl --username DATACENTER --password clear --table oracletablename --hive-import --hive-table hivetablename --incremental lastmodified --check-column LASTMODIFIED --last-value '2014/8/27 13:00:00'
Notes:
1) Do not specify -m in this job. With -m set, the import produces intermediate results in a corresponding directory on HDFS, so the next execution of the job fails with an "output directory is exist" error.
2) The table hivetablename above must already exist. For the first import, you can create it by importing the table structure of oracletablename into Hive with the following command:
sqoop create-hive-table --connect jdbc:oracle:thin:@//192.168.27.235:1521/ORCL --username DATACENTER --password clear --table tablename
Once this completes, Hive contains a table with the same name and the same structure.
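The one-time setup (create the Hive table, then create the saved job) can be wrapped in a small script. This sketch is a dry run: it only prints the two commands, with the connection values copied from the examples above; drop the final `echo` lines' indirection and run the variables' contents directly to execute them:

```shell
#!/bin/sh
# Dry-run sketch of the one-time setup; connection values copied
# from the examples in this post. Adjust for your environment.
CONNECT="jdbc:oracle:thin:@192.168.27.235:1521/orcl"
DBUSER="DATACENTER"
DBPASS="clear"

# 1) Create an empty Hive table mirroring the Oracle table's structure
CREATE_TABLE="sqoop create-hive-table --connect $CONNECT \
  --username $DBUSER --password $DBPASS --table oracletablename"

# 2) Create the saved job for the recurring incremental import
CREATE_JOB="sqoop job --create jobname -- import --connect $CONNECT \
  --username $DBUSER --password $DBPASS --table oracletablename \
  --hive-import --hive-table hivetablename \
  --incremental lastmodified --check-column LASTMODIFIED \
  --last-value '2014/8/27 13:00:00'"

# Dry run: print the commands instead of executing them
echo "$CREATE_TABLE"
echo "$CREATE_JOB"
```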
(2) Inspect and execute the job
After creating the job above, you can check that it was created successfully with the following commands:
sqoop job --list            lists all jobs
sqoop job --show jobname    shows the details of jobname
sqoop job --delete jobname  deletes jobname
sqoop job --exec jobname    executes jobname
(3) After the job has run, check whether the Hive table contains the data. Barring the unexpected, it certainly will.
During execution, the corresponding log output looks like this:
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
15/10/12 15:59:37 INFO manager.OracleManager: Time zone has been set to GMT
15/10/12 15:59:37 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM TEMP2 t WHERE 1=0
15/10/12 15:59:37 INFO tool.ImportTool: Incremental import based on column LASTMODIFIED
15/10/12 15:59:37 INFO tool.ImportTool: Lower bound value: TO_TIMESTAMP('2014/8/27 13:00:00', 'YYYY-MM-DD HH24:MI:SS.FF')
15/10/12 15:59:37 INFO tool.ImportTool: Upper bound value: TO_TIMESTAMP('2015-10-12 15:59:35.0', 'YYYY-MM-DD HH24:MI:SS.FF')
15/10/12 15:59:37 WARN manager.OracleManager: The table TEMP2 contains a multi-column primary key. Sqoop will default to the column ID only for this job.
15/10/12 15:59:37 INFO manager.OracleManager: Time zone has been set to GMT
15/10/12 15:59:37 WARN manager.OracleManager: The table TEMP2 contains a multi-column primary key. Sqoop will default to the column ID only for this job.
15/10/12 15:59:37 INFO mapreduce.ImportJobBase: Beginning import of TEMP2
15/10/12 15:59:37 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
15/10/12 15:59:37 INFO manager.OracleManager: Time zone has been set to GMT
15/10/12 15:59:37 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
15/10/12 15:59:37 INFO client.RMProxy: Connecting to ResourceManager at hadoop3/192.168.27.233:8032
15/10/12 15:59:42 INFO db.DBInputFormat: Using read commited transaction isolation
15/10/12 15:59:42 INFO db.DataDrivenDBInputFormat: BoundingValsQuery: SELECT MIN(ID), MAX(ID) FROM TEMP2 WHERE ( LASTMODIFIED >= TO_TIMESTAMP('2014/8/27 13:00:00', 'YYYY-MM-DD HH24:MI:SS.FF') AND LASTMODIFIED < TO_TIMESTAMP('2015-10-12 15:59:35.0', 'YYYY-MM-DD HH24:MI:SS.FF') )
15/10/12 15:59:42 INFO mapreduce.JobSubmitter: number of splits:4
Note: from the bound values in the log above, it is quite clear how sqoop performs the import. We can see that the configured --last-value becomes the lower bound of the query.
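The windowing the log reveals can be mimicked in a tiny stand-alone sketch (illustration only, not sqoop's internals): rows whose check column falls in the half-open window [last-value, now) are selected, and last-value then advances to the upper bound. Timestamps are normalized here to "YYYY-MM-DD HH:MM:SS" so plain string comparison orders them correctly:

```shell
#!/bin/sh
# Illustration only: lastmodified mode selects rows whose check column
# falls in [last-value, now), then advances last-value to the upper
# bound. Timestamps use a fixed "YYYY-MM-DD HH:MM:SS" form so string
# comparison orders them correctly.
LAST_VALUE="2014-08-27 13:00:00"   # lower bound (the stored last-value)
NOW="2015-10-12 15:59:35"          # upper bound (taken at run time)

# Sample rows, matching the test table: id,name,lastmodified
ROWS="1,张三,2015-10-10 17:52:20
2,李四,2015-10-10 17:52:20"

# Select the rows inside the half-open window [LAST_VALUE, NOW)
SELECTED=$(printf '%s\n' "$ROWS" | \
    awk -F',' -v lo="$LAST_VALUE" -v hi="$NOW" '$3 >= lo && $3 < hi')
echo "$SELECTED"

# After the run, the upper bound becomes the next lower bound
LAST_VALUE="$NOW"
echo "next last-value: $LAST_VALUE"
```

Both sample rows fall inside the first window, which matches the first run importing both of them.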
(4) Insert a new row into oracletablename in the Oracle database:
id  name  lastmodified
1   张三  2015-10-10 17:52:20.0
2   李四  2015-10-10 17:52:20.0
3   李四  2015-10-12 16:01:23.0
(5) Now perform the incremental import,
i.e. execute the job once more: sqoop job --exec jobname
The log for this run reads as follows:
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
15/10/12 16:02:17 INFO manager.OracleManager: Time zone has been set to GMT
15/10/12 16:02:17 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM TEMP2 t WHERE 1=0
15/10/12 16:02:17 INFO tool.ImportTool: Incremental import based on column LASTMODIFIED
15/10/12 16:02:17 INFO tool.ImportTool: Lower bound value: TO_TIMESTAMP('2015-10-12 15:59:35.0', 'YYYY-MM-DD HH24:MI:SS.FF')
15/10/12 16:02:17 INFO tool.ImportTool: Upper bound value: TO_TIMESTAMP('2015-10-12 16:02:15.0', 'YYYY-MM-DD HH24:MI:SS.FF')
15/10/12 16:02:17 WARN manager.OracleManager: The table TEMP2 contains a multi-column primary key. Sqoop will default to the column ID only for this job.
15/10/12 16:02:17 INFO manager.OracleManager: Time zone has been set to GMT
15/10/12 16:02:17 WARN manager.OracleManager: The table TEMP2 contains a multi-column primary key. Sqoop will default to the column ID only for this job.
15/10/12 16:02:17 INFO mapreduce.ImportJobBase: Beginning import of TEMP2
15/10/12 16:02:17 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
15/10/12 16:02:17 INFO manager.OracleManager: Time zone has been set to GMT
15/10/12 16:02:17 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
15/10/12 16:02:17 INFO client.RMProxy: Connecting to ResourceManager at hadoop3/192.168.27.233:8032
15/10/12 16:02:23 INFO db.DBInputFormat: Using read commited transaction isolation
15/10/12 16:02:23 INFO db.DataDrivenDBInputFormat: BoundingValsQuery: SELECT MIN(ID), MAX(ID) FROM TEMP2 WHERE ( LASTMODIFIED >= TO_TIMESTAMP('2015-10-12 15:59:35.0', 'YYYY-MM-DD HH24:MI:SS.FF') AND LASTMODIFIED < TO_TIMESTAMP('2015-10-12 16:02:15.0', 'YYYY-MM-DD HH24:MI:SS.FF') )
15/10/12 16:02:23 WARN db.BigDecimalSplitter: Set BigDecimal splitSize to MIN_INCREMENT
15/10/12 16:02:23 INFO mapreduce.JobSubmitter: number of splits:1
Note: the log shows that --last-value is automatically updated to the upper bound of the previous run; just compare it against the upper bound in the first run's log.
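To see the checkpoint that will serve as the next lower bound, the saved job's stored parameters can be inspected. A sketch; note that the exact property name holding the last value can differ between sqoop versions:

```shell
# Print the job's stored parameters and filter for the saved last
# value; the property name may vary across sqoop versions.
sqoop job --show jobname | grep -i "last"
```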