Hive Study Notes

Date: 2019-11-26


1. Hive Indexes

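Why create an index?
The purpose of a Hive index is to speed up queries on specific columns of a Hive table.
Without an index, a query such as 'WHERE tab1.col1 = 10' makes Hive load the entire table or partition and process every row. If an index exists on col1, only part of the files needs to be loaded and processed. As in traditional databases, an index speeds up queries at the cost of extra resources to build it and extra disk space to store it.
Indexing was added in Hive 0.7.0, and bitmap indexes were added in Hive 0.8.0. Indexing was removed again in Hive 3.0, so the statements in this section apply only to earlier versions.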

Usage:

Create an index (user_index is the index name, user the table, and id the indexed column):

CREATE INDEX user_index ON TABLE user(id)
AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler' WITH DEFERRED REBUILD;

Rebuild the index:

ALTER INDEX user_index ON user REBUILD;

Drop the index:

DROP INDEX user_index on user;

Show the indexes:

SHOW INDEX on user;

2. Buckets

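Hive can also organize a table or partition into buckets, commonly to make joins on the bucketed column and data sampling more efficient. A bucketed table is declared as follows: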

CREATE TABLE bucketed_user(id INT,name String) 
CLUSTERED BY (id) INTO 4 BUCKETS;

Data within a partition can be further split into buckets: first PARTITIONED BY (stat_date string), then CLUSTERED BY (id) SORTED BY (age) INTO 2 BUCKETS.

-- videos is an assumed name for the source table
INSERT OVERWRITE TABLE videos_b
PARTITION(year=1999)
SELECT producer,title,string FROM videos WHERE year=2009 CLUSTER BY title;
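
As a rough illustration of how rows end up in buckets: in classic Hive versions the CLUSTERED BY column is hashed and the non-negative hash is taken modulo the number of buckets, and for an INT column the hash is the value itself. The sketch below mimics that rule in plain Java for the 4-bucket bucketed_user table. The class and method names are made up for illustration, this is not Hive's actual code, and newer Hive versions may use a different hash function.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class BucketSketch {
    // Assumed rule: non-negative hash of the clustering column, modulo the bucket count.
    public static int bucketFor(int id, int numBuckets) {
        int hash = Integer.hashCode(id); // for an int this is the value itself
        return (hash & Integer.MAX_VALUE) % numBuckets;
    }

    // Groups the given ids by the bucket number they map to.
    public static Map<Integer, List<Integer>> assign(int[] ids, int numBuckets) {
        Map<Integer, List<Integer>> buckets = new TreeMap<>();
        for (int id : ids) {
            buckets.computeIfAbsent(bucketFor(id, numBuckets), k -> new ArrayList<>()).add(id);
        }
        return buckets;
    }

    public static void main(String[] args) {
        int[] ids = {1, 2, 3, 4, 5, 6, 7, 8};
        System.out.println(assign(ids, 4));
    }
}
```

Under this rule, ids 1 and 5 fall into the same bucket, which is why a join or a sample on id only needs to read the matching bucket files.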

3. Views

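With views, users can focus on the data they care about rather than on all the data, which makes them more productive. Views also help when the data comes from several base tables, or partly from other views, and the search conditions are complex: the query would otherwise be tedious to write, whereas a view keeps it simple. A view hides the joins between tables and the search conditions from the user, who only needs to run a simple query against the view. This improves data security, but it does not improve query performance.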

Base tables:

create table student(id int, name string, age int, class_id int);
create table classes(id int, class_name string);

Create the view:

create view stu_cla as select a.id, a.name, a.age, a.class_id, b.class_name from student a join classes b on a.class_id=b.id;

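View mechanisms:

Views can be processed in two ways: substitution and materialization.

Substitution: when the view is queried, the view name is replaced directly by the view definition, so the query becomes select * from (select c.name as c_name ,s.name as stu_name from student s,class c where c.id = s.class_id), which is then submitted to MySQL for execution.

Materialization: MySQL first computes the result of the view and keeps it temporarily in memory as an intermediate result (a temporary table). The outer select statement then reads from that intermediate result.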

4. Data Types

1.struct

Fields inside a struct are accessed with dot (.) notation.

Usage:

create table stu_test(id int, info struct<name:string, age:int>) row format delimited fields terminated by ',' collection items terminated by ":";
# Test data to import (contents of /opt/hive-test/stu.txt):
1,zhou:30
2,yan:30
3,chen:20
4,li:80
5,wei:18
load data local inpath '/opt/hive-test/stu.txt' overwrite into table stu_test;
select info.name from stu_test;

'FIELDS TERMINATED BY': the delimiter between fields
'COLLECTION ITEMS TERMINATED BY': the delimiter between the items within a single field
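
To make the two delimiters concrete, here is a minimal plain-Java sketch of how one line of stu.txt breaks down into the id column and the info struct. It only imitates the splitting described above and is not how Hive's SerDe is implemented; the class name StructRowSketch and its Row type are hypothetical.

```java
public class StructRowSketch {
    // Parsed form of one stu_test row: id plus the fields of the info struct.
    public static final class Row {
        public final int id;
        public final String name;
        public final int age;

        Row(int id, String name, int age) {
            this.id = id;
            this.name = name;
            this.age = age;
        }
    }

    public static Row parse(String line) {
        String[] fields = line.split(",", -1);     // FIELDS TERMINATED BY ','
        String[] items = fields[1].split(":", -1); // COLLECTION ITEMS TERMINATED BY ':'
        return new Row(Integer.parseInt(fields[0]), items[0], Integer.parseInt(items[1]));
    }

    public static void main(String[] args) {
        Row row = parse("1,zhou:30");
        // row.name plays the role of info.name in the Hive query
        System.out.println(row.id + " " + row.name + " " + row.age);
    }
}
```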

2.Array

Usage:

create table class_test(name string, student_list array<int>) row format delimited fields terminated by ',' collection items terminated by ':';
# Data (contents of /opt/hive-test/class.txt):
Class 1,1:2:3:4
Class 2,11:12:13
Class 3,21:22:23
load data local inpath '/opt/hive-test/class.txt' overwrite into table class_test;
select  student_list[0] from class_test;

3.Map

Usage:

create table student_map(id string, info map<string,string>) row format delimited fields terminated by '\t' collection items terminated by ',' map keys terminated by ':';
# Data (contents of /opt/hive-test/stu_map.txt; the id and the map are separated by a tab):
1       chineses:100,english:100,math:110
2       chineses:90,english:91,math:111
load data local inpath '/opt/hive-test/stu_map.txt' overwrite into table student_map;
select info['math'] from student_map;
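
The three delimiters of the map table can be pictured with a short plain-Java sketch that splits one data row the same way: tab between columns, comma between map entries, colon between key and value. This is only an illustration of the row format, not Hive's parsing code, and MapRowSketch is a hypothetical name.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class MapRowSketch {
    // Returns the info map of one student_map row.
    public static Map<String, String> parseInfo(String line) {
        String[] fields = line.split("\t", -1);    // FIELDS TERMINATED BY '\t'
        Map<String, String> info = new LinkedHashMap<>();
        for (String entry : fields[1].split(",")) { // COLLECTION ITEMS TERMINATED BY ','
            String[] kv = entry.split(":", 2);      // MAP KEYS TERMINATED BY ':'
            info.put(kv[0], kv[1]);
        }
        return info;
    }

    public static void main(String[] args) {
        // Sample row copied from the data above, with a real tab after the id.
        Map<String, String> info = parseInfo("1\tchineses:100,english:100,math:110");
        System.out.println(info.get("math")); // counterpart of info['math']
    }
}
```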

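5. Running Hive

There are roughly three ways to run Hive scripts:
1. In the interactive Hive console;
2. With hive -e "SQL", which runs the statement passed on the command line;
3. With hive -f <SQL file>, which runs the statements in a file.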

6. Extension Interfaces

1.cli

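hive.cli.print.header: when set to true, column names are printed along with the query results. The default is false.

hive.cli.print.current.db: when set to true, the name of the current database is shown in the CLI prompt. The default is false.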

7. Custom Functions

1.UDF

Maven dependencies:

<dependency>
    <groupId>org.apache.hive</groupId>
    <artifactId>hive-exec</artifactId>
    <version>2.1.1</version>
</dependency>
<dependency>
    <groupId>org.apache.hive</groupId>
    <artifactId>hive-common</artifactId>
    <version>2.1.1</version>
</dependency>

UDF implementation:

package com.qf58.bdp.hive.udf;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.hive.serde2.ByteStream;
import org.apache.hadoop.hive.serde2.io.DoubleWritable;
import org.apache.hadoop.hive.serde2.lazy.LazyInteger;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;

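/**
 * Description: UDF that adds two values. It is overloaded for INT, DOUBLE,
 * FLOAT and for strings that hold integers.
 *
 * @Author: weishenpeng
 * Date: 2018/1/25
 * Time: 11:44 AM
 */
public class OperationAddUDF extends UDF {
	// Reusable buffer for writing the integer result of the Text overload as UTF-8
	private final ByteStream.Output out = new ByteStream.Output();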

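	/**
	 * Adds two INT values.
	 *
	 * @param num1 first operand, may be null
	 * @param num2 second operand, may be null
	 * @return the sum, or null if either operand is null
	 */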
	public IntWritable evaluate(IntWritable num1, IntWritable num2) {
		if (num1 == null || num2 == null) {
			return null;
		}
		return new IntWritable(num1.get() + num2.get());
	}

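	/**
	 * Adds two DOUBLE values.
	 *
	 * @param num1 first operand, may be null
	 * @param num2 second operand, may be null
	 * @return the sum, or null if either operand is null
	 */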
	public DoubleWritable evaluate(DoubleWritable num1, DoubleWritable num2) {
		if (num1 == null || num2 == null) {
			return null;
		}
		return new DoubleWritable(num1.get() + num2.get());
	}

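	/**
	 * Adds two FLOAT values.
	 *
	 * @param num1 first operand, may be null
	 * @param num2 second operand, may be null
	 * @return the sum, or null if either operand is null
	 */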
	public FloatWritable evaluate(FloatWritable num1, FloatWritable num2) {
		if (num1 == null || num2 == null) {
			return null;
		}
		return new FloatWritable(num1.get() + num2.get());
	}

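	/**
	 * Adds two strings that hold integers and returns the sum as text.
	 *
	 * @param num1 first operand as text, may be null
	 * @param num2 second operand as text, may be null
	 * @return the sum as text, or null if either operand is null or not an integer
	 */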
	public Text evaluate(Text num1, Text num2) {
		if (num1 == null || num2 == null) {
			return null;
		}
		try {
			Integer n1 = Integer.valueOf(num1.toString());
			Integer n2 = Integer.valueOf(num2.toString());
			Integer result = n1 + n2;
			out.reset();
			LazyInteger.writeUTF8NoException(out, result);
			Text text = new Text();
			text.set(out.getData(), 0, out.getLength());
			return text;
		} catch (Exception e) {
			return null;
		}
	}
}

Add the JAR file to the classpath:

add jar /opt/hive-test/hive-udf-addUDF.jar;

Create the function addUDF:

create temporary function addUDF as 'com.qf58.bdp.hive.udf.OperationAddUDF';

Drop the function addUDF:

drop temporary function if exists addUDF;

Use the function:

select addUDF(id, age) as ddd from student;
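
The null handling of the Text overload can be checked without a Hive installation. The sketch below is a plain-Java stand-in that uses String instead of Text and follows the same rules as evaluate(Text, Text) in OperationAddUDF: a null operand gives null, and an operand that is not an integer gives null. AddTextSketch and addText are hypothetical names and are not part of the UDF.

```java
public class AddTextSketch {
    // Stand-in for evaluate(Text, Text): same null and parse-failure rules, on plain strings.
    public static String addText(String num1, String num2) {
        if (num1 == null || num2 == null) {
            return null;
        }
        try {
            int n1 = Integer.parseInt(num1);
            int n2 = Integer.parseInt(num2);
            return Integer.toString(n1 + n2);
        } catch (NumberFormatException e) {
            // Non-integer input is mapped to null rather than failing the query
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(addText("1", "2"));
        System.out.println(addText("a", "2"));
        System.out.println(addText(null, "2"));
    }
}
```

If the UDF behaves the same way, addUDF('1', '2') should yield '3', while a non-numeric or NULL argument should yield NULL.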

8. Interacting with the Host Environment from Hive

1.Linux commands

Format:
Prefix the Linux command with ! (exclamation mark) and end it with ; (semicolon).
Example:

!ls;
!pwd;

2.HDFS commands

Format:

HDFS commands start with dfs and end with a semicolon.

Example:

dfs -ls /;
dfs -mkdir /hive123;
