Hive in Depth

Date: 2019-11-12

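I. HiveServer2 and beeline　　--> JDBC interface for front ends

　　1) Start the server with bin/hiveserver2, then start the client and connect:

　　　　bin/beeline

　　　　!connect jdbc:hive2://localhost:10000 user passwd org.apache.hive.jdbc.HiveDriver

　　2) bin/beeline -u jdbc:hive2://localhost:10000/database

　　3) JDBC access

　　　　Analysis results are stored in a Hive table (result), and the front end queries them through DAO code　　--> JDBC has some concurrency problems that need to be handled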

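II. Common data compression in Hive

　　1) Install snappy　　yum -y install snappy snappy-devel

　　2) Compile the Hadoop source with snappy support

　　　　　　mvn package -Pdist,native -DskipTests -Dtar -Drequire.snappy

　　　　　　Replace the lib/native directory of the Hadoop installation with the build output in /opt/moduels/hadoop-2.5.0-src/target/hadoop-2.5.0/lib/native

　　3) Set the parameters and test: bin/yarn jar share/... wordcount -Dcompress=true -Dcodec=snappy inputpath outputpath

　　　　　　(-Dcompress and -Dcodec look like shorthand; on Hadoop 2.x the map-output properties are mapreduce.map.output.compress and mapreduce.map.output.compress.codec)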

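III. Hive data storage

　　1) Storage formats　　--> how files are laid out on disk (not the table schema)

　　　　　　Row-oriented　　SEQUENCEFILE (serialized), TEXTFILE (default)

　　　　　　Column-oriented　　RCFILE, ORC, PARQUET

　　2) Compression　　stored as orc tblproperties ("orc.compress"="SNAPPY");　　--> note the uppercase value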
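As a rough illustration of the row-oriented versus column-oriented distinction above, the following Python sketch lays out the same records both ways. It is a toy model, not how ORC or Parquet actually encode data; the records and function names are made up for the example.

```python
# Toy illustration of row-oriented vs column-oriented layout.
# The records are made up; real ORC/Parquet files add stripes, indexes,
# and encodings that this sketch ignores.

records = [
    ("10.0.0.1", "200", 512),
    ("10.0.0.2", "404", 128),
    ("10.0.0.3", "200", 2048),
]
columns = ("remote_addr", "status", "body_bytes_sent")


def to_row_layout(rows):
    """Row layout: all fields of one record are stored next to each other."""
    return [field for row in rows for field in row]


def to_column_layout(rows):
    """Column layout: all values of one column are stored next to each other."""
    return {name: [row[i] for row in rows] for i, name in enumerate(columns)}


row_layout = to_row_layout(records)
column_layout = to_column_layout(records)

# A query that needs one column only has to read that column's values.
total_bytes = sum(column_layout["body_bytes_sent"])
print(row_layout)
print(column_layout["status"])
print(total_bytes)
```

Values of the same column tend to resemble each other, which is one reason columnar formats usually compress well.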

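IV. Hive optimization

　　1) FetchTask　　--> fetch the data directly instead of launching a MapReduce job

　　　　　　set hive.fetch.task.conversion=more

　　2) Split a large table into sub-tables　　--> create the sub-tables with create table ... as select statements

　　3) External tables and partitioned tables

　　　　External tables　　--> used when several projects analyze the same data; the storage path usually has to be specified explicitly

　　　　Partitioned tables　　--> typically partitioned by time; multi-level partitioning is possible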
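To make multi-level time partitioning concrete, here is a small Python sketch of how partition values map to directories. Hive stores each partition under a key=value subdirectory of the table location; the table location and partition column names below are hypothetical.

```python
# Sketch of how multi-level partitions map to directories.
# The table location and the partition columns are hypothetical.
import posixpath


def partition_path(table_location, partition_spec):
    """Build the directory for one partition from ordered (key, value) pairs."""
    parts = ["{}={}".format(key, value) for key, value in partition_spec]
    return posixpath.join(table_location, *parts)


path = partition_path(
    "/user/hive/warehouse/bf_log_part",      # hypothetical table location
    [("date", "20150831"), ("hour", "00")],  # two-level time partitioning
)
print(path)
```

A query that filters on the partition columns only needs to read the matching directories, which is where the speed-up comes from.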

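　　4) Data format: storage format and compression

　　5) SQL statement optimization: join, filter　　--> set hive.auto.convert.join=true;

　　　　　　Common/Shuffle/Reduce Join　　--> the join happens in the Reduce Task; used for large table to large table

　　　　　　Map Join　　--> the join happens in the Map Task; used for large table to small table

　　　　　　　　The large table's data is read from files

　　　　　　　　The small table's data is held in memory　　--> implemented with the DistributedCache class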
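The Map Join idea described above can be sketched in a few lines of Python: the small table is loaded into an in-memory lookup once, and the large table is streamed past it. This is only a model of the technique, not Hive's implementation; the table contents and the function name are made up.

```python
# Sketch of a map-side join: hash the small table in memory, stream the
# large table. Table contents are made up for illustration.

def map_join(large_rows, small_rows):
    """Inner join on the first field of each (key, value) row."""
    # Build phase: hash the small table (the copy every map task receives).
    lookup = {}
    for key, value in small_rows:
        lookup.setdefault(key, []).append(value)

    # Probe phase: each map task only scans its own share of the large table.
    for key, payload in large_rows:
        for value in lookup.get(key, []):
            yield (key, payload, value)


small = [("200", "OK"), ("404", "Not Found")]
large = [("200", "/index"), ("404", "/missing"), ("200", "/course"), ("500", "/err")]

joined = list(map_join(large, small))
print(joined)
```

No shuffle or reduce step is needed because every probe is answered locally, which is why this only works while the small table fits in memory.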

　　　　　　SMB Join　　--> Sort-Merge-Bucket Join　　--> hive.auto.convert.sortmerge.join, hive.optimize.bucketmapjoin, hive.optimize.bucketmapjoin.sortedmerge

　　　　　　　　When creating the table, use clustered by (...) sorted by (...) into num buckets　　--> rows are distributed across num buckets by hashing the clustered by column,

　　　　　　　　　　and the data in each bucket is sorted on that column

　　　　　　　　The join is then performed bucket by bucket
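A minimal Python sketch of the bucket-by-bucket join described above. It assumes integer join keys and models bucket assignment as key % num_buckets, which should be read as an assumption about the hashing rather than a specification; all names and data are hypothetical.

```python
# Sketch of bucketing plus a sort-merge join of matching buckets.
# Bucket assignment is modelled as key % num_buckets for integer keys.

def bucketize(rows, num_buckets):
    """Distribute (key, value) rows into buckets and sort each bucket by key."""
    buckets = [[] for _ in range(num_buckets)]
    for key, value in rows:
        buckets[key % num_buckets].append((key, value))
    return [sorted(bucket) for bucket in buckets]


def merge_join(left, right):
    """Inner-join two key-sorted lists by advancing a cursor on each side."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i][0] < right[j][0]:
            i += 1
        elif left[i][0] > right[j][0]:
            j += 1
        else:
            key = left[i][0]
            # Emit one row per matching right-side row, then move left on.
            k = j
            while k < len(right) and right[k][0] == key:
                out.append((key, left[i][1], right[k][1]))
                k += 1
            i += 1
    return out


def smb_join(left_rows, right_rows, num_buckets):
    """Join bucket n of one table only with bucket n of the other."""
    result = []
    pairs = zip(bucketize(left_rows, num_buckets), bucketize(right_rows, num_buckets))
    for left, right in pairs:
        result.extend(merge_join(left, right))
    return result


users = [(1, "ann"), (2, "bob"), (3, "cat"), (4, "dan")]
visits = [(4, "/a"), (1, "/b"), (1, "/c"), (5, "/d")]
result = smb_join(users, visits, num_buckets=2)
print(result)
```

Both tables must use the same bucketing column and compatible bucket counts, otherwise matching keys could land in buckets that are never compared.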

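　　6) Hive execution plan

　　　　　　explain [extended|dependency|authorization] SQL statement;

　　7) Parallel execution　　--> jobs with no dependencies between them can run in parallel

　　　　hive.exec.parallel.thread.number (<20)　　hive.exec.parallel

　　8) JVM reuse　　--> Map Tasks and Reduce Tasks run in JVMs; with reuse, several tasks run in one JVM instead of starting a new JVM for each task

　　　　mapreduce.job.jvm.numtasks (do not set it too high, <9)

　　9) Number of reducers　　--> mapreduce.job.reduces

　　10) Speculative execution　　--> when data skew makes a task run for a long time, the framework assumes the task is in trouble and launches a second attempt of the same task, taking the result of whichever finishes first; turn it off when running SQL statements

　　　　hive.mapred.reduce.tasks.speculative.execution (default true)

　　　　mapreduce.map.speculative　　mapreduce.reduce.speculative

　　11) Number of maps　　--> hive.merge.size.per.task (set according to the block size; this is the size of merged files at the end of a job, which affects how many map tasks later jobs need)

　　12) Dynamic partition tuning　　--> lets a partitioned table be partitioned automatically

　　13) SQL statement checking: nonstrict, strict　　set hive.mapred.mode;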

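V. Hive case study

　　1) Log analysis

　　　　a. Create the source table

　　　　　　Irregular source data　　--> parse it with a regular expression　　or　　preprocess it with MapReduce

　　　　　　　　create table tablename()

　　　　　　　　　　row format serde 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'　--> the serialization/deserialization class

　　　　　　　　　　with serdeproperties("input.regex"="regular expression","output.format.string"="format expression")

　　　　　　　　　　stored as textfile;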

　　　　　　(\"[^ ]*\") (\"-|[^ ]*\") (\"[^\]]*\") (\"[^\"]*\") (\"[0-9]*\") (\"[0-9]*\") (-|[^ ]*) (\"[^ ]*\") (\"[^\"]*\") (-|[^ ]*) (\"[^ ]*\")

create table if not exists bf_log_src(
    remote_addr string,
    remote_user string,
    time_local string,
    request string,
    status string,
    body_bytes_sent string,
    request_body string,
    http_referer string,
    http_user_agent string,
    http_x_forwarded_for string,
    host string)
row format serde 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
with serdeproperties("input.regex"="(\"[^ ]*\") (\"-|[^ ]*\") (\"[^\]]*\") (\"[^\"]*\") (\"[0-9]*\") (\"[0-9]*\") (-|[^ ]*) (\"[^ ]*\") (\"[^\"]*\") (-|[^ ]*) (\"[^ ]*\")")
stored as textfile;

Note: with the contrib class, creating the source table and loading data into it works, but loading the data into a sub-table afterwards makes the map tasks fail. Do not use this class! The built-in org.apache.hadoop.hive.serde2.RegexSerDe is the usual alternative.
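To see what the regular expression above captures, the following Python sketch applies the same pattern to a single log line. The sample line is invented for illustration and is not taken from the original data set; the sketch also assumes that the whole line has to match the pattern.

```python
import re

# Python form of the input.regex above. In the Hive DDL the quotes are
# written as \" because the pattern sits inside a string literal.
LOG_PATTERN = re.compile(
    r'("[^ ]*") ("-|[^ ]*") ("[^\]]*") ("[^"]*") ("[0-9]*") ("[0-9]*") '
    r'(-|[^ ]*) ("[^ ]*") ("[^"]*") (-|[^ ]*) ("[^ ]*")'
)

COLUMNS = [
    "remote_addr", "remote_user", "time_local", "request", "status",
    "body_bytes_sent", "request_body", "http_referer", "http_user_agent",
    "http_x_forwarded_for", "host",
]


def parse_line(line):
    """Return a column -> value dict, or None if the line does not match."""
    match = LOG_PATTERN.fullmatch(line)
    if match is None:
        return None
    return dict(zip(COLUMNS, match.groups()))


# Invented sample line in the quoted-field layout the pattern expects.
sample = (
    '"10.0.0.1" "-" "31/Aug/2015:00:04:37 +0800" '
    '"GET /index.html HTTP/1.1" "200" "512" - "http://example.com/" '
    '"Mozilla/5.0 (X11; Linux x86_64)" "-" "example.com"'
)

fields = parse_line(sample)
print(fields["remote_addr"], fields["time_local"], fields["status"])
```

The captured values still carry their surrounding double quotes, which is why the cleaning step later needs to strip them.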

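　　　　b. Create different sub-tables for different business needs

　　　　　　Storage format　　--> orcfile/parquet

　　　　　　Data compression

　　　　　　Compress the intermediate map output　　--> snappy

　　　　　　Use external tables (and make them partitioned)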

create table if not exists bf_log_comm(
    remote_addr string,
    time_local string,
    request string,
    http_referer string)
row format delimited fields terminated by '\t'
stored as orc tblproperties ("orc.compress"="SNAPPY");

insert into table bf_log_comm select remote_addr,time_local,request,http_referer from bf_log_src;

select * from bf_log_comm limit 5;

　　　　c. Clean the data

　　　　　　Process the source table data with custom UDFs

　　　　　　　　First UDF: strip the quotation marks

　　　　　　　　Second UDF: convert the date and time format
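Real Hive UDFs are written as Java classes, so the Python below only sketches the logic the two UDFs would carry. The target time format yyyyMMddHHmmss is an assumption, inferred from the later query that reads the hour with substring(time_local,9,2); the function names and the sample value are hypothetical.

```python
from datetime import datetime


def remove_quotes(value):
    """Logic of the first UDF: drop every double quote; None stays None."""
    if value is None:
        return None
    return value.replace('"', "")


def convert_time(value):
    """Logic of the second UDF: '31/Aug/2015:00:04:37 +0800' -> '20150831000437'.

    Returns None for input that cannot be parsed, so one bad row does not
    abort the whole job.
    """
    if value is None:
        return None
    try:
        parsed = datetime.strptime(remove_quotes(value), "%d/%b/%Y:%H:%M:%S %z")
    except ValueError:
        return None
    return parsed.strftime("%Y%m%d%H%M%S")


cleaned_addr = remove_quotes('"10.0.0.1"')
cleaned_time = convert_time('"31/Aug/2015:00:04:37 +0800"')
print(cleaned_addr, cleaned_time)
```

With this format, characters 9 and 10 of the converted value (counting from 1) hold the hour, which is what the hourly query depends on.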

　　　　d. Analyze the data with SQL statements

　　　　　　desc function extended substring;

　　　　Count visits per time period (hour), sorted in descending order

select t.hour,count(*) cnt from
(select substring(time_local,9,2) hour from bf_log_comm) t
group by t.hour
order by cnt desc;

　　　　Count visits per IP region (a UDF should really be used to extract the first two octets of the IP beforehand)

select t.prex_ip,count(*) cnt from
(select substring(remote_addr,1,7) prex_ip from bf_log_comm) t
group by t.prex_ip
order by cnt desc
limit 5;
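The query above takes the first seven characters of the address, which only lines up with the first two octets when both have three digits. A hedged Python sketch of the extraction logic a UDF could use instead is shown below; the function name and the sample addresses are made up.

```python
from collections import Counter


def ip_prefix(remote_addr, octets=2):
    """Return the first `octets` dot-separated fields of an IPv4 address."""
    parts = remote_addr.strip('"').split(".")
    if len(parts) != 4:
        return None  # not a dotted-quad address
    return ".".join(parts[:octets])


addresses = ["192.168.1.20", "192.168.7.3", "10.0.0.1"]

# Same aggregation as the query: group by prefix, order by count descending.
counts = Counter(ip_prefix(addr) for addr in addresses)
top = counts.most_common(5)
print(top)

# A fixed-width substring cuts short addresses in the wrong place.
fixed_width = "10.0.0.1"[:7]
print(fixed_width, ip_prefix("10.0.0.1"))
```

Splitting on the dots keeps short prefixes such as 10.0 in one group, whereas the fixed-width substring would scatter them across several groups.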

　　　　e. Analyze the data with a Python script

　　　　　　https://cwiki.apache.org/confluence/display/Hive/GettingStarted#GettingStarted-MovieLensUserRatings
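In the spirit of the mapper script on the linked page, here is a minimal sketch of a streaming script for this log table. It assumes that Hive passes each row as one tab-separated line, that the input columns are remote_addr and time_local, and that time_local is already in yyyyMMddHHmmss form; the column choice and function names are illustrative only.

```python
import io
import sys


def transform(lines):
    """Map each 'remote_addr<TAB>time_local' row to 'ip_prefix<TAB>hour'."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 2:
            continue  # skip malformed rows instead of failing
        remote_addr, time_local = fields
        prefix = ".".join(remote_addr.split(".")[:2])
        hour = time_local[8:10]  # assumes yyyyMMddHHmmss
        yield "\t".join([prefix, hour])


def main():
    """Entry point when run by Hive: read rows from stdin, write to stdout."""
    for row in transform(sys.stdin):
        print(row)


# Local demonstration with made-up rows in place of Hive's input stream.
demo = io.StringIO("192.168.1.20\t20150831000437\n10.0.0.1\t20150831134501\n")
rows = list(transform(demo))
print(rows)
```

Saved under a hypothetical name such as log_mapper.py with the demonstration replaced by a call to main(), the script would be registered with add FILE and invoked through select transform (remote_addr, time_local) using 'python log_mapper.py' as (prefix, hour) from bf_log_comm;, following the pattern on the linked page.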