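Programming Hive: notes, part 1

Note: some material from "Practical Hive: A Guide to Hadoop's Data Warehouse System" (referred to below as Practical Hive) is also included.
Chapter 2: Basic operations
2.7 Command-line interface (pay close attention to which commands are typed in the shell and which are typed at the hive prompt; commands marked "shell" below are run in the shell, everything else is run inside hive)
2.7.1 CLI options
hive --help --service cli    # shell
2.7.2 Variables and properties
Four namespaces: hivevar (user-defined variables), hiveconf (hive configuration properties), system (configuration properties defined by java), env (environment variables defined by the shell)
set env:HADOOP_HOME; -- shows an environment variable, e.g. HADOOP_HOME (HIVE_HOME works the same way); the env prefix is the namespace
set prints the variables of all four namespaces above; set -v also prints the hadoop properties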


hive --define foo=bar    # shell
        hive> set foo;              -- show foo
              set hivevar:foo;      -- show the same variable through its namespace; --define stores it in hivevar
              set hivevar:foo=bar2; -- change the value of foo
              set hivevar:foo;
            Usage: create table toss1(i int, ${hivevar:foo} string);  create table toss2(i int, ${foo} string);
            set hivevar:foo=900920;
            select * from stock_basic where stock_id=${hivevar:foo} limit 100;

        Show the current database name in the hive prompt
            hive --hiveconf hive.cli.print.current.db=true    # shell
            or, inside hive:  set hiveconf:hive.cli.print.current.db=true;

            hive --hiveconf y=900920    # shell; or, inside hive:  set hiveconf:y=900920;
            select * from stock_basic where stock_id=${hiveconf:y} limit 100;

        system variables are readable and writable; env variables are read-only
            set system:user.name;
            set system:user.name=hadoop;  # works

            set env:HOME;
            set env:HOME=/home; # raises an error
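
A minimal Python sketch of the four namespaces and of ${...} substitution, to make the rules above concrete. It is a toy model, not hive's implementation: the class HiveVariables and its methods are made up for illustration. The only behavior it copies from these notes is that env is read-only and that an unprefixed ${foo} resolves to hivevar.

```python
import re


class HiveVariables:
    """Toy model of the hivevar / hiveconf / system / env namespaces."""

    def __init__(self, env):
        self._ns = {
            "hivevar": {},     # user-defined variables
            "hiveconf": {},    # hive configuration properties
            "system": {},      # java system properties (read/write)
            "env": dict(env),  # shell environment (read-only)
        }

    def set(self, namespace, name, value):
        # Mirrors "set env:HOME=/home;" failing while the other namespaces accept writes.
        if namespace == "env":
            raise PermissionError("env: variables are read-only")
        self._ns[namespace][name] = str(value)

    def substitute(self, text):
        """Replace ${ns:name} (or ${name}, which defaults to hivevar) in a statement."""
        def repl(match):
            namespace = match.group(1) or "hivevar"
            return self._ns[namespace][match.group(2)]

        pattern = r"\$\{(?:(hivevar|hiveconf|system|env):)?([^}]+)\}"
        return re.sub(pattern, repl, text)


variables = HiveVariables({"HOME": "/home/hadoop"})  # assumed environment
variables.set("hivevar", "foo", 900920)
print(variables.substitute("select * from stock_basic where stock_id=${hivevar:foo} limit 100"))
```

The substitution is purely textual and happens before the statement is parsed, which is why the variable can stand in for a column name (toss1 above) as well as for a value.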

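    2.7.3 One-shot commands: run one or more statements, then exit hive immediately

        hive -e "select * from stock_basic where stock_id=900920 limit 100";  # shell
        The output can also be redirected:  hive -e "select * from stock_basic where stock_id=900920 limit 100" > /tmp/stock_id.txt
        Usage: look up a property whose exact name you only half remember
            hive -S -e "set" | grep warehouse  # -S is silent mode, which drops lines such as OK and Time taken. Here it made no visible difference: those lines did not show up in the grep output without -S either
    2.7.4 Running HiveQL from a script
        hive -f /tmp/hsql.hql  # shell
        or, inside hive:  source /tmp/hsql.hql;

    2.7.5 The hiverc file
    Create a .hiverc file under $HOME. Every time the CLI starts it first runs the statements in that file, so putting system variables and other properties there saves typing them in every session.
    2.7.6 Tab completion
    2.7.7 History: hive records the commands in $HOME/.hivehistory; the up and down arrow keys also browse them
    2.7.8 Running shell commands:  ! pwd;
    2.7.9 Running hadoop dfs commands:  dfs -ls /;
    2.7.10 Comments:  --
    2.7.11 Show column names in query output
        set hive.cli.print.header=true;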

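Chapter 3: Data types and file formats

3.1 Primitive data types
    Hive's primitive types are implemented in Java, so the details of their behavior match the corresponding Java types.
3.2 Collection data types: struct, map, array
3.3 Text file encoding (delimiters): Ctrl+A (^A) separates fields, Ctrl+B (^B) separates the elements of a collection, Ctrl+C (^C) separates map keys from values, and the newline \n separates records (rows).
3.4 Schema on read: understood, not repeated here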

Chapter 4: Data definition
4.1 Databases in hive

List all databases:  show databases;     or
    show databases like 'financials*';
    Create a database
    create database if not exists financials4 comment 'all financials table'
    location 'hdfs://master:9000/hive_dir/financials4.db'
    with dbproperties('creator'='zhangyt','date'='2020-10-08');
    # hdfs://master:9000 is the default file system configured for hadoop (see core-site.xml). hive_dir is the directory hive uses on hdfs (see hive-site.xml)

    describe database financials2;  or  describe database extended financials2;    # extended shows more detail, mainly the dbproperties set when the database was created as above

    Use a database:   use financials;
    Drop a database:  drop database financials2;  # works when the database has no tables     drop database financials2 cascade;   # cascade is required when it has tables
4.2 Altering a database: only the dbproperties can be changed; the other metadata cannot
    alter database financials3 set dbproperties('edited-by'='zhangyt1');
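4.3 Creating tables
    create table stand_table_1 like stand_table;  # copies the table definition: if the source table is partitioned, the new table is partitioned too
    create [external] table [if not exists] stand_table(stand_a string comment 'column a', stand_b int comment 'column b')
    comment 'this is a stand table'
    row format delimited fields terminated by '\t'
    lines terminated by '\n'
    stored as textfile
    location '/hive_dir/stand_table';
    # row format delimited must come before the other delimiter clauses
    # location can point to any path; without it the table goes under the warehouse path from hive-site.xml
    # if both location and stored as textfile are present, stored as textfile must come before location
    # tblproperties('creator'='zhangyt','date'='2020-10-08') could not be added at this position; in hive's create table grammar it belongs after location

    List tables:  show tables;  or  show tables in financials2;  # in is followed by a database name    show tables 'stock*';  # pattern match
    describe stand_table;
    describe extended stand_table;

    load data [local] inpath '/tmp/hive_txt/ss.txt' [overwrite] into table ss_l;
    load data inpath '/input/ss.txt' overwrite into table ss;
    # Without local the path is an hdfs path. Note that this is a move: the file disappears from its original directory and ends up under hive's directory

    load data local inpath '/tmp/hive_txt/ss.txt' overwrite into table ss_l;
    # With local the path is on the local file system. The file is copied, so the local file is still there afterwards

    The difference (source file gone versus still present) comes from hdfs already being a distributed file system, so there is no need to keep a second copy on it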
4.4 Partitioned tables, managed tables
    create table   if not exists  stock_basic_partition(
    stock_name  string  comment 'stock name'
    ,stock_date  string   comment 'trade date'
    ,stock_start_price  DECIMAL(15,3)   comment 'opening price'
    ,stock_max_price   DECIMAL(15,3)   comment 'highest price'
    ,stock_min_price   DECIMAL(15,3)   comment 'lowest price'
    ,stock_end_price   DECIMAL(15,3)   comment 'closing price'
    ,stock_volume   DECIMAL(15,3)   comment 'trading volume'
    ,stock_amount   DECIMAL(15,3)   comment 'trading amount')
    comment  'stock basic information'
    partitioned by   (stock_id  string )
    row format delimited    fields  terminated  by   ','
    lines terminated by '\n' 
    stored as  textfile
    location '/hive_dir/stock_basic_partition';
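    Note, from Practical Hive: partitioning can improve the performance of certain queries, because partitions that cannot contribute to the result are pruned. This is called partition elimination.
    Guidelines for partitioning (not fully understood yet):
    Pick a partition key column whose number of distinct values is low to medium
    Avoid partitions smaller than 1 GB
    When the number of partitions is large, increase the memory of hiveserver2 and the hive metastore
    When several columns are used as partition keys, every combination of key values gets its own subdirectory in a nested tree. Avoid deep nesting, because it leads to too many partitions
    When inserting data through hive streaming, several sessions writing to the same partition can block each other on locks. For streaming see 6.5 (hive streaming) in Practical Hive: the hive streaming api is mainly used by the hive bolt together with storm. Not used so far.

    # the table comment must come before partitioned by (stock_id string)

    List partitions:  show partitions stock_basic_partition;       describe extended stock_basic_partition;
    # note the layout: the partition column goes last in the select list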
    insert  overwrite  table    stock_basic_partition partition ( stock_id ) select  
     stock_name                
    ,stock_date                
    ,stock_start_price         
    ,stock_max_price           
    ,stock_min_price           
    ,stock_end_price           
    ,stock_volume              
    ,stock_amount 
    ,stock_id       
    from  stock_basic  
    limit  200000;
    This only works in nonstrict mode
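
A rough Python sketch of what the dynamic partition insert above does with each row: the last value of the select list picks the partition directory, and the remaining values are what gets written into it. The function name route_to_partitions and the sample rows are invented for illustration; only the column=value directory naming follows hive's layout.

```python
from collections import defaultdict


def route_to_partitions(rows, table_dir, partition_col="stock_id"):
    """Group rows by their LAST value and return {partition directory: data rows}.

    The partition value is not stored with the data: it only appears in the
    directory name, which is why the partition column must come last in the
    select list.
    """
    partitions = defaultdict(list)
    for row in rows:
        *data, part_value = row
        directory = f"{table_dir}/{partition_col}={part_value}"
        partitions[directory].append(tuple(data))
    return dict(partitions)


# (stock_name, stock_date, stock_id): a shortened, made-up version of stock_basic
sample = [
    ("A", "2020-10-08", "600896"),
    ("B", "2020-10-08", "600897"),
    ("A", "2020-10-09", "600896"),
]
for directory, data in route_to_partitions(sample, "/hive_dir/stock_basic_partition").items():
    print(directory, data)
```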

    load data  local inpath   '/tmp/stock_id.txt'    into   table   stock_basic_partition   partition (stock_id='900920');

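    Dynamic partition mode:  set hive.exec.dynamic.partition.mode=strict;  requires at least one static partition value in every dynamic partition insert;  set hive.exec.dynamic.partition.mode=nonstrict;  lets all partition columns be dynamic.
    This setting only affects inserts. With it set to strict, a partitioned table can still be queried without a partition filter; the setting the book uses to forbid such queries is hive.mapred.mode=strict.
4.5 Dropping a table:  drop table stock_basic_partition;  for an external table only the metadata is dropped; the data files stay
4.6 Altering tables
Important: alter table changes only the metadata. The data files are not touched, so they have to be changed to match by hand.
    Rename a table:  alter table stock_basic_test rename to stock_basic_test1;
    Add partitions:  alter table stock_basic_partition add if not exists partition (stock_id='00000001') location '/hive_dir/stock_basic_partition/00000001'
              partition (stock_id='03000001') location '/hive_dir/stock_basic_partition/03000001';
    Drop a partition (so far this statement has only worked for one partition at a time)
    alter table stock_basic_partition drop if exists partition (stock_id='00000001');
    Move a column. The first attempt raised an error because change column needs both the old and the new column name, even when the name stays the same:
    alter table stock_basic_test CHANGE COLUMN stock_amount stock_amount DECIMAL(15,3) comment 'trading amount' AFTER stock_id;

    Add columns
    alter table stock_basic_test add COLUMNS (stock_other string COMMENT 'other information');
    Drop / replace columns
    The statement below drops stock_volume: list only the columns to keep, and make sure the remaining data still matches the column types once the later columns shift forward. The column has to be removed from the data files as well.
    In short, dropping a column this way is rarely worth the trouble.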
    alter  table   stock_basic_test replace  columns (
    stock_id string 
    ,stock_name  string  comment 'stock name'
    ,stock_date  string   comment 'trade date'
    ,stock_start_price  DECIMAL(15,3)   comment 'opening price'
    ,stock_max_price   DECIMAL(15,3)   comment 'highest price'
    ,stock_min_price   DECIMAL(15,3)   comment 'lowest price'
    ,stock_end_price   DECIMAL(15,3)   comment 'closing price'
    ,stock_amount   DECIMAL(15,3)   comment 'trading amount');

    create table  stock_basic_test as select  * from  stock_basic  limit  2000;

    Alter table properties
    alter table stock_basic_test set tblproperties('notes'='hahaha');
    Addendum, from Practical Hive 4.4.6 (table properties):
    Important table properties: last_modified_user, last_modified_time, immutable, orc.compress, skip.header.line.count
    1. immutable: when this property is true, data cannot be inserted unless the table is empty.
        create table stock_basic_test_immutable like stock_basic;
        insert into stock_basic_test_immutable select * from stock_basic limit 100;
        alter table stock_basic_test_immutable set tblproperties('immutable'='true');
        ## check whether an insert still goes through
        insert into stock_basic_test_immutable select * from stock_basic limit 100;  ## fails
        alter table stock_basic_test_immutable set tblproperties('immutable'='false');
        insert into stock_basic_test_immutable select * from stock_basic limit 100;  ## succeeds
    2. skip.header.line.count
        create table stock_basic_test_skip(a string, b string) row format delimited fields terminated by ',' lines terminated by '\n';
        alter table stock_basic_test_skip set tblproperties('skip.header.line.count'='1');
        load data local inpath '/tmp/stock_basic_test_skip.txt' overwrite into table stock_basic_test_skip;   # tested, works
    skip.header.line.count skips the header line(s) of the data files; it is one of the important features for hive external tables
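
skip.header.line.count is applied when the file is read; the header line stays in the data file. A small Python sketch of that read-time behavior, assuming a comma-delimited text file like the one loaded above (read_rows and its parameters are hypothetical names, not a hive api):

```python
import csv
import io


def read_rows(text, skip_header_lines=0, delimiter=","):
    """Parse delimited text and drop the first skip_header_lines rows while reading."""
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    return rows[skip_header_lines:]


raw = "a,b\n1,2\n3,4\n"                     # made-up file content with one header line
print(read_rows(raw))                       # header comes back as a data row
print(read_rows(raw, skip_header_lines=1))  # header skipped
```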

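    Alter storage properties
    alter table stock_basic_test set fileformat sequencefile;  # sequencefile can be replaced with textfile
    alter table stock_basic_partition partition (stock_id='600909') set fileformat sequencefile;

        4.6.8 Miscellaneous alter table statements
        Hooks: not understood (skipped)
        Pack the files of a partition (partitions only) into a hadoop archive (har file). This lowers the file count and so eases the pressure on the namenode, but it does not save storage space
        Prerequisite: enable archiving with  set hive.archive.enabled=true;
        If this fails with java.lang.NoClassDefFoundError: org/apache/hadoop/tools/HadoopArchives
        copy hadoop-archives-3.1.2.jar from hadoop's lib directory into hive's lib directory
        alter table stock_basic_partition archive partition (stock_id='600908');
        and the reverse
        alter table stock_basic_partition unarchive partition (stock_id='600908');
        After archiving, it is worth opening the hdfs web page to see how the files under the hive directory are now laid out

        Protect a partition from being dropped or queried. These statements raised errors and a web search did not turn up a fix; to be revisited (a likely cause: no_drop and offline were removed in Hive 2.0.0)
        alter table stock_basic_partition partition (stock_id='600908') enable no_drop;

        alter table stock_basic_partition partition (stock_id='600908') enable offline;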

Chapter 5: Data manipulation
5.1 Loading data into managed tables
Note how the variable is used
hive -e "select * from stock_basic where stock_id=900920" > /tmp/stock_id_equal_900920.txt   # shell; generate the file first
set hiveconf:loc_txt=/tmp/stock_id_equal_900920.txt;   # no quotes here
load data local inpath '${hiveconf:loc_txt}' overwrite into table stock_basic_partition partition (stock_id='900920');


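Note: without overwrite, if a newly loaded file has the same name as a file already in the table, a numbered suffix is added, e.g. stock_id_equal_900920.txt and stock_id_equal_900920_copy_1.txt
5.2 Inserting data into a table from a query
    insert into table stock_basic_partition
    partition(stock_id=900922)
    select * from stock_basic where stock_id=900922;
    # This is wrong, take note: select * also returns stock_id, so the select list has one column more than the non-partition columns of the target table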

    insert into  table  stock_basic_partition 
    partition( stock_id=900922)
    select 
    stock_name   
    ,stock_date   
    ,stock_start_price
    ,stock_max_price  
    ,stock_min_price  
    ,stock_end_price  
    ,stock_volume    
    ,stock_amount   
    from  stock_basic where   stock_id=900922;
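    # This confirms a Huawei exam question: with a static partition, the select list has fewer columns than the partitioned table, and the missing one is the partition column
    Dynamic partition insert: writing one statement per partition as above is too slow
    set hive.exec.dynamic.partition.mode=nonstrict;  # switch to nonstrict mode
    # note the layout: the partition column goes last in the select list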
    insert  overwrite  table    stock_basic_partition partition ( stock_id ) select  
     stock_name                
    ,stock_date                
    ,stock_start_price         
    ,stock_max_price           
    ,stock_min_price           
    ,stock_end_price           
    ,stock_volume              
    ,stock_amount 
    ,stock_id       
    from  stock_basic     ;
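    This only works in nonstrict mode

    Other dynamic partition properties (look them up as needed)
    hive.exec.dynamic.partition
    hive.exec.dynamic.partition.mode
5.3 Creating a table and loading data in a single query: ctas
     create table ... as select ... from ...
     Practical Hive: the target table of a ctas cannot be an external, partitioned or bucketed table. Verified: even though the source table here is partitioned, the target table is not, unlike with like
    create table stock_basic_partition_ctas as select * from stock_basic_partition;
5.4 Exporting data. Note that hadoop fs -cp works on hdfs, not on the local file system
    If the files already have a suitable format, copy them directly
    hadoop fs -cp source_path target_path
    hadoop fs -cp hdfs://master:9000/hive_dir/date_stock /tmp/data_from_hive   # shell
    # This does not produce a local copy: -cp copies within hdfs, so /tmp/data_from_hive is an hdfs path. hadoop fs -get (or -copyToLocal) copies to the local file system
    insert overwrite local directory '/tmp/data_from_hive' select * from date_stock;  # inside hive; the directory it generates looks odd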

    Mirroring multi-table inserts, the output of one query can also be written to several directories
    from stock_basic_partition sp
    insert overwrite directory '/tmp/600896'
        select * where stock_id=600896
    insert overwrite directory '/tmp/600897'
        select * where stock_id=600897
    insert overwrite directory '/tmp/600898'
        select * where stock_id=600898;
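
The benefit of the from ... insert ... insert ... form is that the source table is scanned once and every row is tested against each where clause during that single pass. A minimal Python sketch of the idea (multi_insert, the predicates and the sample rows are all illustrative, not hive internals):

```python
def multi_insert(rows, targets):
    """Scan rows once; append each row to every target whose predicate accepts it.

    targets maps an output directory to a predicate, like the
    insert overwrite directory ... select * where ... pairs above.
    """
    outputs = {directory: [] for directory in targets}
    for row in rows:  # single pass over the source
        for directory, predicate in targets.items():
            if predicate(row):
                outputs[directory].append(row)
    return outputs


sample = [{"stock_id": 600896, "v": 1}, {"stock_id": 600897, "v": 2}, {"stock_id": 600896, "v": 3}]
result = multi_insert(sample, {
    "/tmp/600896": lambda r: r["stock_id"] == 600896,
    "/tmp/600897": lambda r: r["stock_id"] == 600897,
    "/tmp/600898": lambda r: r["stock_id"] == 600898,
})
for directory, rows_out in result.items():
    print(directory, len(rows_out))
```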

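Chapter 6: HiveQL queries
6.1 select ... from statements
6.1.1 Specifying columns with a regular expression: of limited use; it is meant for columns whose names follow a structured pattern
select a, `b.*` from aa;
6.1.4 Using functions
Aggregate functions:  set hive.map.aggr=true;  aggregates in the map phase, at the cost of considerably more memory
6.1.9 When mapreduce can be avoided
Most queries trigger a mapreduce job, which is often unnecessary. Local mode lets hive run small jobs locally:
set hive.exec.mode.local.auto=true;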
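6.2 where clauses
6.2.2 Comparing floating-point numbers
0.2 cannot be represented exactly. Stored as a float it is about 0.2000000030 once widened to double, while the double literal 0.2 is about 0.2000000000000000111
So select * from stock_basic_partition where stock_amount>0.2 can return records whose value is 0.2 when the column is a float
Rewrite it as select * from stock_basic_partition where stock_amount> cast(0.2 as float)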
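
The effect can be reproduced outside hive. The Python sketch below rounds 0.2 to a 32-bit float with the struct module and compares it with the 64-bit literal, which is the situation a float column meets in the where clause above. The helper name as_float32 is made up for this example.

```python
import struct


def as_float32(x):
    """Round a Python float (a 64-bit double) to the nearest 32-bit float."""
    return struct.unpack("f", struct.pack("f", x))[0]


stored = as_float32(0.2)  # what a float column holds for 0.2, widened back to double
literal = 0.2             # the literal in the query is a double

print(repr(stored))                  # 0.20000000298023224
print(stored > literal)              # True: the row "equal to 0.2" passes > 0.2
print(stored > as_float32(literal))  # False: casting the literal to float fixes it
```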
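6.4 join statements
hive only supports equi-joins, and the on clause cannot contain or. A condition such as on a.a=b.a and a.b>b.b is accepted, though
Put the largest table on the right in a join: hive processes the tables from left to right and buffers the earlier (left) tables while streaming the last one
left semi join is an optimized form of in / exists
mapjoin does not support right join or full join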
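
The idea behind a map join is that the small table fits in memory, so each mapper can look up matches while streaming the big table and no reduce phase is needed. A simplified Python sketch of an inner equi-join done that way (map_join and the sample tables are illustrative only):

```python
def map_join(big_rows, small_rows, key):
    """Inner equi-join: hash the small table in memory, then stream the big one."""
    lookup = {}
    for row in small_rows:  # build phase: the small table is held in memory
        lookup.setdefault(row[key], []).append(row)
    for big in big_rows:    # probe phase: one pass over the big table
        for small in lookup.get(big[key], []):
            yield {**small, **big}


names = [{"stock_id": "600896", "stock_name": "A"}]  # small table (made up)
prices = [                                           # big table (made up)
    {"stock_id": "600896", "stock_end_price": 10.5},
    {"stock_id": "600897", "stock_end_price": 7.2},
]
print(list(map_join(prices, names, "stock_id")))
```

Only keys present in the in-memory table can be matched while streaming, which gives some intuition for why right and full outer joins do not fit this scheme: rows of the small table that never match would go unnoticed by any single mapper.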
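6.5 order by and sort by
order by is a global sort: all the data passes through a single reducer. sort by only sorts within each reducer
6.6 distribute by with sort by
distribute by controls how the map output is divided among the reducers. By default the key-value pairs emitted by the mappers are spread evenly over the reducers,
so the same value can show up in the output of several reducers. distribute by guarantees that rows with the same key go to the same reducer
For the stock data, distribute by stock_id sort by stock_id,stock_amount keeps all rows of one stock id together.
select * from stock_basic_partition distribute by stock_id sort by stock_id,stock_amount;
Note: distribute by must come before sort by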
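
A small Python model of distribute by followed by sort by: rows are first assigned to a reducer by hashing the distribute key, then each reducer sorts only its own rows. zlib.crc32 stands in for hive's hash function, which is an assumption made for this sketch, and the function names are invented.

```python
import zlib


def reducer_for(key, num_reducers):
    """Pick a reducer from a stable hash of the key (crc32 is a stand-in)."""
    return zlib.crc32(str(key).encode()) % num_reducers


def distribute_and_sort(rows, num_reducers, distribute_key, sort_key):
    """distribute by: same key, same reducer. sort by: order within each reducer."""
    reducers = [[] for _ in range(num_reducers)]
    for row in rows:
        reducers[reducer_for(distribute_key(row), num_reducers)].append(row)
    return [sorted(part, key=sort_key) for part in reducers]


# (stock_id, stock_amount) pairs, made up
sample = [("600896", 3.0), ("600897", 1.0), ("600896", 1.0), ("600898", 2.0)]
parts = distribute_and_sort(sample, 2, lambda r: r[0], lambda r: (r[0], r[1]))
for i, part in enumerate(parts):
    print("reducer", i, part)
```

Each stock_id ends up in exactly one reducer and is ordered there, but the combined output is not globally sorted; that is the difference from order by.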
6.7 cluster by
When the sort by and distribute by columns in 6.6 are the same, cluster by can be used instead
select * from stock_basic_partition cluster by stock_id;
6.9 Sampling queries (not fully understood)
Random sampling with rand()
select * from stock_basic_partition tablesample(bucket 3 out of 10 on rand());
Sampling by a column
select * from stock_basic_partition tablesample(bucket 3 out of 51 on stock_id);
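
One way to read bucket x out of y on col: hash the column value, take it modulo y, and keep the rows that land in bucket x (buckets are numbered from 1). The Python sketch below follows that reading; zlib.crc32 replaces hive's own hash function, so the rows selected here will not be the same rows hive would pick. With on rand() the hashed value is a random number instead of a column, so the sample changes on every run.

```python
import zlib


def bucket_sample(rows, x, y, key):
    """Keep the rows whose hashed key falls into bucket x of y (x counts from 1)."""
    return [row for row in rows if zlib.crc32(str(key(row)).encode()) % y == x - 1]


stock_ids = [str(600000 + i) for i in range(1000)]  # made-up ids
sample = bucket_sample(stock_ids, 3, 51, key=lambda s: s)
print(len(sample), "of", len(stock_ids), "rows sampled")
```

Sampling on a column is repeatable and keeps all rows of a given stock_id together, since the same id always hashes to the same bucket.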
Block sampling
select * from stock_basic tablesample(0.1 percent);
