1. Hive Introduction
Hive is a big-data analysis tool built on top of the Hadoop file system. Using familiar SQL syntax it makes data summarization, ad-hoc queries, and analysis of large datasets straightforward, and it provides UDFs (user-defined functions) for analyzing and aggregating data, as sketched below.
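As a small illustration of the kind of query this enables, here is a minimal HiveQL sketch of a summarization query; the web_logs table, its columns, and the UDF class name are hypothetical and serve only as an example.

-- Hypothetical log table; built-in functions (count, avg, upper, ...) cover most summarization needs.
SELECT upper(country)  AS country,
       count(*)        AS visits,
       avg(bytes_sent) AS avg_bytes
FROM web_logs
GROUP BY upper(country);

-- A custom UDF is packaged into a jar, registered, and then used like any built-in function.
-- 'com.example.hive.MyLowerUdf' and the jar path are placeholders, not real artifacts.
ADD JAR /path/to/my-udfs.jar;
CREATE TEMPORARY FUNCTION my_lower AS 'com.example.hive.MyLowerUdf';
SELECT my_lower(url) FROM web_logs LIMIT 10;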
Hive's data is organized as:
Databases: similar to databases in MySQL; namespaces used to keep different tables apart;
Tables: DDL defines the column names and data formats up front, DML operates on the row-and-column datasets;
Partitions: data is partitioned by specific columns, and each partition is written to its own directory in the file system;
Buckets (or Clusters): you define the bucketing column(s) and the number of hash buckets; when a row is inserted, the column value's hash modulo the number of buckets decides which file it is written to (see the DDL sketch after this list).
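A minimal HiveQL sketch of partitioning and bucketing, assuming made-up table and column names (page_view, dt, user_id, ...):

-- One directory per dt partition value under the table's warehouse directory;
-- within each partition, rows go to one of 4 bucket files chosen by hash(user_id) % 4.
CREATE TABLE page_view (
  user_id  INT,
  url      STRING,
  referrer STRING
)
PARTITIONED BY (dt STRING)
CLUSTERED BY (user_id) INTO 4 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;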
Hive architecture:
At startup Hive stores the table schema (metadata) in Derby or another relational database, while the data rows live in the HDFS file system. When a client issues a query through JDBC or the corresponding client, Hive parses the SQL, optimizes it using the metastore (table schema), turns it into MapReduce jobs, executes them, and returns the result (architecture diagram taken from the web, not reproduced here).
Hive is suited to statistical queries from a small number of users; it is not suited to serving large volumes of user requests.
2. Hive Installation
1) Hive standalone installation (see the official wiki: https://cwiki.apache.org/confluence/display/Hive/Home)
Hive uses Hadoop to store its data, so a Hadoop environment is required; it uses a relational database to store the table schema. Hive ships with Derby and uses it by default. A Hadoop cluster must be set up first.
The environment follows http://blog.csdn.net/liaomin416100569/article/details/78360734
Host information:
/etc/hosts
192.168.58.147 node1
192.168.58.149 node2
192.168.58.150 node3
192.168.58.151 node4
Node roles:
NameNode: node1 (nameserviceid: node1), node2 (nameserviceid: node2)
SecondaryNameNode: node1, node2
DataNodes: node2, node3, node4
ResourceManager: node1
NodeManagers: node2, node3, node4
My cluster uses HDFS federation, so in node1's core-site.xml I changed the default filesystem from viewfs to this node:
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://node1</value>
</property>
The standalone Hive is likewise installed on node1.
Install JDK 1.7 or later (already present since Hadoop is installed).
Download a Hive release, apache-hive-2.2.0-bin.tar.gz (hive.apache.org).
Extract the Hive package to /soft/hive-2.2.0.
Add to /etc/profile:
HIVE_HOME=/soft/hive-2.2.0
export HIVE_HOME
PATH=$PATH:${HIVE_HOME}/bin
export PATH
Run this command so it takes effect immediately:
source /etc/profile
Create a few directories on node1's HDFS for storing data:
hdfs dfs -mkdir /tmp
hdfs dfs -mkdir -p /user/hive/warehouse
hdfs dfs -chmod g+w /tmp
hdfs dfs -chmod g+w /user/hive/warehouse
In standalone mode only one user can access the metadata files at a time. This mode does not open any port; a connection goes straight to the metadata and the file system. (Hive reads the Hadoop configuration to find the default HDFS address, so HADOOP_HOME must be set.)
Embedded Derby database managing the metadata:
Delete $HIVE_HOME/conf/metastore_db, or back it up under another name:
mv $HIVE_HOME/conf/metastore_db $HIVE_HOME/conf/metastore_db_tmp
In conf/hive-default.xml.template you can find the following four options: the JDBC URL, the driver class, the username, and the password.
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <!-- Derby manages the metadata by default -->
  <value>jdbc:derby:;databaseName=metastore_db;create=true</value>
  <description>
    JDBC connect string for a JDBC metastore.
    To use SSL to encrypt/authenticate the connection, provide database-specific SSL flag in the connection URL.
    For example, jdbc:postgresql://myhost/db?ssl=true for postgres database.
  </description>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>mine</value>
  <description>password to use against metastore database</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>APP</value>
  <description>Username to use against metastore database</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>org.apache.derby.jdbc.EmbeddedDriver</value>
  <description>Driver class name for a JDBC metastore</description>
</property>
If you need to store the metadata in another database, put that database's JDBC driver jar in ${HIVE_HOME}/lib and change these four values accordingly.
The databases Hive supports can be seen as follows:
[root@node1 upgrade]# pwd
/soft/hive-2.2.0/scripts/metastore/upgrade
[root@node1 upgrade]# ll
total 24
drwxr-xr-x 2 root root   89 Oct 27 01:39 azuredb
drwxr-xr-x 2 root root 4096 Oct 27 01:39 derby
drwxr-xr-x 2 root root 4096 Oct 27 01:39 mssql
drwxr-xr-x 2 root root 4096 Oct 27 01:39 mysql
drwxr-xr-x 2 root root 4096 Oct 27 01:39 oracle
drwxr-xr-x 2 root root 4096 Oct 27 01:39 postgres
Initialize Hive using the embedded Derby to manage the metadata (that is, the table schema data):
schematool -dbType derby -initSchema
Once initialization succeeds, use the hive command to enter the client console:
hive
You can then create databases, work with tables, and so on using MySQL-like syntax; a small example follows.
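A minimal sequence of statements you could try at the hive> prompt (the database and table names here are just examples):

CREATE DATABASE demo;
USE demo;
CREATE TABLE emp (id INT, name STRING, salary DOUBLE);
INSERT INTO emp VALUES (1, 'tom', 3000.0);          -- runs as a MapReduce job
SELECT name, salary FROM emp WHERE salary > 1000;
DROP TABLE emp;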
If you hit connection problems along the way, look at hive-log4j2.properties.template to see where the log file lives.
The default location is property.hive.log.dir = ${sys:java.io.tmpdir}/${sys:user.name}
That is the /tmp/root directory, which contains a file named hive.log.
2) Hive server installation (hiveserver2 + beeline)
In server mode Hive listens on a published port, and multiple clients connect to that port over the JDBC API to run statistical queries. This mode is suited to multi-user scenarios.
While we're at it, this section also demonstrates keeping the metadata in MySQL.
For example, the MySQL database on my local Windows machine: 192.168.58.1, port 3306, user root, password 123456.
Create a new file hive-site.xml under $HIVE_HOME/conf, copy in the four properties, and change them for MySQL:
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://192.168.58.1:3306/metadb</value>
    <description>
      JDBC connect string for a JDBC metastore.
      To use SSL to encrypt/authenticate the connection, provide database-specific SSL flag in the connection URL.
      For example, jdbc:postgresql://myhost/db?ssl=true for postgres database.
    </description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>123456</value>
    <description>password to use against metastore database</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>root</value>
    <description>Username to use against metastore database</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
    <description>Driver class name for a JDBC metastore</description>
  </property>
</configuration>
Copy the MySQL driver jar into ${HIVE_HOME}/lib.
On the 58.1 MySQL server, create the metadb database:
create database metadb
On node1, initialize metadb:
[root@node1 lib]# schematool -dbType mysql -initSchema
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/soft/hive-2.2.0/lib/log4j-slf4j-impl-2.6.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/soft/hadoop-2.7.4/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Metastore connection URL:        jdbc:mysql://192.168.58.1:3306/metadb
Metastore Connection Driver :    com.mysql.jdbc.Driver
Metastore connection User:       root
Starting metastore schema initialization to 2.1.0
Initialization script hive-schema-2.1.0.mysql.sql
Initialization script completed
On the Windows machine, check whether the metadb database now contains a set of tables:
mysql> use metadb
Database changed
mysql> show tables;
+---------------------------+
| Tables_in_metadb          |
+---------------------------+
| aux_table                 |
| bucketing_cols            |
| cds                       |
| columns_v2                |
| compaction_queue          |
| completed_compactions     |
| completed_txn_components  |
| database_params           |
| db_privs                  |
| dbs                       |
| delegation_tokens         |
| func_ru                   |
| funcs                     |
| global_privs              |
| hive_locks                |
| idxs                      |
| index_params              |
| key_constraints           |
| master_keys               |
| next_compaction_queue_id  |
| next_lock_id              |
| next_txn_id               |
| notification_log          |
| notification_sequence     |
| nucleus_tables            |
| part_col_privs            |
| part_col_stats            |
| part_privs                |
| partition_events          |
| partition_key_vals        |
| partition_keys            |
| partition_params          |
| partitions                |
| role_map                  |
| roles                     |
| sd_params                 |
| sds                       |
| sequence_table            |
| serde_params              |
| serdes                    |
| skewed_col_names          |
| skewed_col_value_loc_map  |
| skewed_string_list        |
| skewed_string_list_values |
| skewed_values             |
| sort_cols                 |
| tab_col_stats             |
| table_params              |
| tbl_col_privs             |
| tbl_privs                 |
| tbls                      |
| txn_components            |
| txns                      |
| type_fields               |
| types                     |
| version                   |
| write_set                 |
+---------------------------+
57 rows in set (0.01 sec)
Make sure Hadoop is installed on this machine and that all Hadoop and YARN daemons are running. The configuration below uses Hadoop's proxy-user (impersonation) mechanism.
Modify Hadoop's core-site.xml:
<property>
  <name>hadoop.proxyuser.USERNAME.hosts</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.USERNAME.groups</name>
  <value>*</value>
</property>
USERNAME can be whatever you like; * means every host and group in the Hadoop cluster. You can impersonate with a username you choose yourself; it's like carrying a work badge, so you no longer have to announce your name.
By default Hive connects with an anonymous account that has no permissions, so when connecting we configure a proxy user with -n <proxy user>, which then gives access to all files under Hadoop.
For more on proxy users, see http://hadoop.apache.org/docs/r2.7.3/hadoop-project-dist/hadoop-common/Superusers.html
I'll assume Hive connects as root:
<property>
  <name>hadoop.proxyuser.root.hosts</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.root.groups</name>
  <value>*</value>
</property>
Start the service on node1 (58.147):
hiveserver2
The default listening port is 10000 (used for JDBC connections); 10002 serves the web monitoring UI.
Install Hive on the other machines as well.
beeline -u jdbc:hive2://192.168.58.147:10000 -n root connects as the root user; create a table and run queries to test.
Note: YARN and Hadoop must all be fully started (check with jps), because inserting test data triggers MapReduce.
Client test:
[root@node2 bin]# beeline -u jdbc:hive2://192.168.58.147:10000 -n root
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/soft/hive-2.2.0/lib/log4j-slf4j-impl-2.6.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/soft/hadoop-2.7.4/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Connecting to jdbc:hive2://192.168.58.147:10000
Connected to: Apache Hive (version 2.2.0)
Driver: Hive JDBC (version 2.2.0)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Beeline version 2.2.0 by Apache Hive
0: jdbc:hive2://192.168.58.147:10000> show databases;
+----------------+--+
| database_name  |
+----------------+--+
| default        |
| test           |
+----------------+--+
2 rows selected (4.615 seconds)
0: jdbc:hive2://192.168.58.147:10000> drop database test;
Error: Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. InvalidOperationException(message:Database test is not empty. One or more tables exist.) (state=08S01,code=1)
0: jdbc:hive2://192.168.58.147:10000> create database hello;
No rows affected (0.96 seconds)
0: jdbc:hive2://192.168.58.147:10000> use hello;
No rows affected (0.252 seconds)
0: jdbc:hive2://192.168.58.147:10000> create table tt(id int,name string);
No rows affected (1.403 seconds)
0: jdbc:hive2://192.168.58.147:10000> insert into tt values(1,'zs');
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
No rows affected (45.633 seconds)
0: jdbc:hive2://192.168.58.147:10000> select * from tt;
+--------+----------+--+
| tt.id  | tt.name  |
+--------+----------+--+
| 1      | zs       |
+--------+----------+--+
1 row selected (1.515 seconds)
On the hiveserver2 console you can see that inserting data starts a MapReduce job.
Test successful.
Use help to see all of beeline's commands:
Beeline version 2.2.0 by Apache Hive
beeline> help
!addlocaldriverjar   Add driver jar file in the beeline client side.
!addlocaldrivername  Add driver name that needs to be supported in the beeline client side.
!all                 Execute the specified SQL against all the current connections
!autocommit          Set autocommit mode on or off
!batch               Start or execute a batch of statements
!brief               Set verbose mode off
!call                Execute a callable statement
!close               Close the current connection to the database
!closeall            Close all current open connections
!columns             List all the columns for the specified table
!commit              Commit the current transaction (if autocommit is off)
!connect             Open a new connection to the database.
!dbinfo              Give metadata information about the database
!describe            Describe a table
!dropall             Drop all tables in the current database
!exportedkeys        List all the exported keys for the specified table
!go                  Select the current connection
!help                Print a summary of command usage
!history             Display the command history
!importedkeys        List all the imported keys for the specified table
!indexes             List all the indexes for the specified table
!isolation           Set the transaction isolation for this connection
!list                List the current connections
!manual              Display the BeeLine manual
!metadata            Obtain metadata information
!nativesql           Show the native SQL for the specified statement
!nullemptystring     Set to true to get historic behavior of printing null as empty string. Default is false.
!outputformat        Set the output format for displaying results (table,vertical,csv2,dsv,tsv2,xmlattrs,xmlelements, and deprecated formats(csv, tsv))
!primarykeys         List all the primary keys for the specified table
!procedures          List all the procedures
!properties          Connect to the database specified in the properties file(s)
!quit                Exits the program
!reconnect           Reconnect to the database
!record              Record all output to the specified file
!rehash              Fetch table and column names for command completion
!rollback            Roll back the current transaction (if autocommit is off)
!run                 Run a script from the specified file
!save                Save the current variabes and aliases
!scan                Scan for installed JDBC drivers
!script              Start saving a script to a file
!set                 Set a beeline variable
!sh                  Execute a shell command
!sql                 Execute a SQL command
!tables              List all the tables in the database
!typeinfo            Display the type map for the current connection
!verbose             Set verbose mode on
Comments, bug reports, and patches go to ???
Exit with !quit.
3) HCatalog non-interactive client
This reuses the hiveserver2 metastore from section 2.
hcat can run SQL directly from the command line; it is mainly used for DDL statements.
Create the log directory:
mkdir -p /soft/hive-2.2.0/hcatalog/var/log
Start the service:
[root@node1 sbin]# ./hcat_server.sh start
Started metastore server init, testing if initialized correctly...
Metastore initialized successfully on port[9083].
On the current machine, run:
[root@node1 bin]# ./hcat -e "create table ggg(id int,name string)"
Use hive to check whether the table was created:
[root@node1 log]# hive
which: no hbase in (/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/root/bin:/root/mongodb-linux-x86_64-rhel70-3.4.9/bin:/soft/hadoop-2.7.4/bin:/soft/hadoop-2.7.4/sbin:/soft/hive-2.2.0/bin)
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/soft/hive-2.2.0/lib/log4j-slf4j-impl-2.6.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/soft/hadoop-2.7.4/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Logging initialized using configuration in jar:file:/soft/hive-2.2.0/lib/hive-common-2.2.0.jar!/hive-log4j2.properties Async: true
Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
hive> show databases;
OK
default
hello
test
Time taken: 3.325 seconds, Fetched: 3 row(s)
hive> show tables;
OK
ggg
Time taken: 0.209 seconds, Fetched: 1 row(s)
hive> desc ggg;
OK
id                      int
name                    string
Time taken: 0.926 seconds, Fetched: 2 row(s)
For other usage, see https://cwiki.apache.org/confluence/display/Hive/HCatalog+UsingHCat
HCatalog only supports local access; for remote access it must be paired with the WebHCat service.
WebHCat lets you operate Hive through REST-style URLs.
For details, see:
Installation: https://cwiki.apache.org/confluence/display/Hive/GettingStarted#GettingStarted-ConfigurationManagementOverview
Introduction: https://cwiki.apache.org/confluence/display/Hive/WebHCat+UsingWebHCat