I. Impala architecture
Impala is a real-time, interactive SQL query engine for big data, developed by Cloudera under the inspiration of Google's Dremel. Instead of the slow Hive+MapReduce batch path, Impala uses a distributed query engine similar to those in commercial MPP relational databases (made up of three parts: Query Planner, Query Coordinator, and Query Exec Engine). It can query data directly from HDFS or HBase with SELECT, JOIN, and aggregate functions, which greatly reduces latency.
[Impala architecture diagram]
Impala consists of three services: impalad (executes query fragments on each data node), statestored (tracks cluster membership and health), and catalogd (propagates metadata changes to the impalads).
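To see which of these daemons a given node is actually running, a quick shell check (a minimal sketch; the host layout follows the install section below):

# list any Impala daemons running on this node
ps -ef | egrep 'impalad|statestored|catalogd' | grep -v egrep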
Execution plan:
Impala parses the SQL into an execution plan that takes the form of a complete plan tree, which distributes naturally to the individual impalads for execution. Once the plan is distributed, Impala fetches results in a pull-based fashion, streaming result data up the execution tree for aggregation; this removes the steps of writing intermediate results to disk and then reading them back.
Impala's frontend (written in Java) is responsible for turning SQL into an execution plan, in two phases: single-node plan generation, then parallelization and fragmentation. The first phase parses, analyzes, and optimizes the SQL (both RBO and CBO; the only statistics currently available are table size and per-column NDV, with no histograms). The second phase produces the distributed plan: deciding whether exchange nodes are needed (i.e., whether there is a partitioned join or hash aggregation), choosing the join strategy (partitioned join or broadcast join), and finally cutting the plan into fragments at the exchange boundaries; the fragment is Impala's basic unit of execution.
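To inspect the fragments and exchange nodes the planner produces, you can EXPLAIN a query. A minimal sketch against the page_view table built later in this post (plan output varies with version and available statistics):

-- raise the plan detail level, then show the distributed plan
SET EXPLAIN_LEVEL=2;
EXPLAIN SELECT referrer_url, count(*)
FROM page_view
GROUP BY referrer_url;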
Advantages and disadvantages of Impala relative to Hive:

Advantages:
- It skips the Hive+MapReduce batch path entirely, querying HDFS or HBase directly through its distributed query engine, which greatly reduces latency.
- Intermediate results stream between plan fragments in memory rather than being written to disk and read back.
II. Installing Impala
Install through CDH; refer to the earlier environment write-up (https://blog.csdn.net/liaomin416100569/article/details/80045833), and make sure Hadoop and Hive are installed first.

In the CDH cluster, add the Impala service. Since Hadoop sits on cdh4 (single node) and Hive on cdh3 (single node):
- install the Catalog Server and StateStore on cdh3;
- the Impala Daemon must be installed on the data node, cdh4.
When installation completes, the install path is /opt/cloudera/parcels/CDH/lib/impala.
Note: if anything goes wrong, you can usually find details in the logs under the corresponding directory in /var/log.
Start the services.
III. impala-shell and SQL
Running the impala-shell command on any CDH node lets you work with Impala. Only cdh4 has the impalad process, so the shell there can connect directly; other machines specify the target with -i:
[root@cdh2 impala]# impala-shell -i cdh4
Starting Impala Shell without Kerberos authentication
Connected to cdh4:21000
Server version: impalad version 2.5.0-cdh5.7.6 RELEASE (build ecbba4f4e6d5eec6c33c1e02412621b8b9c71b6a)
***********************************************************************************
Welcome to the Impala shell. Copyright (c) 2015 Cloudera, Inc. All rights reserved.

(Impala Shell v2.5.0-cdh5.7.6 (ecbba4f) built on Tue Feb 21 14:54:50 PST 2017)

After running a query, type SUMMARY to see a summary of where time was spent.
***********************************************************************************
[cdh4:21000] >

Some of this command's options, explained:
root@cdh4 ~]# impala-shell --help
Usage: impala_shell.py [options]

Options:
  -i IMPALAD, --impalad=IMPALAD
                        <host:port> of the impalad server to connect to [default: cdh4:21000]
  -q QUERY, --query=QUERY
                        run a single SQL query straight from the command line
  -f QUERY_FILE, --query_file=QUERY_FILE
                        run the queries in a file, separated by ';' [default: none]
  -o OUTPUT_FILE, --output_file=OUTPUT_FILE
                        write the query results to a file
  --print_header        print column headers with the results [default: False]
  --output_delimiter=OUTPUT_DELIMITER
                        column delimiter for output rows [default: \t]
  -r, --refresh_after_connect
                        refresh the Impala catalog after connecting, syncing database and table metadata from the Hive metastore [default: False]
  -d DEFAULT_DB, --database=DEFAULT_DB
                        default database to use, equivalent to USE <database> [default: none]
  -u USER, --user=USER  user to log in as [default: root]

Common commands available once inside the shell are shown below.
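Putting several of these options together in one invocation (a hypothetical example; the database and table are created later in this post, and the output path is arbitrary):

# connect to cdh4, refresh the catalog, default to myimpala,
# and dump one query's results as comma-separated lines with a header
impala-shell -i cdh4 -r -d myimpala \
    --print_header --output_delimiter=',' \
    -q "select * from page_view" \
    -o /tmp/page_view.csv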
1. Creating a database (http://impala.apache.org/docs/build/html/topics/impala_create_database.html#create_database)
[cdh4:21000] > create database myimpala;
Query: create database myimpala

By default the database is created through Hive, so its directory lives under Hive's /user/hive/warehouse. Check:
[root@cdh4 ~]# hdfs dfs -ls /user/hive/warehouse
Found 1 items
drwxrwxrwt   - impala hive          0 2018-04-24 16:54 /user/hive/warehouse/myimpala.db
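The same can be confirmed from inside the shell (a quick sketch using standard Impala statements):

-- list the databases Impala knows about, then switch to the new one
SHOW DATABASES;
USE myimpala;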
2. Table operations (http://impala.apache.org/docs/build/html/topics/impala_tables.html)

An internal (managed) table means both the metadata and the file data are managed internally by Hive; dropping an internal table deletes all of its data. Tables are internal by default.
An external table means the file data is managed externally; dropping the table does not delete the data. External tables suit cases where several tables reference the same data.
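For contrast, a minimal external-table sketch (the HDFS path is hypothetical; dropping this table would leave the files under it in place):

CREATE EXTERNAL TABLE page_view_ext(
  viewTime STRING,
  userid BIGINT,
  page_url STRING,
  referrer_url STRING,
  ip STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/data/page_view_ext';   -- hypothetical path; data survives DROP TABLE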
The walkthrough below uses a simple internal-table example, exactly as in Hive.
Create the table:
CREATE TABLE page_view(
  viewTime INT,
  userid BIGINT,
  page_url STRING,
  referrer_url STRING,
  ip STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

Contents of a.txt in the /soft directory:
[root@cdh4 soft]# more a.txt
2015-12-13 11:56:20,1,www.baidu.com,www.qq.com,192.168.99.0
2015-12-13 10:56:20,1,www.baidu.com,www.qq.com,192.168.99.1
2015-12-13 9:56:20,1,www.baidu.com,www.qq.com,192.168.99.2
2015-12-13 11:56:20,1,www.baidu.com,www.qq.com,192.168.99.3
2015-12-13 44:56:20,1,www.baidu.com,www.qq.com,192.168.99.4

Upload the file to HDFS:
hdfs dfs -mkdir /im
hdfs dfs -put -f a.txt /im

Running LOAD DATA then reports an error:
[cdh4:21000] > LOAD DATA INPATH '/im/a.txt' INTO TABLE page_view;
Query: load DATA INPATH '/a.txt' INTO TABLE page_view
ERROR: AnalysisException: Unable to LOAD DATA from hdfs://cdh4:8020/a.txt because Impala does not have WRITE permissions on its parent directory hdfs://cdh4:8020/

Impala talks to HDFS as the impala user, which has no write permission here (Hadoop's operating user is hdfs; check ownership with hdfs dfs -ls /). Change the owner:
[root@cdh4 soft]# hadoop fs -chown -R impala:supergroup /im
chown: changing ownership of '/im': Non-super user cannot change owner
[root@cdh4 soft]# su - hdfs
[hdfs@cdh4 ~]$ hadoop fs -chown -R impala:supergroup /im

Retry the LOAD in the shell, then look at the data:
[cdh4:21000] > select * from page_view;
Query: select * from page_view
+----------+--------+---------------+--------------+--------------+
| viewtime | userid | page_url      | referrer_url | ip           |
+----------+--------+---------------+--------------+--------------+
| NULL     | 1      | www.baidu.com | www.qq.com   | 192.168.99.0 |
| NULL     | 1      | www.baidu.com | www.qq.com   | 192.168.99.1 |
| NULL     | 1      | www.baidu.com | www.qq.com   | 192.168.99.2 |
| NULL     | 1      | www.baidu.com | www.qq.com   | 192.168.99.3 |
| NULL     | 1      | www.baidu.com | www.qq.com   | 192.168.99.4 |
+----------+--------+---------------+--------------+--------------+
WARNINGS: Error converting column: 0 TO INT (Data is: 2015-12-13 11:56:20)
file: hdfs://cdh4:8020/user/hive/warehouse/myimpala.db/page_view/a.txt
record: 2015-12-13 11:56:20,1,www.baidu.com,www.qq.com,192.168.99.0
Error converting column: 0 TO INT (Data is: 2015-12-13 10:56:20)
file: hdfs://cdh4:8020/user/hive/warehouse/myimpala.db/page_view/a.txt
record: 2015-12-13 10:56:20,1,www.baidu.com,www.qq.com,192.168.99.1
Error converting column: 0 TO INT (Data is: 2015-12-13 9:56:20)
file: hdfs://cdh4:8020/user/hive/warehouse/myimpala.db/page_view/a.txt
record: 2015-12-13 9:56:20,1,www.baidu.com,www.qq.com,192.168.99.2
Error converting column: 0 TO INT (Data is: 2015-12-13 11:56:20)
file: hdfs://cdh4:8020/user/hive/warehouse/myimpala.db/page_view/a.txt
record: 2015-12-13 11:56:20,1,www.baidu.com,www.qq.com,192.168.99.3
Error converting column: 0 TO INT (Data is: 2015-12-13 44:56:20)
file: hdfs://cdh4:8020/user/hive/warehouse/myimpala.db/page_view/a.txt
record: 2015-12-13 44:56:20,1,www.baidu.com,www.qq.com,192.168.99.4
Fetched 5 row(s) in 1.37s

viewtime was defined as INT, so the timestamp strings could not be converted. Change the column type:
alter table page_view change viewTime viewTime STRING;
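To confirm the schema change took effect, a quick check (output abridged):

-- viewtime should now be reported with type 'string'
DESCRIBE page_view;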
Query again:
[cdh4:21000] > select * from page_view;
Query: select * from page_view
+---------------------+--------+---------------+--------------+--------------+
| viewtime            | userid | page_url      | referrer_url | ip           |
+---------------------+--------+---------------+--------------+--------------+
| 2015-12-13 11:56:20 | 1      | www.baidu.com | www.qq.com   | 192.168.99.0 |
| 2015-12-13 10:56:20 | 1      | www.baidu.com | www.qq.com   | 192.168.99.1 |
| 2015-12-13 9:56:20  | 1      | www.baidu.com | www.qq.com   | 192.168.99.2 |
| 2015-12-13 11:56:20 | 1      | www.baidu.com | www.qq.com   | 192.168.99.3 |
| 2015-12-13 44:56:20 | 1      | www.baidu.com | www.qq.com   | 192.168.99.4 |
+---------------------+--------+---------------+--------------+--------------+
Fetched 5 row(s) in 1.40s

3. Table partitions (http://impala.apache.org/docs/build/html/topics/impala_tables.html)
Using the same data as before, create a partitioned table in Parquet format. Create the table:
CREATE TABLE page_view_parquet(
  viewTime STRING,
  userid BIGINT,
  page_url STRING,
  referrer_url STRING,
  ip STRING
)
PARTITIONED BY(dt STRING, country STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS PARQUET;

Because the uploaded file /im/a.txt is plain text, it cannot be LOADed directly into a Parquet table (the format would not match); instead we insert the data. First add the partition:
alter table page_view_parquet add partition (dt='2015-12-13', country='CHINA');
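To confirm the partition was registered, a quick sketch:

-- lists each partition with its row count, size, and file format
SHOW PARTITIONS page_view_parquet;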
Insert a row to test:

[cdh4:21000] > insert into page_view_parquet partition (dt='2015-12-13', country='CHINA') values('2015-12-13 11:56:20',1,'www.baidu.com','www.baidu.com','192.168.7.7');
Query: insert into page_view_parquet partition (dt='2015-12-13', country='CHINA') values('2015-12-13 11:56:20',1,'www.baidu.com','www.baidu.com','192.168.7.7')
Inserted 1 row(s) in 0.80s
[cdh4:21000] > select * from page_view_parquet;
Query: select * from page_view_parquet
+---------------------+--------+---------------+---------------+-------------+------------+---------+
| viewtime            | userid | page_url      | referrer_url  | ip          | dt         | country |
+---------------------+--------+---------------+---------------+-------------+------------+---------+
| 2015-12-13 11:56:20 | 1      | www.baidu.com | www.baidu.com | 192.168.7.7 | 2015-12-13 | CHINA   |
+---------------------+--------+---------------+---------------+-------------+------------+---------+
Fetched 1 row(s) in 0.15s

Now convert the earlier page_view data into page_view_parquet, i.e., from TEXTFILE format to PARQUET:
[cdh4:21000] > insert into table page_view_parquet select * from page_view;
Query: insert into table page_view_parquet select * from page_view
ERROR: AnalysisException: Not enough partition columns mentioned in query. Missing columns are: dt, country
[cdh4:21000] > insert into table page_view_parquet partition (dt='2015-12-13', country='CHINA') select * from page_view;
Query: insert into table page_view_parquet partition (dt='2015-12-13', country='CHINA') select * from page_view
Inserted 5 row(s) in 1.49s
[cdh4:21000] > select * from page_view_parquet;
Query: select * from page_view_parquet
+---------------------+--------+---------------+---------------+--------------+------------+---------+
| viewtime            | userid | page_url      | referrer_url  | ip           | dt         | country |
+---------------------+--------+---------------+---------------+--------------+------------+---------+
| 2015-12-13 11:56:20 | 1      | www.baidu.com | www.baidu.com | 192.168.7.7  | 2015-12-13 | CHINA   |
| 2015-12-13 11:56:20 | 1      | www.baidu.com | www.qq.com    | 192.168.99.0 | 2015-12-13 | CHINA   |
| 2015-12-13 10:56:20 | 1      | www.baidu.com | www.qq.com    | 192.168.99.1 | 2015-12-13 | CHINA   |
| 2015-12-13 9:56:20  | 1      | www.baidu.com | www.qq.com    | 192.168.99.2 | 2015-12-13 | CHINA   |
| 2015-12-13 11:56:20 | 1      | www.baidu.com | www.qq.com    | 192.168.99.3 | 2015-12-13 | CHINA   |
| 2015-12-13 44:56:20 | 1      | www.baidu.com | www.qq.com    | 192.168.99.4 | 2015-12-13 | CHINA   |
+---------------------+--------+---------------+---------------+--------------+------------+---------+
Fetched 6 row(s) in 0.32s

From the CDH console, click into HDFS and open the NameNode web UI to inspect the files that were actually created on HDFS.
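As a closing aside, the static PARTITION clause above can also be written as a dynamic partition insert, where the partition values come from the last columns of the SELECT list. A minimal sketch (the literal dt and country values are simply the ones used above):

-- dynamic partition insert: dt and country are filled from the SELECT list
INSERT INTO page_view_parquet PARTITION (dt, country)
SELECT viewTime, userid, page_url, referrer_url, ip,
       '2015-12-13' AS dt, 'CHINA' AS country
FROM page_view;

-- after bulk loads, gather statistics for the planner (cf. the NDV note in section I)
COMPUTE STATS page_view_parquet;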