The component layout is as follows:
172.16.57.75 bd-ops-test-75 mysql-server
172.16.57.77 bd-ops-test-77 HiveServer2 HiveMetaStore
Install Hive on node 77:
# yum install hive hive-metastore hive-server2 hive-jdbc hive-hbase -y
On the other nodes you can install the client:
# yum install hive hive-server2 hive-jdbc hive-hbase -y
Install MySQL via yum:
# yum install mysql mysql-devel mysql-server mysql-libs -y
Configure the database to start on boot and start it:
# chkconfig mysqld on
# service mysqld start
Install the JDBC driver:
# yum install mysql-connector-java
# ln -s /usr/share/java/mysql-connector-java.jar /usr/lib/hive/lib/mysql-connector-java.jar
Set the initial MySQL root password to bigdata:
# mysqladmin -uroot password 'bigdata'
Log in to the database and run the following:
CREATE DATABASE metastore;
USE metastore;
SOURCE /usr/lib/hive/scripts/metastore/upgrade/mysql/hive-schema-1.1.0.mysql.sql;
CREATE USER 'hive'@'localhost' IDENTIFIED BY 'hive';
GRANT ALL PRIVILEGES ON metastore.* TO 'hive'@'localhost';
GRANT ALL PRIVILEGES ON metastore.* TO 'hive'@'%';
FLUSH PRIVILEGES;
Note: the user created here is hive with password hive; change these to suit your needs.
Modify the following in the hive-site.xml file:
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://172.16.57.75:3306/metastore?useUnicode=true&amp;characterEncoding=UTF-8</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
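Before starting the metastore, it is worth checking that the hive user can actually reach MySQL on 172.16.57.75 with these credentials. A minimal check from node 77, assuming the mysql client is installed there:

# mysql -h 172.16.57.75 -uhive -phive metastore -e 'SHOW TABLES;'

If the grants above are in place, this lists the metastore schema tables created by the SOURCE script.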
Modify /etc/hadoop/conf/hadoop-env.sh and add the HADOOP_MAPRED_HOME environment variable; without it, you will hit an UNKNOWN RPC TYPE exception when running MapReduce on YARN:
export HADOOP_MAPRED_HOME=/usr/lib/hadoop-mapreduce
Create the Hive data warehouse directory /user/hive/warehouse in HDFS. It is recommended to set its permissions to 1777 so that all users can create and access tables, but cannot delete tables that do not belong to them. (User home directories live under /user, e.g. /user/root for the root user.) /tmp must be world-writable. Create the directories and set permissions:
# sudo -u hdfs hadoop fs -mkdir /user/hive
# sudo -u hdfs hadoop fs -chown hive /user/hive
# sudo -u hdfs hadoop fs -mkdir /user/hive/warehouse
# sudo -u hdfs hadoop fs -chmod 1777 /user/hive/warehouse
# sudo -u hdfs hadoop fs -chown hive /user/hive/warehouse
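To confirm that the ownership and the sticky bit took effect, list the directory; the warehouse entry should show owner hive and mode drwxrwxrwt, which corresponds to 1777:

# sudo -u hdfs hadoop fs -ls /user/hive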
Modify hive-env.sh to set the JDK environment variable:
# vim /etc/hive/conf/hive-env.sh
export JAVA_HOME=/opt/programs/jdk1.7.0_67
Start hive-metastore and hive-server2:
# service hive-metastore start
# service hive-server2 start
Test that Hive works:
$ hive -e 'create table t(id int);'
$ hive -e 'select * from t limit 2;'
$ hive -e 'select id from t;'
Access Beeline:
$ beeline
beeline> !connect jdbc:hive2://localhost:10000;
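Beeline can also be used non-interactively. A minimal sketch, assuming HiveServer2 is listening on the default port 10000 with no authentication configured:

$ beeline -u jdbc:hive2://localhost:10000 -e 'show tables;'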
To integrate Hive with HBase, first install hive-hbase:
# yum install hive-hbase -y
If you are using CDH 4, run the following commands in the Hive shell to add the jars:
$ ADD JAR /usr/lib/hive/lib/zookeeper.jar;
$ ADD JAR /usr/lib/hive/lib/hbase.jar;
$ ADD JAR /usr/lib/hive/lib/hive-hbase-handler-<hive_version>.jar;
# The guava jar version depends on what is actually installed.
$ ADD JAR /usr/lib/hive/lib/guava-11.0.2.jar;
If you are using CDH 5, run the following commands in the Hive shell to add the jars:
ADD JAR /usr/lib/hive/lib/zookeeper.jar;
ADD JAR /usr/lib/hive/lib/hive-hbase-handler.jar;
ADD JAR /usr/lib/hbase/lib/guava-12.0.1.jar;
ADD JAR /usr/lib/hbase/hbase-client.jar;
ADD JAR /usr/lib/hbase/hbase-common.jar;
ADD JAR /usr/lib/hbase/hbase-hadoop-compat.jar;
ADD JAR /usr/lib/hbase/hbase-hadoop2-compat.jar;
ADD JAR /usr/lib/hbase/hbase-protocol.jar;
ADD JAR /usr/lib/hbase/hbase-server.jar;
Instead of ADD JAR, you can configure these jars with the hive.aux.jars.path property in hive-site.xml, or set export HIVE_AUX_JARS_PATH= in hive-env.sh.
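As an illustration, a hive-env.sh sketch for the CDH 5 case; HIVE_AUX_JARS_PATH takes a comma-separated jar list here, and the exact paths and versions should match your installation:

export HIVE_AUX_JARS_PATH=/usr/lib/hive/lib/hive-hbase-handler.jar,/usr/lib/hbase/hbase-client.jar,/usr/lib/hbase/hbase-common.jar,/usr/lib/hbase/hbase-server.jar,/usr/lib/hbase/lib/guava-12.0.1.jar

Once the jars are configured, a quick way to verify the integration is to create an HBase-backed table; the table and column names below are only examples:

$ hive -e "CREATE TABLE hbase_test(key int, value string) STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,cf:val');"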
Like Hive, Impala can interact directly with HDFS and HBase. The difference is that Hive and other MapReduce-based frameworks are suited to long-running batch jobs, such as batch extract-transform-load (ETL) jobs, whereas Impala is mainly used for real-time queries.
The component layout is as follows:
172.16.57.74 bd-ops-test-74 impala-state-store impala-catalog impala-server
172.16.57.75 bd-ops-test-75 impala-server
172.16.57.76 bd-ops-test-76 impala-server
172.16.57.77 bd-ops-test-77 impala-server
Install on node 74:
yum install impala-state-store impala-catalog impala-server -y
Install on nodes 75, 76, and 77:
yum install impala-server -y
Check the installation paths:
# find / -name impala
/var/run/impala
/var/lib/alternatives/impala
/var/log/impala
/usr/lib/impala
/etc/alternatives/impala
/etc/default/impala
/etc/impala
The impalad configuration directory is specified by the IMPALA_CONF_DIR environment variable and defaults to /usr/lib/impala/conf. Impala's default settings are in /etc/default/impala; modify IMPALA_CATALOG_SERVICE_HOST and IMPALA_STATE_STORE_HOST in that file:
IMPALA_CATALOG_SERVICE_HOST=bd-ops-test-74
IMPALA_STATE_STORE_HOST=bd-ops-test-74
IMPALA_STATE_STORE_PORT=24000
IMPALA_BACKEND_PORT=22000
IMPALA_LOG_DIR=/var/log/impala

IMPALA_CATALOG_ARGS=" -log_dir=${IMPALA_LOG_DIR} -sentry_config=/etc/impala/conf/sentry-site.xml"
IMPALA_STATE_STORE_ARGS=" -log_dir=${IMPALA_LOG_DIR} -state_store_port=${IMPALA_STATE_STORE_PORT}"
IMPALA_SERVER_ARGS=" \
    -log_dir=${IMPALA_LOG_DIR} \
    -use_local_tz_for_unix_timestamp_conversions=true \
    -convert_legacy_hive_parquet_utc_timestamps=true \
    -catalog_service_host=${IMPALA_CATALOG_SERVICE_HOST} \
    -state_store_port=${IMPALA_STATE_STORE_PORT} \
    -use_statestore \
    -state_store_host=${IMPALA_STATE_STORE_HOST} \
    -be_port=${IMPALA_BACKEND_PORT} \
    -server_name=server1 \
    -sentry_config=/etc/impala/conf/sentry-site.xml"

ENABLE_CORE_DUMPS=false

# LIBHDFS_OPTS=-Djava.library.path=/usr/lib/impala/lib
# MYSQL_CONNECTOR_JAR=/usr/share/java/mysql-connector-java.jar
# IMPALA_BIN=/usr/lib/impala/sbin
# IMPALA_HOME=/usr/lib/impala
# HIVE_HOME=/usr/lib/hive
# HBASE_HOME=/usr/lib/hbase
# IMPALA_CONF_DIR=/etc/impala/conf
# HADOOP_CONF_DIR=/etc/impala/conf
# HIVE_CONF_DIR=/etc/impala/conf
# HBASE_CONF_DIR=/etc/impala/conf
To set the maximum amount of memory Impala may use, append -mem_limit=70% to the IMPALA_SERVER_ARGS value above.
If you need to set the maximum number of requests per queue, append -default_pool_max_requests=-1 to IMPALA_SERVER_ARGS as well; this parameter limits the number of requests per queue, and -1 means no limit. A sketch of the result is shown below.
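With both options appended, the tail of the IMPALA_SERVER_ARGS value in /etc/default/impala would look roughly like this (earlier flags elided and unchanged):

IMPALA_SERVER_ARGS=" \
    ... \
    -server_name=server1 \
    -sentry_config=/etc/impala/conf/sentry-site.xml \
    -mem_limit=70% \
    -default_pool_max_requests=-1"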
On node 74, create symlinks for hive-site.xml, core-site.xml, and hdfs-site.xml in the /etc/impala/conf directory, then add the following to hdfs-site.xml:
<property>
  <name>dfs.client.read.shortcircuit</name>
  <value>true</value>
</property>
<property>
  <name>dfs.domain.socket.path</name>
  <value>/var/run/hadoop-hdfs/dn._PORT</value>
</property>
<property>
  <name>dfs.datanode.hdfs-blocks-metadata.enabled</name>
  <value>true</value>
</property>
Sync the files above to the other nodes.
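How you sync them is up to you; a minimal scp sketch from node 74, using the hostnames from the layout above:

# for h in bd-ops-test-75 bd-ops-test-76 bd-ops-test-77; do scp /etc/impala/conf/{hive-site.xml,core-site.xml,hdfs-site.xml} $h:/etc/impala/conf/; done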
Create /var/run/hadoop-hdfs on every node:
# mkdir -p /var/run/hadoop-hdfs
The Impala installation creates a user and group named impala; do not delete them.
If you want Impala to work with YARN and Llama, add the impala user to the hdfs group.
When Impala executes DROP TABLE it needs to move files into the HDFS trash, so you need to create an HDFS directory /user/impala and make it writable by the impala user. Likewise, Impala needs to read data under the Hive warehouse, so the impala user must be added to the hive group (see the sketch below).
Impala cannot run as the root user, because root is not allowed to perform direct reads.
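A sketch of the corresponding group changes; this is only needed if the impala user is not already in these groups (compare the groups check further below):

# usermod -a -G hdfs,hive impala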
Create the impala user's home directory and set its ownership:
sudo -u hdfs hadoop fs -mkdir /user/impala
sudo -u hdfs hadoop fs -chown impala /user/impala
Check which groups the impala user belongs to:
# groups impala
impala : impala hadoop hdfs hive
As shown above, the impala user belongs to the impala, hadoop, hdfs, and hive groups.
Start the services on node 74:
# service impala-state-store start
# service impala-catalog start
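The impala-server daemons installed on nodes 75, 76, and 77 would presumably be started the same way on each of those nodes:

# service impala-server start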
Use impala-shell to start the Impala shell, connect to node 74, and refresh the metadata:
# impala-shell
Starting Impala Shell without Kerberos authentication
Connected to bd-dev-hadoop-70:21000
Server version: impalad version 2.3.0-cdh5.5.1 RELEASE (build 73bf5bc5afbb47aa7eab06cfbf6023ba8cb74f3c)
***********************************************************************************
Welcome to the Impala shell. Copyright (c) 2015 Cloudera, Inc. All rights reserved.
(Impala Shell v2.3.0-cdh5.5.1 (73bf5bc) built on Wed Dec 2 10:39:33 PST 2015)

After running a query, type SUMMARY to see a summary of where time was spent.
***********************************************************************************
[bd-dev-hadoop-70:21000] > invalidate metadata;
After creating tables in Hive, run the INVALIDATE METADATA statement the first time you start impala-shell so that Impala recognizes the newly created tables (in Impala 1.2 and later, you only need to run INVALIDATE METADATA on one node, not on every Impala node).
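INVALIDATE METADATA can also be issued without entering the shell, using the -i and -q options listed in the help output below:

# impala-shell -i bd-ops-test-74:21000 -q 'invalidate metadata;'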
You can also pass other options; to see which options are available:
# impala-shell -h
Usage: impala_shell.py [options]

Options:
  -h, --help            show this help message and exit
  -i IMPALAD, --impalad=IMPALAD
                        <host:port> of impalad to connect to [default: bd-dev-hadoop-70:21000]
  -q QUERY, --query=QUERY
                        Execute a query without the shell [default: none]
  -f QUERY_FILE, --query_file=QUERY_FILE
                        Execute the queries in the query file, delimited by ; [default: none]
  -k, --kerberos        Connect to a kerberized impalad [default: False]
  -o OUTPUT_FILE, --output_file=OUTPUT_FILE
                        If set, query results are written to the given file. Results from multiple
                        semicolon-terminated queries will be appended to the same file [default: none]
  -B, --delimited       Output rows in delimited mode [default: False]
  --print_header        Print column names in delimited mode when pretty-printed. [default: False]
  --output_delimiter=OUTPUT_DELIMITER
                        Field delimiter to use for output in delimited mode [default: \t]
  -s KERBEROS_SERVICE_NAME, --kerberos_service_name=KERBEROS_SERVICE_NAME
                        Service name of a kerberized impalad [default: impala]
  -V, --verbose         Verbose output [default: True]
  -p, --show_profiles   Always display query profiles after execution [default: False]
  --quiet               Disable verbose output [default: False]
  -v, --version         Print version information [default: False]
  -c, --ignore_query_failure
                        Continue on query failure [default: False]
  -r, --refresh_after_connect
                        Refresh Impala catalog after connecting [default: False]
  -d DEFAULT_DB, --database=DEFAULT_DB
                        Issues a use database command on startup [default: none]
  -l, --ldap            Use LDAP to authenticate with Impala. Impala must be configured to allow
                        LDAP authentication. [default: False]
  -u USER, --user=USER  User to authenticate with. [default: root]
  --ssl                 Connect to Impala via SSL-secured connection [default: False]
  --ca_cert=CA_CERT     Full path to certificate file used to authenticate Impala's SSL certificate.
                        May either be a copy of Impala's certificate (for self-signed certs) or the
                        certificate of a trusted third-party CA. If not set, but SSL is enabled, the
                        shell will NOT verify Impala's server certificate [default: none]
  --config_file=CONFIG_FILE
                        Specify the configuration file to load options. File must have case-sensitive
                        '[impala]' header. Specifying this option within a config file will have no
                        effect. Only specify this as a option in the commandline. [default: /root/.impalarc]
  --live_summary        Print a query summary every 1s while the query is running. [default: False]
  --live_progress       Print a query progress every 1s while the query is running. [default: False]
  --auth_creds_ok_in_clear
                        If set, LDAP authentication may be used with an insecure connection to Impala.
                        WARNING: Authentication credentials will therefore be sent unencrypted, and
                        may be vulnerable to attack. [default: none]
Export data with Impala:
impala-shell -i '172.16.57.74:21000' -r -q "select * from test" -B --output_delimiter="\t" -o result.txt