Hive on Spark

1. Versions

Note: Hive on Spark is strict about version compatibility. The versions below have been verified to work together:

a) apache-hive-2.3.2-bin.tar.gz

b) hadoop-2.7.2.tar.gz

c) jdk-8u144-linux-x64.tar.gz

d) mysql-5.7.19-1.el7.x86_64.rpm-bundle.tar

e) mysql-connector-java-5.1.43-bin.jar

f) spark-2.0.0.tgz (the Spark source package; it must be built from source)

g) Red Hat Linux 7.4, 64-bit

2. Install Linux and the JDK, and disable the firewall
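The original gives no commands for this step; a minimal sketch for RHEL 7 follows, reusing the JDK path that appears later in this document and assuming systemd manages the firewall:

tar -zxvf jdk-8u144-linux-x64.tar.gz -C /root/training/
export JAVA_HOME=/root/training/jdk1.8.0_144   # add to ~/.bash_profile
export PATH=$JAVA_HOME/bin:$PATH
systemctl stop firewalld
systemctl disable firewalld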

 

3. Install and configure the MySQL database

a) Unpack the MySQL installation bundle

b) Install MySQL

yum remove mysql-libs

rpm -ivh mysql-community-common-5.7.19-1.el7.x86_64.rpm

rpm -ivh mysql-community-libs-5.7.19-1.el7.x86_64.rpm

rpm -ivh mysql-community-client-5.7.19-1.el7.x86_64.rpm

rpm -ivh mysql-community-server-5.7.19-1.el7.x86_64.rpm

rpm -ivh mysql-community-devel-5.7.19-1.el7.x86_64.rpm  (optional)

 

c) Start MySQL

systemctl start mysqld.service

 

d) Look up and change the root user's password

Find the initial root password: cat /var/log/mysqld.log | grep password

Log in, then change the password: alter user 'root'@'localhost' identified by 'Welcome_1';
 

e) Create the hive database and the hiveowner user:

  • Create a new database: create database hive;
  • Create a new user:
    create user 'hiveowner'@'%' identified by 'Welcome_1';
  • Grant privileges to the user:

    grant all on hive.* TO 'hiveowner'@'%';

    grant all on hive.* TO 'hiveowner'@'localhost' identified by 'Welcome_1';
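As a quick check (an optional step not in the original), log in as the new user and confirm that the hive database is visible:

mysql -u hiveowner -pWelcome_1
mysql> show databases;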

4. Install Hadoop (pseudo-distributed mode as an example)

Because Hive on Spark uses the Spark on YARN mode by default, Hadoop has to be configured first.

a) Preparation:

  1. Configure the hostname (edit the /etc/hosts file)
  2. Set up passwordless SSH login (see the sketch below)
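A common way to set up passwordless SSH, using the hostname from the Hadoop configuration below:

ssh-keygen -t rsa
ssh-copy-id -i ~/.ssh/id_rsa.pub root@hive77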

b) Set the Hadoop configuration files as follows:

hadoop-env.sh
  • JAVA_HOME = /root/training/jdk1.8.0_144

hdfs-site.xml
  • dfs.replication = 1 (block replication factor; the default is 3)
  • dfs.permissions = false (whether HDFS permission checking is enabled)

core-site.xml
  • fs.defaultFS = hdfs://hive77:9000 (address of the NameNode)
  • hadoop.tmp.dir = /root/training/hadoop-2.7.2/tmp/ (directory where HDFS data is kept)

mapred-site.xml
  • mapreduce.framework.name = yarn

yarn-site.xml
  • yarn.resourcemanager.hostname = hive77
  • yarn.nodemanager.aux-services = mapreduce_shuffle
  • yarn.resourcemanager.scheduler.class = org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler
    (Spark on YARN requires the fair scheduler so that every job in the YARN cluster gets an equal share of resources to run.)
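In yarn-site.xml the scheduler setting is written as an ordinary property entry, in the same form as the snippet shown later in this document:

<property>
   <name>yarn.resourcemanager.scheduler.class</name>
   <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
</property>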

 

c) Start Hadoop

d) Check in the YARN web console that the fair scheduler is in effect

5. Build Spark from source

(Maven is required; the Spark source package ships with its own copy of Maven.)

a) Run the following command to build it (the build takes a long time, so be patient):

./dev/make-distribution.sh --name "hadoop2-without-hive" --tgz "-Pyarn,hadoop-provided,hadoop-2.7,parquet-provided"

 

b) A successful build produces spark-2.0.0-bin-hadoop2-without-hive.tgz

c) Install and configure Spark

  1. The directory layout is as follows:

  2. Add the following settings to spark-env.sh:

export JAVA_HOME=/root/training/jdk1.8.0_144

export HADOOP_CONF_DIR=/root/training/hadoop-2.7.2/etc/hadoop

export YARN_CONF_DIR=/root/training/hadoop-2.7.2/etc/hadoop

export SPARK_MASTER_HOST=hive77

export SPARK_MASTER_PORT=7077

export SPARK_EXECUTOR_MEMORY=512m

export SPARK_DRIVER_MEMORY=512m

export SPARK_WORKER_MEMORY=512m

  3. Copy the relevant Hadoop jars into Spark's jars directory:

cp ~/training/hadoop-2.7.2/share/hadoop/common/*.jar jars/

cp ~/training/hadoop-2.7.2/share/hadoop/common/lib/*.jar jars/

cp ~/training/hadoop-2.7.2/share/hadoop/hdfs/*.jar jars/

cp ~/training/hadoop-2.7.2/share/hadoop/hdfs/lib/*.jar jars/

cp ~/training/hadoop-2.7.2/share/hadoop/mapreduce/*.jar jars/

cp ~/training/hadoop-2.7.2/share/hadoop/mapreduce/lib/*.jar jars/

cp ~/training/hadoop-2.7.2/share/hadoop/yarn/*.jar jars/

cp ~/training/hadoop-2.7.2/share/hadoop/yarn/lib/*.jar jars/

  4. Create a directory named spark-jars on HDFS and upload Spark's jars into it, so the jars do not have to be distributed every time an application runs (see the note after the commands below).

  • hdfs dfs -mkdir /spark-jars
  • hdfs dfs -put jars/*.jar /spark-jars
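The original steps do not show how Spark is told to use this HDFS directory. In Spark 2.x this is normally done through the spark.yarn.jars property, for example in conf/spark-defaults.conf (NameNode address taken from core-site.xml above; treat this as a sketch):

spark.yarn.jars hdfs://hive77:9000/spark-jars/*.jar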

d) Start Spark with sbin/start-all.sh and verify that Spark is configured correctly
 

6. Install and configure Hive

a) Unpack the Hive installation package and put the MySQL JDBC driver into Hive's lib directory, for example:
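A possible command, assuming the driver jar from the version list above sits in the current directory and Hive is installed at the path used in the next step:

cp mysql-connector-java-5.1.43-bin.jar /root/training/apache-hive-2.3.2-bin/lib/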

 

b) Set Hive's environment variables:

HIVE_HOME=/root/training/apache-hive-2.3.2-bin

export HIVE_HOME

PATH=$HIVE_HOME/bin:$PATH

export PATH

c) Copy the following Spark jars into Hive's lib directory (commands are sketched after this list):

  1. scala-library
  2. spark-core
  3. spark-network-common
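A possible set of copy commands; the exact jar versions inside Spark's jars directory may differ, hence the wildcards:

cp /root/training/spark-2.0.0-bin-hadoop2-without-hive/jars/scala-library*.jar /root/training/apache-hive-2.3.2-bin/lib/
cp /root/training/spark-2.0.0-bin-hadoop2-without-hive/jars/spark-core_*.jar /root/training/apache-hive-2.3.2-bin/lib/
cp /root/training/spark-2.0.0-bin-hadoop2-without-hive/jars/spark-network-common_*.jar /root/training/apache-hive-2.3.2-bin/lib/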

d) Create a directory /sparkeventlog on HDFS to hold log information:

  hdfs dfs -mkdir /sparkeventlog

 

e) Configure hive-site.xml with the following parameters (reference values shown):

  • javax.jdo.option.ConnectionURL = jdbc:mysql://localhost:3306/hive?useSSL=false
  • javax.jdo.option.ConnectionDriverName = com.mysql.jdbc.Driver
  • javax.jdo.option.ConnectionUserName = hiveowner
  • javax.jdo.option.ConnectionPassword = Welcome_1
  • hive.execution.engine = spark
  • hive.enable.spark.execution.engine = true
  • spark.home = /root/training/spark-2.0.0-bin-hadoop2-without-hive
  • spark.master = yarn-client
  • spark.eventLog.enabled = true
  • spark.eventLog.dir = hdfs://hive77:9000/sparkeventlog
  • spark.serializer = org.apache.spark.serializer.KryoSerializer
  • spark.executor.memory = 512m
  • spark.driver.memory = 512m
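Each of these values goes into hive-site.xml as an ordinary property entry. For example, the metastore connection URL and the execution engine (values copied from the list above):

<property>
   <name>javax.jdo.option.ConnectionURL</name>
   <value>jdbc:mysql://localhost:3306/hive?useSSL=false</value>
</property>
<property>
   <name>hive.execution.engine</name>
   <value>spark</value>
</property>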

 

f) Initialize the MySQL metastore: schematool -dbType mysql -initSchema

g) Start the Hive shell and create an employee table to hold the employee data (a possible DDL is sketched below)
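The original does not show the DDL. The sketch below matches the emp.csv load and the sal column used in the query later on; the column names and types are assumptions based on the classic emp table:

create table emp1
(empno int, ename string, job string, mgr int, hiredate string, sal int, comm int, deptno int)
row format delimited fields terminated by ',';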
 

h) Load the emp.csv file:

  load data local inpath '/root/temp/emp.csv' into table emp1;

 

i) Run a query that sorts employees by salary (this fails at this point):

  select * from emp1 order by sal;

j) Check the YARN web console

The error is caused by the way YARN computes virtual memory. In yarn-site.xml, set yarn.nodemanager.vmem-check-enabled to false to disable the virtual-memory check:

<property>

   <name>yarn.nodemanager.vmem-check-enabled</name>

   <value>false</value>

</property>

 

 

k) Restart Hadoop, Spark, and Hive, then run the query again

 

 

One final note: because Spark on YARN has been configured, there is no need to start the Spark standalone cluster when running Hive; at that point everything is managed by YARN.
