Hadoop Distributed Cluster Deployment

I. System Parameter Tuning

 

1. Kernel parameter tuning

Edit /etc/sysctl.conf, append the settings below, then run sysctl -p to apply them. (Note: net.ipv4.tcp_tw_recycle is known to break connections for clients behind NAT and was removed entirely in Linux 4.12; drop it on newer kernels.)

net.ipv4.conf.all.arp_notify = 1

kernel.shmmax = 500000000

kernel.shmmni = 4096

kernel.shmall = 4000000000

kernel.sem = 250 512000 100 2048

kernel.sysrq = 1

kernel.core_uses_pid = 1

kernel.msgmnb = 65536

kernel.msgmax = 65536

kernel.msgmni = 2048

net.ipv4.tcp_syncookies = 1

net.ipv4.ip_forward = 0

net.ipv4.conf.default.accept_source_route = 0

net.ipv4.tcp_tw_recycle = 1

net.ipv4.tcp_max_syn_backlog = 4096

net.ipv4.conf.all.arp_filter = 1

net.ipv4.ip_local_port_range = 1025 65535

net.core.netdev_max_backlog = 10000

net.core.rmem_max = 2097152

net.core.wmem_max = 2097152

vm.overcommit_memory = 2

2. Raise Linux resource limits

Append the following to /etc/security/limits.conf:

* soft nofile 65536

* hard nofile 65536

* soft nproc 131072

* hard nproc 131072
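After the next login, the new limits can be verified from a shell; a minimal check (the expected numbers match the limits.conf entries above):

```shell
# Print the per-process limits of the current shell session.
# After re-login with the limits.conf entries above, expect 65536 / 131072.
nofile=$(ulimit -n)   # max open file descriptors
nproc=$(ulimit -u)    # max user processes
echo "nofile=$nofile nproc=$nproc"
```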

3. Disk I/O scheduler tuning

The Linux disk I/O scheduler supports several policies. The default is CFQ; deadline is recommended for Hadoop workloads.

The disk here is sda; substitute the device names of your own disks:

#echo deadline > /sys/block/sda/queue/scheduler

To make this persist across reboots, add the command to /etc/rc.local.

After completing the three steps above, reboot the system for all settings to take effect.

 

II. Pre-installation Environment Setup

1. Deployment inventory

192.168.10.91  hadoop-nn     (master)   NameNode, SecondaryNameNode, DataNode, ResourceManager, NodeManager

192.168.10.92  hadoop-snn    (slave01)  DataNode, NodeManager

192.168.10.93  hadoop-dn-01  (slave02)  DataNode, NodeManager

2. Set hostnames

192.168.10.91:

#hostname hadoop-nn

#echo "hostname hadoop-nn" >> /etc/rc.local

192.168.10.92:

#hostname hadoop-snn

#echo "hostname hadoop-snn" >> /etc/rc.local

192.168.10.93:

#hostname hadoop-dn-01

#echo "hostname hadoop-dn-01" >> /etc/rc.local

3. Disable the firewall (all nodes)

#systemctl stop firewalld

#systemctl disable firewalld

4. Disable SELinux (all nodes)

#setenforce 0

#sed -i 's/SELINUX=enforcing/SELINUX=disabled/g' /etc/selinux/config

5. Add the following entries to /etc/hosts (all nodes)

192.168.10.91 hadoop-nn master

192.168.10.92 hadoop-snn slave01

192.168.10.93 hadoop-dn-01 slave02

6. NTP time synchronization

Install an NTP server on the Hadoop NameNode, then have every other node synchronize its clock from that node.

192.168.10.91:

#yum -y install ntp

#systemctl start ntpd

#systemctl enable ntpd

Then synchronize the time on the other nodes.

192.168.10.92 and 192.168.10.93:

#yum -y install ntp

#ntpdate hadoop-nn

Also add a scheduled job for this: Hadoop requires the clocks of all nodes to stay consistent, so keep them synchronized.
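For example, a cron entry like the following on the two slave nodes keeps them in sync (the 30-minute interval is an arbitrary choice, not part of the original setup):

```shell
# /etc/crontab on hadoop-snn and hadoop-dn-01:
# re-sync the clock from the NameNode's ntpd every 30 minutes
*/30 * * * * root /usr/sbin/ntpdate hadoop-nn >/dev/null 2>&1
```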

 

III. Deploying Hadoop

1. Install Java (all nodes)

#yum -y install java java-devel

Check the Java version and make sure this command works:

#java -version

openjdk version "1.8.0_161"

OpenJDK Runtime Environment (build 1.8.0_161-b14)

OpenJDK 64-Bit Server VM (build 25.161-b14, mixed mode)

Note that installing OpenJDK does not set the JAVA_HOME environment variable by default. To locate the installation directory, run:

#update-alternatives --config jre_openjdk

There is 1 program that provides 'jre_openjdk'.

  Selection    Command

-----------------------------------------------

*+ 1          java-1.8.0-openjdk.x86_64 (/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.161-0.b14.el7_4.x86_64/jre)

The default JRE directory is /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.161-0.b14.el7_4.x86_64/jre.

Set the environment variables by editing /etc/profile.d/java.sh:

#!/bin/bash

#

export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.161-0.b14.el7_4.x86_64

export CLASSPATH=$JAVA_HOME/lib/rt.jar:$JAVA_HOME/../lib/dt.jar:$JAVA_HOME/../lib/tools.jar

export PATH=$PATH:$JAVA_HOME/bin

After this, log in again or source the profile file so that the variables take effect; you can also run the script by hand for immediate effect:

#source /etc/profile.d/java.sh

2. Create the hadoop user (all nodes)

#useradd hadoop

#passwd hadoop

Set a password; for simplicity, give the hadoop user the same password on all three machines, e.g. 123456. For convenience, add hadoop to the root group:

#usermod -g root hadoop

Afterwards hadoop belongs to the root group; verify with id hadoop, which should print something like:

#id hadoop

uid=1002(hadoop) gid=0(root) groups=0(root)

3. Create SSH keys on the NameNode

192.168.10.91:

Create an RSA key pair:

#su - hadoop

$ssh-keygen

From the NameNode, copy the public key to the hadoop user's home on every node, including itself:

$ssh-copy-id hadoop@192.168.10.91

$ssh-copy-id hadoop@192.168.10.92

$ssh-copy-id hadoop@192.168.10.93
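To confirm that passwordless login works to every node, a quick check from the NameNode (a sketch; BatchMode makes ssh fail fast instead of prompting when a key is missing):

```shell
# Verify passwordless login from the NameNode to every node (including itself).
# Prints each node's hostname on success, or a warning on failure.
for h in hadoop-nn hadoop-snn hadoop-dn-01; do
    ssh -o BatchMode=yes -o ConnectTimeout=5 hadoop@"$h" hostname \
        || echo "passwordless ssh to $h is NOT working"
done
```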

4. Extract the Hadoop binary tarball and set environment variables (all nodes)

#wget http://mirror.bit.edu.cn/apache/hadoop/common/hadoop-2.8.4/hadoop-2.8.4.tar.gz

#tar xf hadoop-2.8.4.tar.gz -C /usr/local/

#ln -sv /usr/local/hadoop-2.8.4/ /usr/local/hadoop

Edit the environment file /etc/profile.d/hadoop.sh and define variables like the following to set up Hadoop's runtime environment:

#!/bin/bash

#

export HADOOP_PREFIX="/usr/local/hadoop"

export PATH=$PATH:$HADOOP_PREFIX/bin:$HADOOP_PREFIX/sbin

export HADOOP_COMMON_HOME=${HADOOP_PREFIX}

export HADOOP_HDFS_HOME=${HADOOP_PREFIX}

export HADOOP_MAPRED_HOME=${HADOOP_PREFIX}

export HADOOP_YARN_HOME=${HADOOP_PREFIX}

Create the data and log directories:

#mkdir -pv /data/hadoop/hdfs/{nn,snn,dn}

#chown -R hadoop:hadoop /data/hadoop/hdfs

#mkdir -pv /var/log/hadoop/yarn

#chown -R hadoop:hadoop /var/log/hadoop/yarn

Then create a logs directory inside the Hadoop installation directory and change the owner and group of all Hadoop files:

#cd /usr/local/hadoop

#mkdir logs

#chmod g+w logs

#chown -R hadoop:hadoop ./*

 

IV. Configuring All Hadoop Nodes

1. The hadoop-nn node

The following files need to be configured.

core-site.xml

The core-site.xml file contains the NameNode host address and the RPC port it listens on (the NameNode uses RPC port 8020 by default). In a distributed environment every node needs this NameNode address configured. A minimal configuration looks like this:

#su - hadoop

$vim /usr/local/hadoop/etc/hadoop/core-site.xml

Add the following configuration:

<configuration>

    <property>

        <name>fs.defaultFS</name>

        <value>hdfs://master:8020</value>

        <final>true</final>

    </property>

</configuration>

hdfs-site.xml

hdfs-site.xml configures HDFS-related properties such as the replication factor (the number of replicas per data block) and the directories the NN and DN use to store data. The replication factor for a distributed Hadoop cluster would normally be 3; here it is set to 2 to save disk space. The NN and DN storage directories are the paths created for them in the earlier steps, and the directory created for the SNN is enabled here as well. (fs.checkpoint.dir and fs.checkpoint.edits.dir are the older property names; in Hadoop 2.x they correspond to dfs.namenode.checkpoint.dir and dfs.namenode.checkpoint.edits.dir, but the deprecated names still work.)

#su - hadoop

$vim /usr/local/hadoop/etc/hadoop/hdfs-site.xml

<configuration>

        <property>

                <name>dfs.replication</name>

                <value>2</value>    

        </property>

        <property>

                <name>dfs.namenode.name.dir</name>

                <value>file:///data/hadoop/hdfs/nn</value>

        </property>

        <property>

                <name>dfs.datanode.data.dir</name>

                <value>file:///data/hadoop/hdfs/dn</value>

        </property>

       <property>

                <name>fs.checkpoint.dir</name>

                <value>file:///data/hadoop/hdfs/snn</value>

        </property>

        <property>

                <name>fs.checkpoint.edits.dir</name>

                <value>file:///data/hadoop/hdfs/snn</value>

        </property>

        <property>

                <name>dfs.permissions</name>

                <value>false</value>

        </property>

</configuration>

mapred-site.xml

mapred-site.xml configures the cluster's MapReduce framework; here it should be set to yarn (the other possible values are local and classic). mapred-site.xml does not exist by default, but there is a template file mapred-site.xml.template; simply copy it to mapred-site.xml.

#su - hadoop

$cp -fr /usr/local/hadoop/etc/hadoop/mapred-site.xml.template /usr/local/hadoop/etc/hadoop/mapred-site.xml

$vim /usr/local/hadoop/etc/hadoop/mapred-site.xml

<configuration>

        <property>

                <name>mapreduce.framework.name</name>

                <value>yarn</value>

        </property>

</configuration>

yarn-site.xml

yarn-site.xml configures the YARN daemons and related properties. First, specify the host and ports the ResourceManager listens on (here the ResourceManager will run on the NameNode host); then specify the scheduler the ResourceManager uses and the NodeManager's auxiliary services. A minimal example:

$vim /usr/local/hadoop/etc/hadoop/yarn-site.xml

<configuration>

    <property>

        <name>yarn.resourcemanager.address</name>

        <value>master:8032</value>

    </property>

    <property>

        <name>yarn.resourcemanager.scheduler.address</name>

        <value>master:8030</value>

    </property>

    <property>

        <name>yarn.resourcemanager.resource-tracker.address</name>

        <value>master:8031</value>

    </property>

    <property>

        <name>yarn.resourcemanager.admin.address</name>

        <value>master:8033</value>

    </property>

    <property>

        <name>yarn.resourcemanager.webapp.address</name>

        <value>master:8088</value>

    </property>

    <property>
        <name>yarn.nodemanager.aux-services</name>

        <value>mapreduce_shuffle</value>

    </property>

    <property>

        <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>

        <value>org.apache.hadoop.mapred.ShuffleHandler</value>

    </property>

    <property>

        <name>yarn.resourcemanager.scheduler.class</name>

        <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>

    </property>

</configuration>

hadoop-env.sh and yarn-env.sh

The Hadoop daemons depend on the JAVA_HOME environment variable. If JAVA_HOME is already defined globally, as in the /etc/profile.d/java.sh step above, it works as-is; to pin Hadoop to a specific Java installation, edit these two scripts, uncomment their JAVA_HOME lines, and set a suitable value. In addition, most Hadoop daemons default to a 1 GB heap; in practice you may need to tune the heap sizes of the various processes by editing the corresponding variables in these files, such as HADOOP_HEAPSIZE and YARN_HEAPSIZE (and JAVA_HEAP_MAX in yarn-env.sh).

The slaves file

The slaves file lists all slave nodes of the current cluster; the default content is localhost. Here a DataNode will run on all three nodes, so all of them are added:

#su - hadoop

$vim /usr/local/hadoop/etc/hadoop/slaves

hadoop-nn

hadoop-snn

hadoop-dn-01

At this point the first node (the Master) is configured. In a Hadoop cluster, every node should carry the same configuration, and the earlier steps already created the hadoop user, the data directories, and the log directories on the slave nodes.

Next, simply sync the Master's configuration files to all slaves:

#su - hadoop

$scp /usr/local/hadoop/etc/hadoop/* hadoop@192.168.10.92:/usr/local/hadoop/etc/hadoop/

$scp /usr/local/hadoop/etc/hadoop/* hadoop@192.168.10.93:/usr/local/hadoop/etc/hadoop/

 

V. Formatting HDFS

Before the HDFS NameNode is started for the first time, the directory it uses to store data must be initialized. If the directory given by dfs.namenode.name.dir in hdfs-site.xml does not exist, the format command creates it; if it already exists, make sure its permissions are correct, in which case formatting wipes all data inside it and builds a fresh file system. Run the following as the hadoop user:

[hadoop@hadoop-nn ~]$hdfs namenode -format

This prints a large amount of output; a line like "INFO common.Storage: Storage directory /data/hadoop/hdfs/nn has been successfully formatted." indicates that formatting completed successfully.

 

VI. Starting the Hadoop Cluster

There are two ways to start a Hadoop cluster: start the required services on each node individually, or start the whole cluster from the NameNode (the recommended way).

1. Starting services individually

The Master node needs the HDFS NameNode, SecondaryNameNode, and DataNode services, plus the YARN ResourceManager service.

[hadoop@hadoop-nn ~]$hadoop-daemon.sh start namenode

[hadoop@hadoop-nn ~]$hadoop-daemon.sh start secondarynamenode

[hadoop@hadoop-nn ~]$hadoop-daemon.sh start datanode

[hadoop@hadoop-nn ~]$yarn-daemon.sh start resourcemanager

Each Slave node needs the HDFS DataNode service and the YARN NodeManager service.

[hadoop@hadoop-snn ~]$hadoop-daemon.sh start datanode

[hadoop@hadoop-snn ~]$yarn-daemon.sh start nodemanager

2. Starting the whole cluster

When the cluster is large, starting every service on every node is tedious and inefficient, so Hadoop provides start-dfs.sh and stop-dfs.sh to start and stop the whole HDFS cluster, and start-yarn.sh and stop-yarn.sh to start and stop the whole YARN cluster.

[hadoop@hadoop-nn ~]$start-dfs.sh

[hadoop@hadoop-nn ~]$start-yarn.sh

Earlier Hadoop versions provided start-all.sh and stop-all.sh to control HDFS and MapReduce together, but this is no longer recommended in Hadoop 2.0 and later.

2.1. Starting the HDFS cluster

[hadoop@hadoop-nn ~]$start-dfs.sh

Starting namenodes on [master]

master: starting namenode, logging to /usr/local/hadoop/logs/hadoop-hadoop-namenode-hadoop-nn.out

hadoop-dn-01: starting datanode, logging to /usr/local/hadoop/logs/hadoop-hadoop-datanode-hadoop-dn-01.out

hadoop-nn: starting datanode, logging to /usr/local/hadoop/logs/hadoop-hadoop-datanode-hadoop-nn.out

hadoop-snn: starting datanode, logging to /usr/local/hadoop/logs/hadoop-hadoop-datanode-hadoop-snn.out

Starting secondary namenodes [0.0.0.0]

0.0.0.0: starting secondarynamenode, logging to /usr/local/hadoop/logs/hadoop-hadoop-secondarynamenode-hadoop-nn.out

Once the HDFS cluster is up, verify on each node that the processes are running, e.g. with the jps command; the cluster state can also be checked through the Web UI.

Processes started on the NameNode node:

hadoop-nn:

[hadoop@hadoop-nn ~]$ jps

14576 NameNode

14887 SecondaryNameNode

14714 DataNode

15018 Jps

[hadoop@hadoop-nn ~]$ netstat -anplt | grep java

(Not all processes could be identified, non-owned process info

will not be shown, you would have to be root to see it all.)

tcp        0      0 0.0.0.0:50090           0.0.0.0:*               LISTEN      16468/java          

tcp        0      0 127.0.0.1:58545         0.0.0.0:*               LISTEN      16290/java          

tcp        0      0 10.10.0.186:8020        0.0.0.0:*               LISTEN      16146/java          

tcp        0      0 0.0.0.0:50070           0.0.0.0:*               LISTEN      16146/java          

tcp        0      0 0.0.0.0:50010           0.0.0.0:*               LISTEN      16290/java          

tcp        0      0 0.0.0.0:50075           0.0.0.0:*               LISTEN      16290/java          

tcp        0      0 0.0.0.0:50020           0.0.0.0:*               LISTEN      16290/java          

tcp        0      0 10.10.0.186:32565       10.10.0.186:8020        ESTABLISHED 16290/java          

tcp        0      0 10.10.0.186:8020        10.10.0.186:32565       ESTABLISHED 16146/java          

tcp        0      0 10.10.0.186:8020        10.10.0.188:11681       ESTABLISHED 16146/java          

tcp        0      0 10.10.0.186:8020        10.10.0.187:57112       ESTABLISHED 16146/java          

Processes started on the DataNode nodes:

hadoop-snn:

[hadoop@hadoop-snn ~]$ jps

741 DataNode

862 Jps

[hadoop@hadoop-snn ~]$ netstat -anplt | grep java

(Not all processes could be identified, non-owned process info

will not be shown, you would have to be root to see it all.)

tcp        0      0 0.0.0.0:50010           0.0.0.0:*               LISTEN      1042/java          

tcp        0      0 0.0.0.0:50075           0.0.0.0:*               LISTEN      1042/java          

tcp        0      0 127.0.0.1:18975         0.0.0.0:*               LISTEN      1042/java          

tcp        0      0 0.0.0.0:50020           0.0.0.0:*               LISTEN      1042/java          

tcp        0      0 10.10.0.187:57112       10.10.0.186:8020        ESTABLISHED 1042/java

hadoop-dn-01:

[hadoop@hadoop-dn-01 ~]$ jps

410 DataNode

539 Jps

From the jps output and the open ports you can see which ports the NameNode, SecondaryNameNode, and DataNode processes each listen on, and that the DataNodes have connected to the NameNode's port 8020. If a node fails to come up, the cause is usually wrong permissions or a missing directory; check that node's logs under /usr/local/hadoop/logs/*.log.

Access the Web UI through the NameNode at http://hadoop-nn:50070:

All three DataNode nodes should show as running.

At this point the HDFS cluster is ready and data can be stored in it. A quick demonstration with the HDFS commands:

Create a directory in the HDFS cluster:

[hadoop@hadoop-nn ~]$ hdfs dfs -mkdir /test

If this fails with: mkdir: Cannot create directory /test. Name node is in safe mode.

then run the following (or simply wait; the NameNode leaves safe mode automatically once enough blocks have been reported):

[hadoop@hadoop-nn ~]$ hadoop dfsadmin -safemode leave

Upload files to the HDFS cluster:

[hadoop@hadoop-nn ~]$ hdfs dfs -put /etc/fstab /test/fstab

[hadoop@hadoop-nn ~]$ hdfs dfs -put /etc/init.d/functions /test/functions

List the files in the HDFS cluster:

[hadoop@hadoop-nn ~]$ hdfs dfs -ls /test/

Found 2 items

-rw-r--r--   2 hadoop supergroup        524 2017-06-14 01:49 /test/fstab

-rw-r--r--   2 hadoop supergroup      13948 2017-06-14 01:50 /test/functions

Then look at the Hadoop Web UI again:

The Blocks column shows 4 blocks occupied across all nodes. HDFS defaults to a 128 MB block size in Hadoop 2.x (64 MB in 1.x); since the uploaded files are tiny, neither was split, and with the replication factor of 2 configured earlier each file is stored twice, giving 2 files × 2 replicas = 4 blocks.

HDFS cluster management commands
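A few commonly used HDFS management commands (standard hdfs subcommands; run on any node where the client is configured):

```shell
# overall capacity and per-DataNode usage
hdfs dfsadmin -report
# health check of a path: blocks, replication, corrupt blocks
hdfs fsck /test -files -blocks
# current safe-mode state (get / enter / leave)
hdfs dfsadmin -safemode get
```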

2.2. Starting the YARN cluster

[hadoop@hadoop-nn ~]$start-yarn.sh

starting yarn daemons

starting resourcemanager, logging to /usr/local/hadoop/logs/yarn-hadoop-resourcemanager-hadoop-nn.out

hadoop-nn: starting nodemanager, logging to /usr/local/hadoop/logs/yarn-hadoop-nodemanager-hadoop-nn.out

hadoop-dn-01: starting nodemanager, logging to /usr/local/hadoop/logs/yarn-hadoop-nodemanager-hadoop-dn-01.out

hadoop-snn: starting nodemanager, logging to /usr/local/hadoop/logs/yarn-hadoop-nodemanager-hadoop-snn.out

Once the YARN cluster is up, verify on each node that the processes are running, e.g. with the jps command.

hadoop-nn:

[hadoop@hadoop-nn logs]$jps

4120 DataNode

1898 NameNode

2474 SecondaryNameNode

2922 NodeManager

2701 ResourceManager

5646 Jps

hadoop-snn:

[hadoop@hadoop-snn ~]$ jps

10415 NodeManager

11251 Jps

9984 DataNode

hadoop-dn-01:

[hadoop@hadoop-dn-01 ~]$ jps

10626 NodeManager

10020 DataNode

11423 Jps

From the jps output and the open ports you can see that the ResourceManager and NodeManager processes have all started, and that a NodeManager runs on every DataNode node.

Access the Web UI through the ResourceManager at http://hadoop-nn:8088:

YARN cluster management commands

The yarn command has many subcommands, broadly divided into user commands and administrative commands. Running yarn with no arguments prints a short usage synopsis and a one-line description of each subcommand.

Of these, jar, application, node, logs, classpath, and version are the common user commands, while resourcemanager, nodemanager, proxyserver, rmadmin, and daemonlog are the more common administrative commands.
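A few of the user commands just mentioned, as run against the live ResourceManager (output shapes are indicative only):

```shell
# applications currently known to the ResourceManager
yarn application -list
# NodeManagers and their state
yarn node -list
# the client classpath Hadoop would use
yarn classpath
```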

 

VII. Running a YARN Application

A YARN application can be a simple shell script, a MapReduce job, or any other kind of job. To run an application, the client first needs an ApplicationMaster: the client submits the application context to the ResourceManager, which then allocates memory and containers for the application. Roughly, the process has six stages:

1. Application initialization and submission;

2. Memory allocation and ApplicationMaster startup;

3. AM registration and resource allocation;

4. Container launch and monitoring;

5. Application progress reporting;

6. Application completion.

Let's run a task on the platform we just built to see this flow in action. The Hadoop distribution ships with a set of runnable examples:

[hadoop@hadoop-nn logs]$yarn jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.4.jar

An example program must be given as the first argument.

Valid program names are:

  aggregatewordcount: An Aggregate based map/reduce program that counts the words in the input files.

  aggregatewordhist: An Aggregate based map/reduce program that computes the histogram of the words in the input files.

  bbp: A map/reduce program that uses Bailey-Borwein-Plouffe to compute exact digits of Pi.

  dbcount: An example job that count the pageview counts from a database.

  distbbp: A map/reduce program that uses a BBP-type formula to compute exact bits of Pi.

  grep: A map/reduce program that counts the matches of a regex in the input.

  join: A job that effects a join over sorted, equally partitioned datasets

  multifilewc: A job that counts words from several files.

  pentomino: A map/reduce tile laying program to find solutions to pentomino problems.

  pi: A map/reduce program that estimates Pi using a quasi-Monte Carlo method.

  randomtextwriter: A map/reduce program that writes 10GB of random textual data per node.

  randomwriter: A map/reduce program that writes 10GB of random data per node.

  secondarysort: An example defining a secondary sort to the reduce.

  sort: A map/reduce program that sorts the data written by the random writer.

  sudoku: A sudoku solver.

  teragen: Generate data for the terasort

  terasort: Run the terasort

  teravalidate: Checking results of terasort

  wordcount: A map/reduce program that counts the words in the input files.

  wordmean: A map/reduce program that counts the average length of the words in the input files.

  wordmedian: A map/reduce program that counts the median length of the words in the input files.

  wordstandarddeviation: A map/reduce program that counts the standard deviation of the length of the words in the input files.

Let's test with wordcount, which is easy to understand. Remember the fstab and functions files uploaded to the HDFS cluster at the beginning; the job below runs a word count over both of them:

[hadoop@hadoop-nn logs]$yarn jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.4.jar wordcount /test/fstab /test/functions /test/wc

18/04/26 20:11:52 INFO client.RMProxy: Connecting to ResourceManager at master/192.168.10.91:8032

18/04/26 20:11:53 INFO input.FileInputFormat: Total input files to process : 2

18/04/26 20:11:53 INFO mapreduce.JobSubmitter: number of splits:2

18/04/26 20:11:54 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1524741305301_0001

18/04/26 20:11:54 INFO impl.YarnClientImpl: Submitted application application_1524741305301_0001

18/04/26 20:11:54 INFO mapreduce.Job: The url to track the job: http://master:8088/proxy/application_1524741305301_0001/

18/04/26 20:11:54 INFO mapreduce.Job: Running job: job_1524741305301_0001

18/04/26 20:12:04 INFO mapreduce.Job: Job job_1524741305301_0001 running in uber mode : false

18/04/26 20:12:04 INFO mapreduce.Job:  map 0% reduce 0%

18/04/26 20:12:11 INFO mapreduce.Job:  map 50% reduce 0%

18/04/26 20:12:12 INFO mapreduce.Job:  map 100% reduce 0%

18/04/26 20:12:18 INFO mapreduce.Job:  map 100% reduce 100%

18/04/26 20:12:18 INFO mapreduce.Job: Job job_1524741305301_0001 completed successfully

18/04/26 20:12:18 INFO mapreduce.Job: Counters: 49

        File System Counters

                FILE: Number of bytes read=10779

                FILE: Number of bytes written=494063

                FILE: Number of read operations=0

                FILE: Number of large read operations=0

                FILE: Number of write operations=0

                HDFS: Number of bytes read=15880

                HDFS: Number of bytes written=8015

                HDFS: Number of read operations=9

                HDFS: Number of large read operations=0

                HDFS: Number of write operations=2

        Job Counters

                Launched map tasks=2

                Launched reduce tasks=1

                Data-local map tasks=2

                Total time spent by all maps in occupied slots (ms)=8926

                Total time spent by all reduces in occupied slots (ms)=5022

                Total time spent by all map tasks (ms)=8926

                Total time spent by all reduce tasks (ms)=5022

                Total vcore-milliseconds taken by all map tasks=8926

                Total vcore-milliseconds taken by all reduce tasks=5022

                Total megabyte-milliseconds taken by all map tasks=9140224

                Total megabyte-milliseconds taken by all reduce tasks=5142528

        Map-Reduce Framework

                Map input records=666

                Map output records=2210

                Map output bytes=22507

                Map output materialized bytes=10785

                Input split bytes=192

                Combine input records=2210

                Combine output records=692

                Reduce input groups=686

                Reduce shuffle bytes=10785

                Reduce input records=692

                Reduce output records=686

                Spilled Records=1384

                Shuffled Maps =2

                Failed Shuffles=0

                Merged Map outputs=2

                GC time elapsed (ms)=201

                CPU time spent (ms)=2080

                Physical memory (bytes) snapshot=711757824

                Virtual memory (bytes) snapshot=6615576576

                Total committed heap usage (bytes)=492306432

        Shuffle Errors

                BAD_ID=0

                CONNECTION=0

                IO_ERROR=0

                WRONG_LENGTH=0

                WRONG_MAP=0

                WRONG_REDUCE=0

        File Input Format Counters

                Bytes Read=15688

        File Output Format Counters

                Bytes Written=8015

The results are written to /test/wc in the HDFS cluster. Note that the job fails if the output directory already exists.

While the task runs, the Hadoop management UI (port 8088) shows information like the following about it: the application name, the submitting user, the task name, the application type, the start time, the state, and the progress.

Now look at what is in /test/wc:

[hadoop@hadoop-nn logs]$hdfs dfs -ls /test/wc

Found 2 items

-rw-r--r--  2 hadoop supergroup          0 2018-04-26 20:12 /test/wc/_SUCCESS

-rw-r--r--  2 hadoop supergroup      8015 2018-04-26 20:12 /test/wc/part-r-00000

Look at the word-count result:
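The result file can be printed directly; each line of the output is a word followed by its count:

```shell
# show the first few lines of the word-count output
hdfs dfs -cat /test/wc/part-r-00000 | head
```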

VIII. Enabling the History Server

After a YARN task has run, its status is visible in the Web UI, but once the ResourceManager restarts those tasks are no longer visible. Enabling the Hadoop history service makes past task information available again.

With the history service on, the web page shows the details of jobs executed on YARN: for a finished MapReduce job you can see how many map tasks and reduce tasks it used, the submission time, the start time, the completion time, and so on.

[hadoop@hadoop-nn logs]$mr-jobhistory-daemon.sh start historyserver

starting historyserver, logging to /usr/local/hadoop/logs/mapred-hadoop-historyserver-c0a80a5b.nykjsrv.cn.out

[hadoop@hadoop-nn logs]$jps

9318 JobHistoryServer

4120 DataNode

1898 NameNode

2474 SecondaryNameNode

2922 NodeManager

2701 ResourceManager

9390 Jps

With the JobHistoryServer running, the history server can be viewed through its own web page:

Its default web port is 19888. Run a few more YARN tasks, then click through History to the history page to inspect task details.

However, at the bottom of an individual job's page, following the map and reduce count links into the task detail pages will not show the per-task logs yet, because the log aggregation service is not enabled.

 

IX. Enabling Log Aggregation

MapReduce tasks run on many machines, and the logs they produce live on those machines. Log aggregation collects these logs centrally onto HDFS so that the logs of all machines can be viewed in one place.

Hadoop does not enable log aggregation by default; enable it by adding the following to yarn-site.xml:

$vim /usr/local/hadoop/etc/hadoop/yarn-site.xml

<property>

        <name>yarn.log-aggregation-enable</name>

        <value>true</value>

</property>

<property>

        <name>yarn.log-aggregation.retain-seconds</name>

        <value>106800</value>

</property>

yarn.log-aggregation-enable: whether to enable log aggregation.

yarn.log-aggregation.retain-seconds: how long to retain aggregated logs, in seconds.

Distribute the configuration file to the other nodes:

[root@hadoop-nn ~]# su - hadoop

[hadoop@hadoop-nn ~]$ scp /usr/local/hadoop/etc/hadoop/* hadoop@192.168.10.92:/usr/local/hadoop/etc/hadoop/

[hadoop@hadoop-nn ~]$ scp /usr/local/hadoop/etc/hadoop/* hadoop@192.168.10.93:/usr/local/hadoop/etc/hadoop/

Restart the YARN processes:

[hadoop@hadoop-nn ~]$ stop-yarn.sh

[hadoop@hadoop-nn ~]$ start-yarn.sh

Restart the HistoryServer process:

[hadoop@hadoop-nn ~]$ mr-jobhistory-daemon.sh stop historyserver

[hadoop@hadoop-nn ~]$ mr-jobhistory-daemon.sh start historyserver

Test log aggregation by running a demo MapReduce job so that it produces logs:

[hadoop@hadoop-nn ~]$ yarn jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.4.jar wordcount /test/fstab /test/wc1

After the job runs, the logs of each map and reduce task are visible in the history server's web pages.
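The aggregated logs can also be fetched from the command line by application ID (the ID below is the one from the earlier wordcount run; substitute the ID of your own job):

```shell
# dump all aggregated container logs for an application
yarn logs -applicationId application_1524741305301_0001
```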

 

X. Heap Size Adjustments

Raise the Hadoop client heap from 512 MB to 4096 MB by editing hadoop-env.sh:

[hadoop@hadoop-nn ~]$ vim /usr/local/hadoop/etc/hadoop/hadoop-env.sh

export HADOOP_PORTMAP_OPTS="-Xmx4096m $HADOOP_PORTMAP_OPTS"

export HADOOP_CLIENT_OPTS="-Xmx4096m $HADOOP_CLIENT_OPTS"

 

Raise the YARN heap from 2048 MB to 4096 MB by editing yarn-env.sh:

[hadoop@hadoop-nn ~]$ vim /usr/local/hadoop/etc/hadoop/yarn-env.sh

JAVA_HEAP_MAX=-Xmx4096m