1. Hadoop cluster planning
IP | Hostname | Installed software | Roles | Running processes |
---|---|---|---|---|
10.124.147.22 | hadoop1 | jdk, zookeeper, hadoop | namenode/zookeeper/jobhistoryserver | DFSZKFailoverController, NameNode, JobHistoryServer, QuorumPeerMain |
10.124.147.23 | hadoop2 | jdk, zookeeper, hadoop | namenode/zookeeper | DFSZKFailoverController, NameNode, QuorumPeerMain |
10.124.147.32 | hadoop3 | jdk, zookeeper, hadoop | resourcemanager/zookeeper | ResourceManager, QuorumPeerMain |
10.124.147.33 | hadoop4 | jdk, zookeeper, hadoop | resourcemanager/zookeeper | ResourceManager, QuorumPeerMain |
10.110.92.161 | hadoop5 | jdk, hadoop | datanode/journalnode | NodeManager, JournalNode, DataNode |
10.110.92.162 | hadoop6 | jdk, hadoop | datanode/journalnode | NodeManager, JournalNode, DataNode |
10.122.147.37 | hadoop7 | jdk, hadoop | datanode/journalnode | NodeManager, JournalNode, DataNode |
2. Base environment
System OS: CentOS 6.5
Hadoop: 2.7.6
ZooKeeper: 3.4.12
JDK: 1.8.0_141
3. Environment preparation
3.1 hosts setup
    [root@10-124-147-23 local]# cat /etc/hosts
    127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
    10.124.147.22 hadoop1 10-124-147-22
    10.124.147.23 hadoop2 10-124-147-23
    10.124.147.32 hadoop3 10-124-147-32
    10.124.147.33 hadoop4 10-124-147-33
    10.110.92.161 hadoop5 10-110-92-161
    10.110.92.162 hadoop6 10-110-92-162
    10.122.147.37 hadoop7 10-122-147-37
A few points to note here:

- Do not put any cluster hostname after 127.0.0.1, such as the 10-124-147-22 alias shown above.
- It is best to delete the localhost entries on the IPv6 line.
- Besides hadoop1 I also kept the 10-124-147-22 alias, because I did not want to rename the machines; in practice you can simply change the hostname directly (a sketch follows below).
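If you do decide to rename a node to match the table above, a minimal sketch for CentOS 6 (the OS used here) could look like this; the hostname value is just an example and should be adjusted per node.

```bash
# Sketch for CentOS 6: rename the current machine to hadoop1 (example value)
hostname hadoop1                                                  # takes effect immediately
sed -i 's/^HOSTNAME=.*/HOSTNAME=hadoop1/' /etc/sysconfig/network  # survives a reboot
```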
3.2 Java environment installation
3.2.1 Unpack the JDK tarball
    [root@10-124-147-23 letv]# tar xvf jdk-8u141-linux-x64.tar.gz
    [root@10-124-147-23 letv]# ln -svfn /letv/jdk1.8.0_141 /usr/local/java
3.2.2 Update /etc/profile
    [root@10-124-147-23 letv]# tail -3 /etc/profile
    export JAVA_HOME=/usr/local/java
    export HADOOP_HOME=/usr/local/hadoop
    export PATH=$HADOOP_HOME/bin:$JAVA_HOME/bin:$PATH
    [root@10-124-147-23 letv]# source /etc/profile
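An optional quick check that the variables are picked up, using the paths configured above:

```bash
# Verify the exported variables and the Java installation
source /etc/profile
echo "$JAVA_HOME" "$HADOOP_HOME"
java -version
```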
3.3 ZooKeeper cluster installation
3.3.1 Unpack the ZooKeeper tarball
    [root@10-124-147-23 letv]# tar xvf zookeeper-3.4.12.tar.gz
    [root@10-124-147-23 letv]# ln -svnf /letv/zookeeper-3.4.12 /usr/local/zookeeper
    [root@10-124-147-23 letv]# cd /usr/local/zookeeper/conf
    [root@10-124-147-23 conf]# ll
    total 16
    -rw-rw-r-- 1 1000 1000  535 Mar 27 12:32 configuration.xsl
    -rw-rw-r-- 1 1000 1000 2161 Mar 27 12:32 log4j.properties
    -rw-rw-r-- 1 1000 1000  922 Mar 27 12:32 zoo_sample.cfg
    [root@10-124-147-23 conf]# cp zoo_sample.cfg zoo.cfg
3.3.2 Edit zoo.cfg
    [root@10-124-147-23 conf]# grep ^[^#] zoo.cfg
    tickTime=2000
    initLimit=10
    syncLimit=5
    dataDir=/usr/local/zookeeper/data
    clientPort=2181
    server.1=hadoop1:2888:3888
    server.2=hadoop2:2888:3888
    server.3=hadoop3:2888:3888
    server.4=hadoop4:2888:3888
Change the dataDir value, and, since we are building a ZooKeeper ensemble at the same time, list the server addresses of all members at the bottom of the file.
    [root@10-124-147-23 conf]# echo 1 > /usr/local/zookeeper/data/myid
Write this host's id within the ZooKeeper ensemble into myid, then start ZooKeeper.
3.3.3 Start ZooKeeper
    [root@10-124-147-23 bin]# pwd
    /usr/local/zookeeper/bin
    [root@10-124-147-23 bin]# ./zkServer.sh start
Start ZooKeeper on the other hosts in the same way; the only difference is the value in /usr/local/zookeeper/data/myid, which must differ from host to host.
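If passwordless ssh between the nodes is already in place, a small helper loop like the hedged sketch below can assign the ids; otherwise write each myid by hand as above. The path is the one used in this article.

```bash
# Hypothetical helper: give hadoop1..hadoop4 the ZooKeeper ids 1..4
i=1
for h in hadoop1 hadoop2 hadoop3 hadoop4; do
  ssh "$h" "mkdir -p /usr/local/zookeeper/data && echo $i > /usr/local/zookeeper/data/myid"
  i=$((i + 1))
done
```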
3.3.4 Check ZooKeeper status
    [root@10-124-147-23 bin]# ./zkServer.sh status
    ZooKeeper JMX enabled by default
    Using config: /usr/local/zookeeper/bin/../conf/zoo.cfg
    Mode: follower
    [root@10-124-147-33 ~]# /usr/local/zookeeper/bin/zkServer.sh status
    ZooKeeper JMX enabled by default
    Using config: /usr/local/zookeeper/bin/../conf/zoo.cfg
    Mode: leader
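Beyond zkServer.sh status, you can also connect with the bundled CLI as an optional sanity check; the server address below is just one member of the ensemble.

```bash
# List the root znode through the ZooKeeper CLI (non-interactive invocation)
/usr/local/zookeeper/bin/zkCli.sh -server hadoop1:2181 ls /
```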
4. Hadoop installation
Hadoop 2.0 officially offers two HDFS HA solutions, NFS and QJM; here we use the simpler QJM. In this scheme the active and standby NameNodes synchronize metadata through a group of JournalNodes, and an edit is considered written once it has reached a majority of the JournalNodes. The number of JournalNodes therefore needs to be odd.
4.1 Unpack Hadoop
    [root@10-124-147-33 letv]# tar xvf hadoop-2.7.6.tar.gz
    [root@10-124-147-23 ~]# ln -svnf /letv/hadoop-2.7.6 /usr/local/hadoop
4.2 Hadoop environment
For this Hadoop installation only the Java and Hadoop environment variables need to be set. Both ZooKeeper and Hadoop need a Java runtime, and it was already configured above:
    [root@10-124-147-23 letv]# tail -3 /etc/profile
    export JAVA_HOME=/usr/local/java
    export HADOOP_HOME=/usr/local/hadoop
    export PATH=$HADOOP_HOME/bin:$JAVA_HOME/bin:$PATH
4.3 Hadoop configuration changes
The Hadoop configuration files live under etc/hadoop; the six main files are the following.
4.3.1 hadoop-env.sh
    [root@10-124-147-23 ~]# grep JAVA_HOME /usr/local/hadoop/etc/hadoop/hadoop-env.sh
    # The only required environment variable is JAVA_HOME.  All others are
    # set JAVA_HOME in this file, so that it is correctly defined on
    export JAVA_HOME=/usr/local/java
JAVA_HOME here must point to the actual Java path; it cannot be left as ${JAVA_HOME}, because the variable is not resolved at this point, most likely because the start scripts launch the daemons through non-interactive ssh sessions that do not source /etc/profile.
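A small sketch for setting the value without opening an editor, assuming the stock hadoop-env.sh whose JAVA_HOME line starts with `export JAVA_HOME=`:

```bash
# Hard-code JAVA_HOME in hadoop-env.sh
sed -i 's|^export JAVA_HOME=.*|export JAVA_HOME=/usr/local/java|' \
    /usr/local/hadoop/etc/hadoop/hadoop-env.sh
```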
4.3.2 hdfs-site.xml
    [root@10-124-147-23 ~]# cat /usr/local/hadoop/etc/hadoop/hdfs-site.xml
    <?xml version="1.0" encoding="UTF-8"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    <!--
      Licensed under the Apache License, Version 2.0 (the "License");
      you may not use this file except in compliance with the License.
      You may obtain a copy of the License at

        http://www.apache.org/licenses/LICENSE-2.0

      Unless required by applicable law or agreed to in writing, software
      distributed under the License is distributed on an "AS IS" BASIS,
      WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
      See the License for the specific language governing permissions and
      limitations under the License. See accompanying LICENSE file.
    -->

    <!-- Put site-specific property overrides in this file. -->

    <configuration>
      <!-- The HDFS nameservice is ns1; it must match core-site.xml -->
      <property>
        <name>dfs.nameservices</name>
        <value>ns1</value>
      </property>
      <!-- ns1 has two NameNodes, nn1 and nn2 -->
      <property>
        <name>dfs.ha.namenodes.ns1</name>
        <value>nn1,nn2</value>
      </property>
      <!-- RPC address of nn1 -->
      <property>
        <name>dfs.namenode.rpc-address.ns1.nn1</name>
        <value>hadoop1:9000</value>
      </property>
      <!-- HTTP address of nn1 -->
      <property>
        <name>dfs.namenode.http-address.ns1.nn1</name>
        <value>hadoop1:50070</value>
      </property>
      <!-- RPC address of nn2 -->
      <property>
        <name>dfs.namenode.rpc-address.ns1.nn2</name>
        <value>hadoop2:9000</value>
      </property>
      <!-- HTTP address of nn2 -->
      <property>
        <name>dfs.namenode.http-address.ns1.nn2</name>
        <value>hadoop2:50070</value>
      </property>
      <!-- Where the NameNode metadata (edits) is stored on the JournalNodes -->
      <property>
        <name>dfs.namenode.shared.edits.dir</name>
        <value>qjournal://hadoop5:8485;hadoop6:8485;hadoop7:8485/ns1</value>
      </property>
      <!-- Local directory where each JournalNode stores its data -->
      <property>
        <name>dfs.journalnode.edits.dir</name>
        <value>/usr/local/hadoop/data/journaldata</value>
      </property>
      <!-- Enable automatic NameNode failover -->
      <property>
        <name>dfs.ha.automatic-failover.enabled</name>
        <value>true</value>
      </property>
      <!-- Failover proxy provider used by clients -->
      <property>
        <name>dfs.client.failover.proxy.provider.ns1</name>
        <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
      </property>
      <!-- Fencing methods, separated by newlines, one per line -->
      <property>
        <name>dfs.ha.fencing.methods</name>
        <value>
          sshfence
          shell(/bin/true)
        </value>
      </property>
      <!-- sshfence requires passwordless ssh -->
      <property>
        <name>dfs.ha.fencing.ssh.private-key-files</name>
        <value>/home/hadoop/.ssh/id_rsa</value>
      </property>
      <!-- Timeout for the sshfence method -->
      <property>
        <name>dfs.ha.fencing.ssh.connect-timeout</name>
        <value>30000</value>
      </property>
    </configuration>
In Hadoop 3, the HDFS web UI port has been changed from 50070 to 9870.
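To confirm which values a node actually resolves, `hdfs getconf` can read keys back from the configuration; the key names below are taken from the file above.

```bash
# Read back effective configuration values
hdfs getconf -confKey dfs.nameservices
hdfs getconf -confKey dfs.namenode.http-address.ns1.nn1
hdfs getconf -confKey dfs.namenode.shared.edits.dir
```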
4.3.3 mapred-site.xml
    [root@10-124-147-23 ~]# cat /usr/local/hadoop/etc/hadoop/mapred-site.xml
    <?xml version="1.0"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    <!--
      Licensed under the Apache License, Version 2.0 (the "License");
      you may not use this file except in compliance with the License.
      You may obtain a copy of the License at

        http://www.apache.org/licenses/LICENSE-2.0

      Unless required by applicable law or agreed to in writing, software
      distributed under the License is distributed on an "AS IS" BASIS,
      WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
      See the License for the specific language governing permissions and
      limitations under the License. See accompanying LICENSE file.
    -->

    <!-- Put site-specific property overrides in this file. -->

    <configuration>
      <!-- Run MapReduce on YARN -->
      <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
      </property>
      <property>
        <name>mapreduce.jobhistory.address</name>
        <value>hadoop1:10020</value>
      </property>
      <property>
        <name>mapreduce.jobhistory.webapp.address</name>
        <value>hadoop1:19888</value>
      </property>
    </configuration>
4.3.4 core-site.xml
    [root@10-124-147-23 ~]# cat /usr/local/hadoop/etc/hadoop/core-site.xml
    <?xml version="1.0" encoding="UTF-8"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    <!--
      Licensed under the Apache License, Version 2.0 (the "License");
      you may not use this file except in compliance with the License.
      You may obtain a copy of the License at

        http://www.apache.org/licenses/LICENSE-2.0

      Unless required by applicable law or agreed to in writing, software
      distributed under the License is distributed on an "AS IS" BASIS,
      WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
      See the License for the specific language governing permissions and
      limitations under the License. See accompanying LICENSE file.
    -->

    <!-- Put site-specific property overrides in this file. -->

    <configuration>
      <!-- Use the HDFS nameservice ns1 as the default filesystem -->
      <property>
        <name>fs.defaultFS</name>
        <value>hdfs://ns1</value>
      </property>
      <!-- Hadoop temporary directory -->
      <property>
        <name>hadoop.tmp.dir</name>
        <value>/usr/local/hadoop/data/tmp</value>
      </property>
      <!-- ZooKeeper quorum -->
      <property>
        <name>ha.zookeeper.quorum</name>
        <value>hadoop1:2181,hadoop2:2181,hadoop3:2181,hadoop4:2181</value>
      </property>
    </configuration>
4.3.5 yarn-site.xml
    [root@10-124-147-23 ~]# cat /usr/local/hadoop/etc/hadoop/yarn-site.xml
    <?xml version="1.0"?>
    <!--
      Licensed under the Apache License, Version 2.0 (the "License");
      you may not use this file except in compliance with the License.
      You may obtain a copy of the License at

        http://www.apache.org/licenses/LICENSE-2.0

      Unless required by applicable law or agreed to in writing, software
      distributed under the License is distributed on an "AS IS" BASIS,
      WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
      See the License for the specific language governing permissions and
      limitations under the License. See accompanying LICENSE file.
    -->
    <configuration>
      <!-- Site specific YARN configuration properties -->
      <!-- Enable ResourceManager HA -->
      <property>
        <name>yarn.resourcemanager.ha.enabled</name>
        <value>true</value>
      </property>
      <!-- ResourceManager cluster id -->
      <property>
        <name>yarn.resourcemanager.cluster-id</name>
        <value>yrc</value>
      </property>
      <!-- Logical ids of the two ResourceManagers -->
      <property>
        <name>yarn.resourcemanager.ha.rm-ids</name>
        <value>rm1,rm2</value>
      </property>
      <!-- Hostnames of the two ResourceManagers -->
      <property>
        <name>yarn.resourcemanager.hostname.rm1</name>
        <value>hadoop3</value>
      </property>
      <property>
        <name>yarn.resourcemanager.hostname.rm2</name>
        <value>hadoop4</value>
      </property>
      <!-- ZooKeeper quorum -->
      <property>
        <name>yarn.resourcemanager.zk-address</name>
        <value>hadoop1:2181,hadoop2:2181,hadoop3:2181,hadoop4:2181</value>
      </property>
      <!-- Allow job state to be recovered after an RM takeover -->
      <property>
        <name>yarn.resourcemanager.recovery.enabled</name>
        <value>true</value>
      </property>
      <!-- Where YARN state is stored; here ZooKeeper is used instead of the default -->
      <property>
        <name>yarn.resourcemanager.store.class</name>
        <value>org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore</value>
      </property>
      <!-- Enable the mapreduce_shuffle auxiliary service on YARN -->
      <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
      </property>
    </configuration>
4.3.6 slaves
    [root@10-124-147-23 ~]# cat /usr/local/hadoop/etc/hadoop/slaves
    hadoop5
    hadoop6
    hadoop7
The slaves file means two different things here. For hadoop1, a namenode, its slaves are the slaves of the HDFS layer, i.e. the datanodes; in this article hadoop5, hadoop6 and hadoop7 are the datanodes.
For hadoop3, a resourcemanager, its slaves are the slaves of the YARN layer, i.e. the nodemanagers. A nodemanager monitors the resource usage of its own machine and reports it to the resourcemanager, and a datanode normally runs a nodemanager process as well.
In this article the journalnode, nodemanager and datanode roles all sit on the same machines, but the journalnode is only involved in NameNode HA and has nothing to do with the other two. A cluster must never have two NameNodes working at the same time, otherwise the namespace would be corrupted; yet for HA the standby namenode has to stay consistent with the active one. To synchronize, the two NameNodes communicate through a group of independent processes called journalnodes: whenever the active namenode changes its namespace, it records the change on a majority of the journalnodes, while the standby namenode reads those changes, keeps watching the edit log, and applies every change to its own namespace. This ensures the namespace state is fully synchronized when the cluster has to fail over.
In normal production five journalnodes are typical, and the ZooKeeper ensemble is usually five nodes as well; the four ZooKeeper nodes used in this article are not really ideal, since an even-sized ensemble tolerates no more failures than the next smaller odd-sized one.
In summary, the slaves of hadoop3 can also be set to hadoop5, hadoop6 and hadoop7, so every node in this article can keep an identical Hadoop configuration.
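Since every node is meant to carry the same configuration, a rough consistency check such as the sketch below can help; it assumes passwordless ssh and the install path used in this article.

```bash
# Compare a checksum of the *-site.xml files across all nodes
for h in hadoop1 hadoop2 hadoop3 hadoop4 hadoop5 hadoop6 hadoop7; do
  echo -n "$h: "
  ssh "$h" 'cat /usr/local/hadoop/etc/hadoop/*-site.xml | md5sum'
done
```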
4.3.7 ssh-key authentication
In real production only the NameNodes need passwordless ssh between each other. In this test environment the other slave nodes are started from the namenode by script, so passwordless ssh keys need to be set up more broadly.
Concretely, each datanode needs the ssh keys of the two namenodes and the two resourcemanagers, and the namenodes and resourcemanagers also need their own keys in order to start their local daemons. So the public keys of the hadoop user on hadoop1, hadoop2, hadoop3 and hadoop4 must be placed under the hadoop user on every host.
    [root@10-124-147-23 ~]# useradd hadoop
    [hadoop@10-124-147-23 ~]$ ssh-keygen
    [hadoop@10-124-147-23 ~]$ cat .ssh/id_rsa.pub
    ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAQEAyQ9T7zTAlhqFM9XQoHTPzwfgDwAzwLUgqe7NnDpufiirK9QqCdLZFNE6PNtN7oNyWMu3r9UE5aMYv9uLMu22m+8xyTXXINYfPW9hsityu/N6a9DwhEC9joNS3DVjBR8YRMQG2sxtDbebbaG2R4BK77DZyoB0uyqRItxLIMYTiZ/00LCMJCoAINUQVzOrteVpLHAviRNnrwZewoD2sUgeZU0A0hT++RiE/prqI+jIFJSacduVaKsabRu/zKan9b8coC1b+GJnypqk+CPyahJL+0jgb9Jgrjm2Lt4erbBo/k3u16nSJpSoSdf7kr5HKv3ds5+fwcMQV5oKV1jv6ximIw== hadoop@10-124-147-23
Then switch to each of the other hosts, create the hadoop user there as well, and write in the namenode's ssh key:
    [root@10-124-147-33 letv]# useradd hadoop
    [hadoop@10-124-147-33 ~]$ mkdir .ssh
    [hadoop@10-124-147-33 ~]$ chmod g-w .ssh

The chmod step above is very important. Normally you would set a password for the hadoop user and then let ssh-copy-id write the key to the other hosts automatically; here no password is set for the hadoop user, and for security sshd refuses keys when group or others have write permission on .ssh, so that write bit has to be removed. The same applies to the authorized_keys file below.

    [hadoop@10-124-147-33 ~]$ vim .ssh/authorized_keys    # paste in the id_rsa.pub from hadoop1
    [hadoop@10-124-147-33 ~]$ chmod 600 .ssh/authorized_keys
    [hadoop@10-124-147-33 ~]$ ll .ssh/authorized_keys
    -rw------- 1 hadoop hadoop 1608 Jul 19 11:43 .ssh/authorized_keys
    [hadoop@10-124-147-33 ~]$ ll -d .ssh/
    drwxr-xr-x 2 hadoop hadoop 4096 Jul 19 11:43 .ssh/
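If you are willing to give the hadoop user a password, the usual shortcut is ssh-copy-id, which creates .ssh and authorized_keys with the right permissions by itself. This is a hedged alternative to the manual steps above and would be run from each of hadoop1 through hadoop4.

```bash
# Push this host's public key to every node (prompts for the hadoop password once per host)
for h in hadoop1 hadoop2 hadoop3 hadoop4 hadoop5 hadoop6 hadoop7; do
  ssh-copy-id hadoop@"$h"
done
```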
4.3.8 Copy the Hadoop files
scp the entire Hadoop directory from hadoop1 to the other nodes, and remember /etc/profile as well as the Java environment on the nodes that are still missing it (a sketch follows below).
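A minimal distribution sketch, assuming passwordless ssh, a user that may write to /letv and /usr/local, and the paths used earlier in this article; rsync could just as well be replaced by scp -r.

```bash
# Copy the unpacked Hadoop tree and recreate the /usr/local/hadoop symlink on the other nodes
for h in hadoop2 hadoop3 hadoop4 hadoop5 hadoop6 hadoop7; do
  rsync -a /letv/hadoop-2.7.6/ "$h":/letv/hadoop-2.7.6/
  ssh "$h" 'ln -svnf /letv/hadoop-2.7.6 /usr/local/hadoop'
done
```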
4.4 Starting Hadoop
4.4.1 Start the journalnodes
    [hadoop@10-110-92-161 ~]$ cd /usr/local/hadoop/
    [hadoop@10-110-92-161 hadoop]$ sbin/hadoop-daemon.sh start journalnode
    [hadoop@10-110-92-161 hadoop]$ jps
    1557 JournalNode
    22439 Jps
The journalnode has to be started on all three journal nodes (see the loop below).
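Instead of logging in to hadoop5, hadoop6 and hadoop7 one by one, a loop like this sketch (passwordless ssh assumed) starts all three:

```bash
# Start a JournalNode on each journal host
for h in hadoop5 hadoop6 hadoop7; do
  ssh "$h" '/usr/local/hadoop/sbin/hadoop-daemon.sh start journalnode'
done
```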
4.4.2 Format the namenode
    [hadoop@10-124-147-22 hadoop]$ hdfs namenode -format
4.4.3 Start the active namenode
    [hadoop@10-124-147-22 hadoop]$ sbin/hadoop-daemon.sh start namenode
    [hadoop@10-124-147-22 hadoop]$ jps
    2580 DFSZKFailoverController
    29590 Jps
    1487 NameNode
4.4.4 Copy the active namenode metadata to the standby namenode
Formatting the active namenode creates files under the hadoop.tmp.dir configured in core-site.xml. These can be copied to the standby namenode directly, or pulled from the active namenode with the -bootstrapStandby option; pulling them that way requires the active namenode process to be running.
    [hadoop@10-124-147-23 hadoop]$ hdfs namenode -bootstrapStandby
    [hadoop@10-124-147-23 hadoop]$ sbin/hadoop-daemon.sh start namenode
    [hadoop@10-124-147-23 hadoop]$ jps
    899 NameNode
    11846 Jps
    1353 DFSZKFailoverController
4.4.5 Format zkfc
    [hadoop@10-124-147-22 hadoop]$ hdfs zkfc -formatZK
4.4.6 Start HDFS
    [hadoop@10-124-147-22 hadoop]$ sbin/start-dfs.sh
4.4.7 Start the resourcemanager
    [hadoop@10-124-147-32 hadoop]$ pwd
    /usr/local/hadoop
    [hadoop@10-124-147-32 hadoop]$ sbin/start-yarn.sh
    [hadoop@10-124-147-32 hadoop]$ jps
    30882 ResourceManager
    26868 Jps
4.4.8 Start the standby resourcemanager
    [hadoop@10-124-147-33 hadoop]$ pwd
    /usr/local/hadoop
    [hadoop@10-124-147-33 hadoop]$ sbin/yarn-daemon.sh start resourcemanager
    [hadoop@10-124-147-33 hadoop]$ jps
    22675 Jps
    26980 ResourceManager
4.4.9 Check cluster status
    [hadoop@10-124-147-22 hadoop]$ hdfs haadmin -getServiceState nn1
    active
    [hadoop@10-124-147-22 hadoop]$ hdfs haadmin -getServiceState nn2
    standby
    [hadoop@10-124-147-22 hadoop]$ yarn rmadmin -getServiceState rm1
    active
    [hadoop@10-124-147-22 hadoop]$ yarn rmadmin -getServiceState rm2
    standby
At this point you can open the web UI of the active namenode on port 50070 and of the active resourcemanager on port 8088.
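A couple of optional command-line checks; the ports are the ones configured above (50070 for the NameNode UI) plus the Hadoop 2.x default 8088 for the ResourceManager UI, and the hostnames are those of the currently active daemons.

```bash
# Summary of datanodes as seen by the active NameNode
hdfs dfsadmin -report | head -n 20
# Probe the web UIs
curl -sf http://hadoop1:50070/ > /dev/null && echo "namenode web UI reachable"
curl -sf http://hadoop3:8088/cluster > /dev/null && echo "resourcemanager web UI reachable"
```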
4.4.10 Start the history server
Start it on the active namenode:
    [hadoop@10-124-147-22 hadoop]$ sbin/mr-jobhistory-daemon.sh start historyserver
    [hadoop@10-124-147-22 hadoop]$ pwd
    /usr/local/hadoop
    [hadoop@10-124-147-22 hadoop]$ jps
    2580 DFSZKFailoverController
    31781 Jps
    2711 JobHistoryServer
    1487 NameNode
4.5 Basic Hadoop usage
4.5.1 Upload a file to HDFS
Create a file /tmp/test.txt:

    [hadoop@10-124-147-22 hadoop]$ cat /tmp/test.txt
    hello world
    hello mysql
    hello mongo
    hello elasticsearch
    hello hadoop
    hello hdfs
    hello yarn
    hello namenode
    hello datanode
    hello resourcemanager
    hello nodemanager
    hello journalnode
    [hadoop@10-124-147-22 hadoop]$ hadoop fs -put /tmp/test.txt /wordcount

This uploads /tmp/test.txt into HDFS under the name /wordcount:

    [hadoop@10-124-147-22 hadoop]$ hadoop fs -cat /wordcount
    hello world
    hello mysql
    hello mongo
    hello elasticsearch
    hello hadoop
    hello hdfs
    hello yarn
    hello namenode
    hello datanode
    hello resourcemanager
    hello nodemanager
    hello journalnode
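A few more everyday HDFS operations, for reference; the paths here are only examples.

```bash
hadoop fs -mkdir -p /user/hadoop/input        # create a directory
hadoop fs -ls /                               # list the root of HDFS
hadoop fs -get /wordcount /tmp/wordcount.txt  # copy a file back to the local filesystem
hadoop fs -du -h /                            # show space used per entry
```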
4.5.2 Hadoop job test
Hadoop ships an examples jar with simple test jobs:
    [hadoop@10-124-147-22 hadoop]$ hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.6.jar pi 2 10
    Number of Maps  = 2
    Samples per Map = 10
    Wrote input for Map #0
    Wrote input for Map #1
    Starting Job
    18/07/23 15:41:47 INFO input.FileInputFormat: Total input paths to process : 2
    18/07/23 15:41:47 INFO mapreduce.JobSubmitter: number of splits:2
    18/07/23 15:41:47 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1532056892547_0003
    18/07/23 15:41:47 INFO impl.YarnClientImpl: Submitted application application_1532056892547_0003
    18/07/23 15:41:47 INFO mapreduce.Job: The url to track the job: http://hadoop3:8088/proxy/application_1532056892547_0003/
    18/07/23 15:41:47 INFO mapreduce.Job: Running job: job_1532056892547_0003
    18/07/23 15:41:53 INFO mapreduce.Job: Job job_1532056892547_0003 running in uber mode : false
    18/07/23 15:41:53 INFO mapreduce.Job:  map 0% reduce 0%
    18/07/23 15:41:58 INFO mapreduce.Job:  map 100% reduce 0%
    18/07/23 15:42:03 INFO mapreduce.Job:  map 100% reduce 100%
    18/07/23 15:42:04 INFO mapreduce.Job: Job job_1532056892547_0003 completed successfully
    18/07/23 15:42:05 INFO mapreduce.Job: Counters: 49
        File System Counters
            FILE: Number of bytes read=50
            FILE: Number of bytes written=376437
            FILE: Number of read operations=0
            FILE: Number of large read operations=0
            FILE: Number of write operations=0
            HDFS: Number of bytes read=510
            HDFS: Number of bytes written=215
            HDFS: Number of read operations=11
            HDFS: Number of large read operations=0
            HDFS: Number of write operations=3
        Job Counters
            Launched map tasks=2
            Launched reduce tasks=1
            Data-local map tasks=2
            Total time spent by all maps in occupied slots (ms)=5283
            Total time spent by all reduces in occupied slots (ms)=2804
            Total time spent by all map tasks (ms)=5283
            Total time spent by all reduce tasks (ms)=2804
            Total vcore-milliseconds taken by all map tasks=5283
            Total vcore-milliseconds taken by all reduce tasks=2804
            Total megabyte-milliseconds taken by all map tasks=5409792
            Total megabyte-milliseconds taken by all reduce tasks=2871296
        Map-Reduce Framework
            Map input records=2
            Map output records=4
            Map output bytes=36
            Map output materialized bytes=56
            Input split bytes=274
            Combine input records=0
            Combine output records=0
            Reduce input groups=2
            Reduce shuffle bytes=56
            Reduce input records=4
            Reduce output records=0
            Spilled Records=8
            Shuffled Maps =2
            Failed Shuffles=0
            Merged Map outputs=2
            GC time elapsed (ms)=219
            CPU time spent (ms)=3030
            Physical memory (bytes) snapshot=752537600
            Virtual memory (bytes) snapshot=6612717568
            Total committed heap usage (bytes)=552075264
        Shuffle Errors
            BAD_ID=0
            CONNECTION=0
            IO_ERROR=0
            WRONG_LENGTH=0
            WRONG_MAP=0
            WRONG_REDUCE=0
        File Input Format Counters
            Bytes Read=236
        File Output Format Counters
            Bytes Written=97
    Job Finished in 18.492 seconds
    Estimated value of Pi is 3.80000000000000000000
While a job is running, you can watch its progress on port 8088 of the resourcemanager web UI.
Next, run a word count job.
The word count example counts the words in the /wordcount file in HDFS and writes the result to /wordcount-to-output:

    [hadoop@10-124-147-22 hadoop]$ hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.6.jar wordcount /wordcount /wordcount-to-output
    18/07/23 15:45:12 INFO input.FileInputFormat: Total input paths to process : 1
    18/07/23 15:45:13 INFO mapreduce.JobSubmitter: number of splits:1
    18/07/23 15:45:13 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1532056892547_0004
    18/07/23 15:45:13 INFO impl.YarnClientImpl: Submitted application application_1532056892547_0004
    18/07/23 15:45:13 INFO mapreduce.Job: The url to track the job: http://hadoop3:8088/proxy/application_1532056892547_0004/
    18/07/23 15:45:13 INFO mapreduce.Job: Running job: job_1532056892547_0004
    18/07/23 15:45:19 INFO mapreduce.Job: Job job_1532056892547_0004 running in uber mode : false
    18/07/23 15:45:19 INFO mapreduce.Job:  map 0% reduce 0%
    18/07/23 15:45:23 INFO mapreduce.Job:  map 100% reduce 0%
    18/07/23 15:45:29 INFO mapreduce.Job:  map 100% reduce 100%
    18/07/23 15:45:29 INFO mapreduce.Job: Job job_1532056892547_0004 completed successfully
    18/07/23 15:45:29 INFO mapreduce.Job: Counters: 49
        File System Counters
            FILE: Number of bytes read=197
            FILE: Number of bytes written=250631
            FILE: Number of read operations=0
            FILE: Number of large read operations=0
            FILE: Number of write operations=0
            HDFS: Number of bytes read=264
            HDFS: Number of bytes written=140
            HDFS: Number of read operations=6
            HDFS: Number of large read operations=0
            HDFS: Number of write operations=2
        Job Counters
            Launched map tasks=1
            Launched reduce tasks=1
            Data-local map tasks=1
            Total time spent by all maps in occupied slots (ms)=2492
            Total time spent by all reduces in occupied slots (ms)=3007
            Total time spent by all map tasks (ms)=2492
            Total time spent by all reduce tasks (ms)=3007
            Total vcore-milliseconds taken by all map tasks=2492
            Total vcore-milliseconds taken by all reduce tasks=3007
            Total megabyte-milliseconds taken by all map tasks=2551808
            Total megabyte-milliseconds taken by all reduce tasks=3079168
        Map-Reduce Framework
            Map input records=12
            Map output records=24
            Map output bytes=275
            Map output materialized bytes=197
            Input split bytes=85
            Combine input records=24
            Combine output records=13
            Reduce input groups=13
            Reduce shuffle bytes=197
            Reduce input records=13
            Reduce output records=13
            Spilled Records=26
            Shuffled Maps =1
            Failed Shuffles=0
            Merged Map outputs=1
            GC time elapsed (ms)=155
            CPU time spent (ms)=2440
            Physical memory (bytes) snapshot=465940480
            Virtual memory (bytes) snapshot=4427837440
            Total committed heap usage (bytes)=350224384
        Shuffle Errors
            BAD_ID=0
            CONNECTION=0
            IO_ERROR=0
            WRONG_LENGTH=0
            WRONG_MAP=0
            WRONG_REDUCE=0
        File Input Format Counters
            Bytes Read=179
        File Output Format Counters
            Bytes Written=140
The result:
    [hadoop@10-124-147-22 hadoop]$ hadoop fs -ls /
    Found 5 items
    drwxrwx---   - hadoop supergroup          0 2018-07-20 11:21 /tmp
    drwxr-xr-x   - hadoop supergroup          0 2018-07-20 11:47 /user
    -rw-r--r--   3 hadoop supergroup        179 2018-07-20 11:22 /wordcount
    drwxr-xr-x   - hadoop supergroup          0 2018-07-23 15:45 /wordcount-to-output
    [hadoop@10-124-147-22 hadoop]$ hadoop fs -ls /wordcount-to-output
    Found 2 items
    -rw-r--r--   3 hadoop supergroup          0 2018-07-23 15:45 /wordcount-to-output/_SUCCESS
    -rw-r--r--   3 hadoop supergroup        140 2018-07-23 15:45 /wordcount-to-output/part-r-00000
    [hadoop@10-124-147-22 hadoop]$ hadoop fs -cat /wordcount-to-output/part-r-00000
    datanode        1
    elasticsearch   1
    hadoop          1
    hdfs            1
    hello           12
    journalnode     1
    mongo           1
    mysql           1
    namenode        1
    nodemanager     1
    resourcemanager 1
    world           1
    yarn            1
5. Miscellaneous
5.1 Port changes from Hadoop 2 to Hadoop 3
    Namenode ports:     50470 --> 9871, 50070 --> 9870, 8020 --> 9820
    Secondary NN ports: 50091 --> 9869, 50090 --> 9868
    Datanode ports:     50020 --> 9867, 50010 --> 9866, 50475 --> 9865, 50075 --> 9864
    KMS service:        16000 --> 9600
The slaves file has changed as well: what Hadoop 2 calls slaves is named workers in Hadoop 3.
5.2 Starting datanodes in production
A production Hadoop cluster usually has hundreds or thousands of datanodes, and in practice each datanode is started individually on its own host rather than from the namenode, so the ssh-key setup of section 4.3.7 is far less necessary in production (a per-node start looks like the sketch below). Also, although a journalnode consumes few resources, it is normally not placed on the same hosts as the datanodes.
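For reference, bringing up a single worker by hand looks roughly like this, run as the hadoop user on that node; the script names are the ones shipped with Hadoop 2.7 and the path is the one used in this article.

```bash
# Start the local HDFS and YARN worker daemons on this host only
/usr/local/hadoop/sbin/hadoop-daemon.sh start datanode
/usr/local/hadoop/sbin/yarn-daemon.sh start nodemanager
jps   # should now list DataNode and NodeManager
```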