These articles are my own technical notes. Please credit the source when reposting:
[1] https://segmentfault.com/u/yzwall
[2] blog.csdn.net/j_dark/node
Host PC: Ubuntu 16.04.1 LTS
Docker version: 17.03.1-ce, OS/Arch: linux/amd64
Hadoop version: hadoop-2.7.3
Create a container named container from the ubuntu image. By default Docker pulls the latest minimal official Ubuntu image:
    sudo docker run -ti --name container ubuntu /bin/bash
Edit the default sources file /etc/apt/sources.list and replace the official mirrors with mirrors inside China. Then install Java 8:
    # The Docker image strips many stock Ubuntu components to stay small;
    # run apt-get update first to refresh the package index.
    apt-get update
    # software-properties-common provides add-apt-repository
    apt-get install software-properties-common python-software-properties
    add-apt-repository ppa:webupd8team/java
    apt-get update
    apt-get install oracle-java8-installer
    java -version
Download and unpack Hadoop:
    # Create the nested directories
    mkdir -p /software/apache/hadoop
    cd /software/apache/hadoop
    # Download and extract Hadoop
    wget http://mirrors.sonic.net/apache/hadoop/common/hadoop-2.7.3/hadoop-2.7.3.tar.gz
    tar xvzf hadoop-2.7.3.tar.gz
Edit ~/.bashrc and append the following settings at the end of the file:
    export JAVA_HOME=/usr/lib/jvm/java-8-oracle
    export HADOOP_HOME=/software/apache/hadoop/hadoop-2.7.3
    export HADOOP_CONFIG_HOME=$HADOOP_HOME/etc/hadoop
    export PATH=$PATH:$HADOOP_HOME/bin
    export PATH=$PATH:$HADOOP_HOME/sbin
Run source ~/.bashrc to make the environment variables take effect.
Note: once ~/.bashrc is configured, hadoop-env.sh did not need any further changes in this setup.
Configuring Hadoop mainly means editing four files under $HADOOP_CONFIG_HOME: core-site.xml, hdfs-site.xml, mapred-site.xml, and yarn-site.xml.
Create the namenode, datanode, and tmp directories under $HADOOP_HOME:
    cd $HADOOP_HOME
    mkdir tmp
    mkdir namenode
    mkdir datanode
In core-site.xml, the hadoop.tmp.dir item points to the tmp directory, and the fs.default.name item points to the master node, configured as hdfs://master:9000 (in Hadoop 2.x, fs.default.name is the deprecated name of fs.defaultFS and still works):
    <configuration>
      <property>
        <!-- hadoop temp dir -->
        <name>hadoop.tmp.dir</name>
        <value>/software/apache/hadoop/hadoop-2.7.3/tmp</value>
        <description>A base for other temporary directories.</description>
      </property>
      <!-- Size of read/write buffer used in SequenceFiles. -->
      <property>
        <name>io.file.buffer.size</name>
        <value>131072</value>
      </property>
      <property>
        <name>fs.default.name</name>
        <value>hdfs://master:9000</value>
        <final>true</final>
        <description>The name of the default file system.</description>
      </property>
    </configuration>
In hdfs-site.xml, dfs.replication is the number of replicas kept for each block (not the number of nodes). The cluster has 1 NameNode and 3 DataNodes, so the replication factor is set to 3. dfs.namenode.name.dir and dfs.datanode.data.dir are set to the namenode and datanode directories created earlier:
    <configuration>
      <property>
        <name>dfs.namenode.secondary.http-address</name>
        <value>master:9001</value>
      </property>
      <property>
        <name>dfs.replication</name>
        <value>3</value>
        <final>true</final>
        <description>Default block replication.</description>
      </property>
      <property>
        <name>dfs.namenode.name.dir</name>
        <value>/software/apache/hadoop/hadoop-2.7.3/namenode</value>
        <final>true</final>
      </property>
      <property>
        <name>dfs.datanode.data.dir</name>
        <value>/software/apache/hadoop/hadoop-2.7.3/datanode</value>
        <final>true</final>
      </property>
      <property>
        <name>dfs.webhdfs.enabled</name>
        <value>true</value>
      </property>
    </configuration>
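Because each block replica has to live on a different DataNode, a replication factor larger than the number of DataNodes leaves blocks under-replicated. The following is a minimal sketch of a sanity check, not part of the original setup: check_replication is a hypothetical helper name, and it assumes a slaves-style file with one DataNode hostname per line.

```shell
# Hypothetical helper: compare a replication factor with the number of
# DataNode hosts listed in a slaves-style file (one hostname per line).
check_replication() {
    replication=$1
    slaves_file=$2
    # Count lines that are neither blank nor comments.
    # grep -c exits non-zero when the count is 0, hence "|| true".
    datanodes=$(grep -c -v -e '^[[:space:]]*$' -e '^[[:space:]]*#' "$slaves_file") || true
    if [ "$replication" -gt "$datanodes" ]; then
        echo "dfs.replication=$replication exceeds $datanodes DataNode(s)"
        return 1
    fi
    echo "dfs.replication=$replication fits $datanodes DataNode(s)"
}

# Example call (assumes the slaves file already exists):
#   check_replication 3 "$HADOOP_CONFIG_HOME/slaves"
```

With three DataNodes, a value of 3 passes this check and a value of 4 is reported as too large.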
Create mapred-site.xml with cp from the template shipped in the configuration directory $HADOOP_CONFIG_HOME:
    cd $HADOOP_CONFIG_HOME
    cp mapred-site.xml.template mapred-site.xml
Configure mapred-site.xml. Older guides point the mapred.job.tracker item at the master node, but that is no longer needed:
"In Hadoop 2.x.x, users do not need to configure mapred.job.tracker, because JobTracker no longer exists and its function is implemented by the MRAppMaster component. Instead, mapreduce.framework.name must be used to specify the name of the runtime framework, here yarn."
Source: 《Hadoop技术内幕:深入解析YARN架构设计与实现原理》
    <configuration>
      <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
      </property>
      <property>
        <name>mapreduce.jobhistory.address</name>
        <value>master:10020</value>
      </property>
      <property>
        <name>mapreduce.jobhistory.webapp.address</name>
        <value>master:19888</value>
      </property>
    </configuration>
Configure yarn-site.xml:
    <configuration>
      <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
      </property>
      <property>
        <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
        <value>org.apache.hadoop.mapred.ShuffleHandler</value>
      </property>
      <property>
        <name>yarn.resourcemanager.address</name>
        <value>master:8032</value>
      </property>
      <property>
        <name>yarn.resourcemanager.scheduler.address</name>
        <value>master:8030</value>
      </property>
      <property>
        <name>yarn.resourcemanager.resource-tracker.address</name>
        <value>master:8031</value>
      </property>
      <property>
        <name>yarn.resourcemanager.admin.address</name>
        <value>master:8033</value>
      </property>
      <property>
        <name>yarn.resourcemanager.webapp.address</name>
        <value>master:8088</value>
      </property>
    </configuration>
Install the packages that provide the ifconfig and ping commands:
    apt-get update
    apt-get install vim
    apt-get install net-tools       # for ifconfig
    apt-get install inetutils-ping  # for ping
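Assuming the current container is named container, save it as the base image ubuntu:hadoop. All later Hadoop cluster containers are created from this image, so nothing has to be configured again:
    sudo docker commit -m "hadoop installed" container ubuntu:hadoop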
Create the master container and the slave1 to slave3 containers from the base image ubuntu:hadoop, giving each one a hostname identical to its container name:
    Create master: docker run -ti -h master --name master ubuntu:hadoop /bin/bash
    Create slave1: docker run -ti -h slave1 --name slave1 ubuntu:hadoop /bin/bash
    Create slave2: docker run -ti -h slave2 --name slave2 ubuntu:hadoop /bin/bash
    Create slave3: docker run -ti -h slave3 --name slave3 ubuntu:hadoop /bin/bash
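Add the following entries to /etc/hosts in every container (IP address first, then hostname). Each container's IP address can be looked up with ifconfig:
    172.17.0.2 master
    172.17.0.3 slave1
    172.17.0.4 slave2
    172.17.0.5 slave3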
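Since the same four entries have to be added in every container, the step can be scripted. The sketch below is only an illustration: add_host_entry and add_cluster_hosts are hypothetical helper names, the IP addresses are the ones listed above and may differ on another machine, and each line is written in the "IP hostname" order that /etc/hosts expects.

```shell
# Append "IP<TAB>hostname" to a hosts file unless the hostname is already there.
add_host_entry() {
    hosts_file=$1
    ip=$2
    name=$3
    # -q quiet, -w whole-word match (slave1 does not match slave10), -F literal string.
    if grep -qwF -- "$name" "$hosts_file" 2>/dev/null; then
        return 0
    fi
    printf '%s\t%s\n' "$ip" "$name" >> "$hosts_file"
}

# Add all four cluster nodes; addresses are assumptions taken from this setup.
add_cluster_hosts() {
    add_host_entry "$1" 172.17.0.2 master
    add_host_entry "$1" 172.17.0.3 slave1
    add_host_entry "$1" 172.17.0.4 slave2
    add_host_entry "$1" 172.17.0.5 slave3
}

# Inside a container one would run:
#   add_cluster_hosts /etc/hosts
```

Running add_cluster_hosts twice on the same file leaves it unchanged the second time, so it is safe to repeat after a restart.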
Note: after a Docker container restarts, the contents of its hosts file may be lost. Lacking a better solution for now, I simply avoid restarting the containers often; otherwise the hosts file has to be configured again by hand.
See http://dockone.io/question/400:
1. /etc/hosts, /etc/resolv.conf, and /etc/hostname: these three files in a container are not part of the image. They live in /var/lib/docker/containers/<container_id> and are mounted into the container when it starts. So if you modify them inside the container, the changes do not go into the container's top layer but are written directly to these three physical files.
2. Why are the changes gone after a restart? Because Docker rebuilds /etc/hosts every time it starts a container. Why does it do that? Because a container's IP address can change on restart, which makes the old IP addresses in the hosts file invalid; the file therefore has to be regenerated, otherwise it would contain stale data.
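One possible way to avoid editing /etc/hosts by hand, which the original setup does not use, is a user-defined bridge network: on such a network Docker's embedded DNS resolves other containers by name. The sketch below only prints the commands by default. The network name hadoop-net and the RUN switch are assumptions made for this example, and -d is added so the containers start in the background.

```shell
# Dry run by default: commands are printed, not executed.
# Export RUN as an empty string to execute them for real.
RUN="${RUN-echo}"
NETWORK="hadoop-net"   # hypothetical network name

create_cluster() {
    $RUN docker network create "$NETWORK"
    for node in master slave1 slave2 slave3; do
        # Hostname and container name are kept identical, as in the manual steps.
        $RUN docker run -dti --network "$NETWORK" -h "$node" --name "$node" ubuntu:hadoop /bin/bash
    done
}

create_cluster
```

After starting the containers this way, a shell can be opened with docker exec -ti master /bin/bash, and ping slave1 from master should resolve without any hosts entries.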
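Install SSH and generate a key pair in every container:
    apt-get update
    apt-get install ssh
    apt-get install openssh-server
    # Generate a key with an empty passphrase; the key files are placed in ~/.ssh
    ssh-keygen -t rsa -P ""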
Append the generated public key to authorized_keys:
    cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
Edit sshd_config so that ssh can log in remotely as root on the other nodes:
    vim /etc/ssh/sshd_config
    # Change "PermitRootLogin prohibit-password" to "PermitRootLogin yes"
    # Restart the ssh service
    service ssh restart
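Copy authorized_keys from the master node into ~/.ssh on each slave node, overwriting the file of the same name. This keeps the authorized keys identical on all nodes, so the listed key can be used to log in to any node over ssh:
    cd ~/.ssh
    scp authorized_keys root@slave1:~/.ssh/
    scp authorized_keys root@slave2:~/.ssh/
    scp authorized_keys root@slave3:~/.ssh/
    chmod 600 ~/.ssh/authorized_keys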
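Check whether the ssh service is running: ps -e | grep ssh
Start the ssh service: service ssh start
Restart the ssh service: service ssh restart
After completing the SSH steps above, the containers can reach each other over ssh.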
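Before starting Hadoop it can help to confirm that passwordless login really works for every node. The following is a rough sketch under stated assumptions: check_ssh and the SSH_CMD override are hypothetical names introduced here, while BatchMode and ConnectTimeout are standard OpenSSH client options.

```shell
# Command used to connect; overridable so the function can be exercised without a network.
SSH_CMD="${SSH_CMD:-ssh}"

# Try a no-op remote command on each host and report the result.
check_ssh() {
    failed=0
    for host in "$@"; do
        # BatchMode=yes makes ssh fail instead of prompting for a password.
        if $SSH_CMD -o BatchMode=yes -o ConnectTimeout=5 "root@$host" true; then
            echo "$host: ok"
        else
            echo "$host: FAILED"
            failed=1
        fi
    done
    return "$failed"
}

# Example call from the master container:
#   check_ssh slave1 slave2 slave3
```

A host whose key has not yet been accepted into known_hosts will probably be reported as FAILED as well, because batch mode cannot answer the host key prompt; logging in once by hand should clear that.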
On the master node, edit the slaves file to list the slave nodes:
    cd $HADOOP_CONFIG_HOME/
    vim slaves
Replace its contents with:
    slave1
    slave2
    slave3
On the master node, run hdfs namenode -format. Output similar to the following means the NameNode was formatted successfully:
    common.Storage: Storage directory /software/apache/hadoop/hadoop-2.7.3/namenode has been successfully formatted.
Run start-all.sh to start the cluster (as the output itself notes, the script is deprecated in favor of start-dfs.sh and start-yarn.sh):
    root@master:/# start-all.sh
    This script is Deprecated. Instead use start-dfs.sh and start-yarn.sh
    Starting namenodes on [master]
    The authenticity of host 'master (172.17.0.2)' can't be established.
    ECDSA key fingerprint is SHA256:OewrSOYpvfDE6ixf6Gw9U7I9URT2zDCCtDJ6tjuZz/4.
    Are you sure you want to continue connecting (yes/no)? yes
    master: Warning: Permanently added 'master,172.17.0.2' (ECDSA) to the list of known hosts.
    master: starting namenode, logging to /software/apache/hadoop/hadoop-2.7.3/logs/hadoop-root-namenode-master.out
    slave3: starting datanode, logging to /software/apache/hadoop/hadoop-2.7.3/logs/hadoop-root-datanode-slave3.out
    slave2: starting datanode, logging to /software/apache/hadoop/hadoop-2.7.3/logs/hadoop-root-datanode-slave2.out
    slave1: starting datanode, logging to /software/apache/hadoop/hadoop-2.7.3/logs/hadoop-root-datanode-slave1.out
    Starting secondary namenodes [master]
    master: starting secondarynamenode, logging to /software/apache/hadoop/hadoop-2.7.3/logs/hadoop-root-secondarynamenode-master.out
    starting yarn daemons
    starting resourcemanager, logging to /software/apache/hadoop/hadoop-2.7.3/logs/yarn-root-resourcemanager-master.out
    slave3: starting nodemanager, logging to /software/apache/hadoop/hadoop-2.7.3/logs/yarn-root-nodemanager-slave3.out
    slave1: starting nodemanager, logging to /software/apache/hadoop/hadoop-2.7.3/logs/yarn-root-nodemanager-slave1.out
    slave2: starting nodemanager, logging to /software/apache/hadoop/hadoop-2.7.3/logs/yarn-root-nodemanager-slave2.out
Run jps on the master node and on each slave node.
master:
    root@master:/# jps
    2065 Jps
    1446 NameNode
    1801 ResourceManager
    1641 SecondaryNameNode
slave1:
    1107 NodeManager
    1220 Jps
    1000 DataNode
slave2:
    241 DataNode
    475 Jps
    348 NodeManager
slave3:
    500 Jps
    388 NodeManager
    281 DataNode
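Reading the jps listings by eye works for four nodes, but the check can also be scripted. The sketch below is an illustration only: missing_daemons is a hypothetical helper that reads jps output on standard input and prints every expected daemon it does not find.

```shell
# Print "missing: <name>" for each expected daemon absent from jps output on stdin.
missing_daemons() {
    jps_output=$(cat)
    status=0
    for daemon in "$@"; do
        # jps prints "<pid> <ClassName>"; compare the second field exactly,
        # so NameNode is not satisfied by SecondaryNameNode.
        if ! printf '%s\n' "$jps_output" |
            awk -v d="$daemon" '$2 == d { found = 1 } END { exit !found }'; then
            echo "missing: $daemon"
            status=1
        fi
    done
    return "$status"
}

# Example calls:
#   jps | missing_daemons NameNode SecondaryNameNode ResourceManager   # on master
#   jps | missing_daemons DataNode NodeManager                         # on a slave
```

The function returns a non-zero status when anything is missing, so it can be used in a startup script's if condition.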
Create the input directory /hadoopinput in HDFS and store the input file LICENSE.txt in it:
    root@master:/# hdfs dfs -mkdir -p /hadoopinput
    root@master:/# hdfs dfs -put LICENSE.txt /hadoopinput
Change to $HADOOP_HOME/share/hadoop/mapreduce and submit the wordcount job to the cluster, saving the result in the /hadoopoutput directory in HDFS:
    root@master:/# cd $HADOOP_HOME/share/hadoop/mapreduce
    root@master:/software/apache/hadoop/hadoop-2.7.3/share/hadoop/mapreduce# hadoop jar hadoop-mapreduce-examples-2.7.3.jar wordcount /hadoopinput /hadoopoutput
    17/05/26 01:21:34 INFO client.RMProxy: Connecting to ResourceManager at master/172.17.0.2:8032
    17/05/26 01:21:35 INFO input.FileInputFormat: Total input paths to process : 1
    17/05/26 01:21:35 INFO mapreduce.JobSubmitter: number of splits:1
    17/05/26 01:21:35 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1495722519742_0001
    17/05/26 01:21:36 INFO impl.YarnClientImpl: Submitted application application_1495722519742_0001
    17/05/26 01:21:36 INFO mapreduce.Job: The url to track the job: http://master:8088/proxy/application_1495722519742_0001/
    17/05/26 01:21:36 INFO mapreduce.Job: Running job: job_1495722519742_0001
    17/05/26 01:21:43 INFO mapreduce.Job: Job job_1495722519742_0001 running in uber mode : false
    17/05/26 01:21:43 INFO mapreduce.Job: map 0% reduce 0%
    17/05/26 01:21:48 INFO mapreduce.Job: map 100% reduce 0%
    17/05/26 01:21:54 INFO mapreduce.Job: map 100% reduce 100%
    17/05/26 01:21:55 INFO mapreduce.Job: Job job_1495722519742_0001 completed successfully
    17/05/26 01:21:55 INFO mapreduce.Job: Counters: 49
        File System Counters
            FILE: Number of bytes read=29366
            FILE: Number of bytes written=295977
            FILE: Number of read operations=0
            FILE: Number of large read operations=0
            FILE: Number of write operations=0
            HDFS: Number of bytes read=84961
            HDFS: Number of bytes written=22002
            HDFS: Number of read operations=6
            HDFS: Number of large read operations=0
            HDFS: Number of write operations=2
        Job Counters
            Launched map tasks=1
            Launched reduce tasks=1
            Data-local map tasks=1
            Total time spent by all maps in occupied slots (ms)=2922
            Total time spent by all reduces in occupied slots (ms)=3148
            Total time spent by all map tasks (ms)=2922
            Total time spent by all reduce tasks (ms)=3148
            Total vcore-milliseconds taken by all map tasks=2922
            Total vcore-milliseconds taken by all reduce tasks=3148
            Total megabyte-milliseconds taken by all map tasks=2992128
            Total megabyte-milliseconds taken by all reduce tasks=3223552
        Map-Reduce Framework
            Map input records=1562
            Map output records=12371
            Map output bytes=132735
            Map output materialized bytes=29366
            Input split bytes=107
            Combine input records=12371
            Combine output records=1906
            Reduce input groups=1906
            Reduce shuffle bytes=29366
            Reduce input records=1906
            Reduce output records=1906
            Spilled Records=3812
            Shuffled Maps =1
            Failed Shuffles=0
            Merged Map outputs=1
            GC time elapsed (ms)=78
            CPU time spent (ms)=1620
            Physical memory (bytes) snapshot=451264512
            Virtual memory (bytes) snapshot=3915927552
            Total committed heap usage (bytes)=348127232
        Shuffle Errors
            BAD_ID=0
            CONNECTION=0
            IO_ERROR=0
            WRONG_LENGTH=0
            WRONG_MAP=0
            WRONG_REDUCE=0
        File Input Format Counters
            Bytes Read=84854
        File Output Format Counters
            Bytes Written=22002
The result is saved in /hadoopoutput/part-r-00000. To view it:
    root@master:/# hdfs dfs -ls /hadoopoutput
    Found 2 items
    -rw-r--r-- 3 root supergroup 0 2017-05-26 01:21 /hadoopoutput/_SUCCESS
    -rw-r--r-- 3 root supergroup 22002 2017-05-26 01:21 /hadoopoutput/part-r-00000
    root@master:/# hdfs dfs -cat /hadoopoutput/part-r-00000
    ""AS 2
    "AS 16
    "COPYRIGHTS 1
    "Contribution" 2
    "Contributor" 2
    "Derivative 1
    "Legal 1
    "License" 1
    "License"); 1
    "Licensed 1
    "Licensor" 1
    ...
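The raw output is sorted by word, which makes it hard to see the most frequent ones. As a small sketch, assuming the default wordcount output format of one "word<TAB>count" pair per line, a hypothetical helper top_words can reorder it by count:

```shell
# Read "word<TAB>count" lines on stdin and print the N most frequent words.
top_words() {
    n=${1:-10}
    # Use a tab as the field separator and sort field 2 numerically, largest first.
    sort -t "$(printf '\t')" -k2,2nr | head -n "$n"
}

# Example call on the cluster:
#   hdfs dfs -cat /hadoopoutput/part-r-00000 | top_words 5
```

Doing the sort on the client keeps the example short; for large outputs a second MapReduce job would be the more scalable choice.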
With that, the Hadoop 2.7.3 cluster has been deployed successfully on a single machine with Docker 17.03.1-ce!
[1] http://tashan10.com/yong-dockerda-jian-hadoopwei-fen-bu-shi-ji-qun/
[2] http://blog.csdn.net/xiaoxiangzi222/article/details/52757168