After using Hadoop for a while, coming back to read the source code turns out to have a flavor of its own. Reviewing the old really does teach you something new.
Before using Hadoop we have to configure a few files: hadoop-env.sh, core-site.xml, hdfs-site.xml, mapred-site.xml. So when are these files actually picked up by Hadoop?
The script used most often to start Hadoop is start-all.sh. What does that script actually do?
# Start all hadoop daemons.  Run this on master node.
# Note: all Hadoop daemons are started from the master node.

bin=`dirname "$0"`
bin=`cd "$bin"; pwd`
# bin=$HADOOP_HOME/bin

if [ -e "$bin/../libexec/hadoop-config.sh" ]; then
  . "$bin"/../libexec/hadoop-config.sh
else
  . "$bin/hadoop-config.sh"
fi

# start dfs daemons
"$bin"/start-dfs.sh --config $HADOOP_CONF_DIR

# start mapred daemons
"$bin"/start-mapred.sh --config $HADOOP_CONF_DIR
if [ -f "${HADOOP_CONF_DIR}/hadoop-env.sh" ]; then
  . "${HADOOP_CONF_DIR}/hadoop-env.sh"
fi
Once $HADOOP_HOME/conf/hadoop-env.sh is found to be a regular file, it is executed via . "${HADOOP_CONF_DIR}/hadoop-env.sh", and the JAVA_HOME we configured in hadoop-env.sh takes effect. Honestly, I feel this setting is not strictly necessary. Why? Because installing Hadoop on Linux means installing Java anyway, and JAVA_HOME is normally configured at that point; an environment variable set in /etc/profile is visible in every shell process.
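That last point is easy to verify: a JVM inherits the environment of the shell that launched it, so a value exported in /etc/profile (or in hadoop-env.sh before the daemon starts) is visible from Java too. A minimal sketch, assuming JAVA_HOME has been exported:

public class EnvCheck {
    public static void main(String[] args) {
        // JAVA_HOME is inherited from the parent shell's environment,
        // e.g. an export in /etc/profile
        String javaHome = System.getenv("JAVA_HOME");
        System.out.println("JAVA_HOME = " + javaHome);
    }
}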
Next, look at start-dfs.sh:

# Run this on master node.

usage="Usage: start-dfs.sh [-upgrade|-rollback]"

bin=`dirname "$0"`
bin=`cd "$bin"; pwd`

if [ -e "$bin/../libexec/hadoop-config.sh" ]; then
  . "$bin"/../libexec/hadoop-config.sh
else
  . "$bin/hadoop-config.sh"
fi

# get arguments
if [ $# -ge 1 ]; then
  nameStartOpt=$1
  shift
  case $nameStartOpt in
    (-upgrade)
      ;;
    (-rollback)
      dataStartOpt=$nameStartOpt
      ;;
    (*)
      echo $usage
      exit 1
      ;;
  esac
fi

# start dfs daemons
# start namenode after datanodes, to minimize time namenode is up w/o data
# note: datanodes will log connection errors until namenode starts
"$bin"/hadoop-daemon.sh --config $HADOOP_CONF_DIR start namenode $nameStartOpt
"$bin"/hadoop-daemons.sh --config $HADOOP_CONF_DIR start datanode $dataStartOpt
"$bin"/hadoop-daemons.sh --config $HADOOP_CONF_DIR --hosts masters start secondarynamenode
A close look shows that start-dfs.sh also executes hadoop-config.sh. The reason is that we do not always start every Hadoop daemon with start-all.sh; sometimes we only need HDFS and not MapReduce, in which case start-dfs.sh is run on its own. The variables defined in hadoop-config.sh are needed by the HDFS daemons too, so hadoop-config.sh is executed (and hadoop-env.sh along with it) before the namenode, datanode and secondarynamenode are started.

Now look at the last three lines: they start the namenode, datanode and secondarynamenode respectively. A full Hadoop startup produces five processes, and these three are among them. Since each is launched as a separate process, the corresponding class must have a main method, which a glance at the source confirms. That is not the interesting part, though; the interesting part is how those classes load the configuration files. Whether it is the namenode, the datanode or the secondarynamenode, each loads core-*.xml and hdfs-*.xml at startup. Take org.apache.hadoop.hdfs.server.namenode.NameNode as the example; the other two classes, org.apache.hadoop.hdfs.server.datanode.DataNode and org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode, behave the same way.
public class NameNode implements ClientProtocol, DatanodeProtocol,
                                 NamenodeProtocol, FSConstants,
                                 RefreshAuthorizationPolicyProtocol,
                                 RefreshUserMappingsProtocol {
  static {
    Configuration.addDefaultResource("hdfs-default.xml");
    Configuration.addDefaultResource("hdfs-site.xml");
  }
  ...
}
Look at the content of the static block and you see exactly what we are after: hdfs-default.xml and hdfs-site.xml. This is the key point. A static code block runs when the class is loaded into the JVM and initialized (class initialization, not object initialization). Before Configuration.addDefaultResource("hdfs-default.xml") can execute, the Configuration class itself has to be loaded into the JVM, so let's see what the static code block of org.apache.hadoop.conf.Configuration does.
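That initialization order is guaranteed by the JVM: the first active use of a class (for example calling one of its static methods) triggers its static initializer before anything else proceeds. A tiny stand-alone sketch, with A standing in for Configuration and B standing in for NameNode (the names are only for illustration):

public class InitOrder {
    static class A {                       // plays the role of Configuration
        static {
            System.out.println("A initialized");
        }
        static void register(String name) {
            System.out.println("register " + name);
        }
    }

    static class B {                       // plays the role of NameNode
        static {
            // calling A.register is an active use of A, so A's static block
            // runs to completion before these calls return
            A.register("hdfs-default.xml");
            A.register("hdfs-site.xml");
            System.out.println("B initialized");
        }
    }

    public static void main(String[] args) {
        new B();   // triggers B's initialization, which in turn triggers A's
    }
    // prints: A initialized, register hdfs-default.xml,
    //         register hdfs-site.xml, B initialized
}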
static {
  // print deprecation warning if hadoop-site.xml is found in classpath
  ClassLoader cL = Thread.currentThread().getContextClassLoader();
  if (cL == null) {
    cL = Configuration.class.getClassLoader();
  }
  if (cL.getResource("hadoop-site.xml") != null) {
    LOG.warn("DEPRECATED: hadoop-site.xml found in the classpath. " +
        "Usage of hadoop-site.xml is deprecated. Instead use core-site.xml, "
        + "mapred-site.xml and hdfs-site.xml to override properties of " +
        "core-default.xml, mapred-default.xml and hdfs-default.xml " +
        "respectively");
  }
  addDefaultResource("core-default.xml");
  addDefaultResource("core-site.xml");
}
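The getResource call above is an ordinary classpath lookup; it works because the conf directory ($HADOOP_CONF_DIR) ends up on the daemon's classpath via the startup scripts, which is also how the *-site.xml files registered with addDefaultResource are found later. A minimal sketch of the same kind of check (the class name and the choice of file are just examples):

import java.net.URL;

public class ClasspathCheck {
    public static void main(String[] args) {
        ClassLoader cl = Thread.currentThread().getContextClassLoader();
        if (cl == null) {
            cl = ClasspathCheck.class.getClassLoader();
        }
        // returns a URL if core-site.xml is somewhere on the classpath,
        // null otherwise
        URL url = cl.getResource("core-site.xml");
        System.out.println("core-site.xml -> " + url);
    }
}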
So the Configuration class loads core-default.xml and core-site.xml during its own class initialization. As a result, by the time the namenode has started it has loaded both core-*.xml and hdfs-*.xml, with the core-*.xml pair contributed by the Configuration class.
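Putting the two static blocks together: simply initializing the NameNode class is enough to register all four files as default resources, and every Configuration created afterwards will read them. A minimal sketch, assuming the Hadoop 1.x jars and $HADOOP_CONF_DIR are on the classpath (missing default resources are skipped quietly, so the values may be null outside a real installation; fs.default.name is the classic Hadoop 1.x filesystem key):

import org.apache.hadoop.conf.Configuration;

public class ConfLoadDemo {
    public static void main(String[] args) throws Exception {
        // initializing the class runs its static block, which registers
        // hdfs-default.xml / hdfs-site.xml (and, by pulling in Configuration,
        // core-default.xml / core-site.xml as well)
        Class.forName("org.apache.hadoop.hdfs.server.namenode.NameNode");

        Configuration conf = new Configuration();
        System.out.println("fs.default.name = " + conf.get("fs.default.name"));
        System.out.println("dfs.replication = " + conf.get("dfs.replication"));
    }
}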
start-mapred.sh:

# Start hadoop map reduce daemons.  Run this on master node.

bin=`dirname "$0"`
bin=`cd "$bin"; pwd`

if [ -e "$bin/../libexec/hadoop-config.sh" ]; then
  . "$bin"/../libexec/hadoop-config.sh
else
  . "$bin/hadoop-config.sh"
fi

# start mapred daemons
# start jobtracker first to minimize connection errors at startup
"$bin"/hadoop-daemon.sh --config $HADOOP_CONF_DIR start jobtracker
"$bin"/hadoop-daemons.sh --config $HADOOP_CONF_DIR start tasktracker
This script likewise executes hadoop-config.sh, and with it hadoop-env.sh, consistent with start-dfs.sh. The last two lines start the jobtracker and tasktracker processes, which correspond to the classes org.apache.hadoop.mapred.JobTracker and org.apache.hadoop.mapred.TaskTracker.
public class JobTracker implements MRConstants, InterTrackerProtocol,
    JobSubmissionProtocol, TaskTrackerManager, RefreshUserMappingsProtocol,
    RefreshAuthorizationPolicyProtocol, AdminOperationsProtocol,
    JobTrackerMXBean {
  static {
    Configuration.addDefaultResource("mapred-default.xml");
    Configuration.addDefaultResource("mapred-site.xml");
  }
  ...
}
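The pattern on the MapReduce side is identical: once the JobTracker class has been initialized, mapred-default.xml and mapred-site.xml join the default resources, and a plain Configuration then sees properties such as mapred.job.tracker (the classic Hadoop 1.x jobtracker address key). A minimal sketch under the same assumptions as the HDFS example above:

import org.apache.hadoop.conf.Configuration;

public class MapredConfDemo {
    public static void main(String[] args) throws Exception {
        // class initialization registers mapred-default.xml and mapred-site.xml
        Class.forName("org.apache.hadoop.mapred.JobTracker");

        Configuration conf = new Configuration();
        System.out.println("mapred.job.tracker = "
                + conf.get("mapred.job.tracker"));
    }
}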