Today I installed Hadoop on the company's CentOS 7 server, following this installation tutorial and also this blog.
The installation went roughly as follows:
1. Standalone installation
> mkdir /opt/hadoop/input
> cp $HADOOP_HOME/*.txt /opt/hadoop/input
> hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.5.jar wordcount /opt/hadoop/input /opt/hadoop/output
2. Pseudo-distributed installation (just modify the configuration on top of the standalone setup)
In $HADOOP_HOME/etc/hadoop/hadoop-env.sh, point JAVA_HOME at the JDK:
export JAVA_HOME=/usr/local/jdk1.8.0_181
core-site.xml:
<configuration>
   <property>
      <name>fs.default.name</name>
      <value>hdfs://localhost:9000</value>
   </property>
</configuration>
hdfs-site.xml:
<configuration>
   <property>
      <name>dfs.replication</name>
      <value>1</value>
   </property>
   <property>
      <name>dfs.name.dir</name>
      <value>file:///home/hadoop/hadoopinfra/hdfs/namenode</value>
   </property>
   <property>
      <name>dfs.data.dir</name>
      <value>file:///home/hadoop/hadoopinfra/hdfs/datanode</value>
   </property>
</configuration>
yarn-site.xml:
<configuration>
   <property>
      <name>yarn.nodemanager.aux-services</name>
      <value>mapreduce_shuffle</value>
   </property>
</configuration>
mapred-site.xml:
<configuration>
   <property>
      <name>mapreduce.framework.name</name>
      <value>yarn</value>
   </property>
</configuration>
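The notes jump straight from the configuration to running a job; in a standard Hadoop 2.x setup the NameNode is formatted and the daemons started first. A sketch of those steps (my addition, the same commands appear in the fully distributed section below):

hadoop namenode -format    # format the NameNode once, before the first start
start-dfs.sh               # start NameNode / DataNode / SecondaryNameNode
start-yarn.sh              # start ResourceManager / NodeManager
jps                        # verify that the daemons are running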
hadoop fs -mkdir /user/hadoop/inputs    # create the directory
hadoop fs -put /opt/hadoop/input/*.txt /user/hadoop/inputs    # upload files into the HDFS file system
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.5.jar wordcount inputs outputs    # inputs and outputs are directories in HDFS, not local paths
3. Fully distributed installation (building on the pseudo-distributed setup; two machines are used as the example)
Machine 01: IP 10.12.28.27
Machine 02: IP 10.12.28.144
To save typing, use hostnames instead of IPs. On both machines run:
sudo vi /etc/hosts
and add:
10.12.28.27 master
10.12.28.144 slave1
After that, logging in from 01 to 02 is just:
ssh hadoop@slave1    # equivalent to ssh hadoop@10.12.28.144
Logging in from 01 to 02 still asks for the user's password. To set up passwordless (key-based) SSH:
Generate an SSH key pair: ssh-keygen -t rsa    (the public and private keys are written to ~/.ssh)
Send the public key to 02: ssh-copy-id hadoop@10.12.28.144
Do the same on 02 to allow passwordless login from 02 back to 01.
Java and Hadoop can simply be copied from 01 to 02 instead of installing them again:
scp -r /opt/hadoop/hadoop-2.8.5 hadoop@10.12.28.144:/opt/hadoop/
hadoop namenode -format    # re-run the format command
start-dfs.sh               # start HDFS
Running jps on master now shows NameNode and SecondaryNameNode; jps on slave1 shows DataNode.
start-yarn.sh              # start YARN
master additionally gets a ResourceManager, and slave1 a NodeManager.
Below are the problems I ran into during the installation:
1. JDK version problem
After switching to the latest JDK 11, running Hadoop prints messages like the following and may fail. I recommend installing JDK 8 or an earlier version.
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.ibatis.reflection.Reflector (file:/C:/Users/jiangcy/.m2/repository/org/mybatis/mybatis/3.4.5/mybatis-3.4.5.jar) to method java.lang.Object.finalize()
WARNING: Please consider reporting this to the maintainers of org.apache.ibatis.reflection.Reflector
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
2. ~/.bashrc configuration problem (environment variables)
After installing the JDK, ~/.bashrc has to be configured. The tutorial referenced above gives:
export JAVA_HOME=/usr/local/jdk1.8.0_181
export PATH=PATH:$JAVA_HOME/bin
export HADOOP_HOME=/opt/hadoop/hadoop-2.8.5
After running source ~/.bashrc to apply it, other commands stop working, e.g. '-bash: ls: command not found' (because PATH has been clobbered).
It should be changed to:
export JAVA_HOME=/usr/local/jdk1.8.0_181
export PATH=$JAVA_HOME/bin:$PATH
export HADOOP_HOME=/opt/hadoop/hadoop-2.8.5
export PATH=$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH
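A quick sanity check after sourcing the corrected file (my addition; if any of these fails, PATH is still wrong):

source ~/.bashrc
echo $PATH          # the JDK and Hadoop bin directories should appear in front of the system paths
which ls java       # both should resolve
hadoop version      # should print the Hadoop 2.8.5 version banner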
3. SSH setup and key generation
Configure passwordless SSH login to the local machine. Strictly speaking, passwordless login is not required for installing Hadoop, but without it every start of Hadoop asks for a password to log in to each DataNode, which becomes a real nuisance once the cluster grows.
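A minimal sketch of enabling passwordless login to the local machine (my addition; assumes the hadoop user and the key pair generated with ssh-keygen -t rsa as above):

cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys   # authorize our own public key
chmod 600 ~/.ssh/authorized_keys                  # sshd refuses keys with loose permissions
ssh localhost                                     # should now log in without a password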
4. hadoop fs -mkdir /user/input fails
Before inserting data into HDFS a directory has to be created, but running the command above fails with: hadoop fs -mkdir: No such file or directory.
Reason: the default working directory in HDFS is /user/<your login user>, but a fresh HDFS file system may contain only the root directory. Note that the HDFS file system and the local file system are not the same thing. Reference.
So the missing directories have to be created first. For reference, the common HDFS operations are listed below, followed by a concrete sketch.
Common HDFS operations
1. Commands against HDFS take the form hadoop fs <option>
  1.1 -ls     list the next level of an HDFS directory
  1.2 -lsr    list an HDFS directory recursively
  1.3 -mkdir  create a directory
  1.4 -put    upload a file from Linux to HDFS
  1.5 -get    download a file from HDFS to Linux
  1.6 -text   print a file's contents
  1.7 -rm     delete a file; -rm -r deletes a directory
  1.8 -rmr    delete recursively
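A sketch of actually creating the working directory (my addition; -p creates missing parent directories, and the login user is assumed to be hadoop):

hadoop fs -mkdir -p /user/hadoop        # create the default working directory
hadoop fs -mkdir /user/hadoop/inputs    # now the relative path 'inputs' resolves under it
hadoop fs -ls /user/hadoop              # verify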
5. Running Python scripts with MapReduce streaming
First locate the streaming jar inside the Hadoop installation with: find / -name 'hadoop-streaming*.jar'. Different versions may keep it in different places; mine is at /opt/hadoop/hadoop-2.8.5/share/hadoop/tools/lib/hadoop-streaming-2.8.5.jar
Run the job with:
hadoop jar hadoop-streaming-2.8.5.jar -input inputs -output py_outs -mapper /opt/hadoop/mapper.py -reducer /opt/hadoop/reducer.py
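If the job fails because the scripts are missing or not executable on the worker nodes, the usual fix is to make them executable and ship them with the job. A hedged variant of the same command (my addition; -file is a standard Hadoop streaming option, the paths are the ones used above):

chmod +x /opt/hadoop/mapper.py /opt/hadoop/reducer.py
hadoop jar /opt/hadoop/hadoop-2.8.5/share/hadoop/tools/lib/hadoop-streaming-2.8.5.jar \
    -input inputs -output py_outs \
    -mapper mapper.py -reducer reducer.py \
    -file /opt/hadoop/mapper.py -file /opt/hadoop/reducer.py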
The Python word-count scripts:
mapper.py
#!/usr/bin/env python
# -*- coding:UTF-8 -*-
import sys

# input comes from standard input (stdin)
for line in sys.stdin:
    # strip leading/trailing whitespace
    line = line.strip()
    # split the line into words on whitespace
    words = line.split()
    for word in words:
        # emit every word as "word<TAB>1" so it can feed the reducer
        print('%s\t%s' % (word, 1))
reducer.py
#!/usr/bin/env python
# -*- coding:UTF-8 -*-
import sys

current_word = None
current_count = 0
word = None

# read from standard input, i.e. the output of mapper.py
for line in sys.stdin:
    # strip leading/trailing whitespace
    line = line.strip()
    # parse the mapper output, which is tab-separated
    word, count = line.split('\t', 1)
    # convert count from string to int
    try:
        count = int(count)
    except ValueError:
        # count is not a number; ignore this line
        continue
    # this relies on the mapper output being sorted by key, so that all
    # occurrences of a word arrive consecutively; Hadoop sorts it automatically
    if current_word == word:
        current_count += count
    else:
        if current_word:
            # emit the finished word's count to standard output
            print('%s\t%s' % (current_word, current_count))
        current_count = count
        current_word = word

# emit the count for the last word
if current_word == word:
    print('%s\t%s' % (current_word, current_count))
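The two scripts can be tested locally before submitting the streaming job, with sort standing in for Hadoop's shuffle (my addition; assumes the scripts are executable and uses one of the text files uploaded earlier):

cat /opt/hadoop/input/README.txt | /opt/hadoop/mapper.py | sort | /opt/hadoop/reducer.py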
6. Startup error after the cluster installation
slave1: /opt/hadoop/hadoop-2.8.5/bin/hdfs: line 305: /usr/local/jdk1.8.0_181/bin/java: No such file or directory
slave1: /opt/hadoop/hadoop-2.8.5/bin/hdfs: line 305: exec: /usr/local/jdk1.8.0_181/bin/java: cannot execute: No such file or directory
The Hadoop configuration on the slave1 node has to be adjusted; in my case the cause was that the JDK is installed in a different directory there. Fix the JDK path in hadoop-env.sh, mapred-env.sh and yarn-env.sh:
export JAVA_HOME=/usr/local/java/jdk1.8.0_131
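A quick way to find the correct value on each node before editing the three files (my addition; the readlink trick assumes java is already on that node's PATH):

ssh hadoop@slave1 'readlink -f $(which java)'   # prints .../jdkX.Y.Z/bin/java; JAVA_HOME is everything before /bin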
7. Error uploading data to HDFS after the cluster installation
[hadoop@localhost ~]$ hadoop fs -put /opt/hadoop/input/*.txt /user/hadoop/inputs
18/10/18 16:14:24 WARN hdfs.DataStreamer: DataStreamer Exception
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /user/hadoop/inputs/LICENSE.txt._COPYING_ could only be replicated to 0 nodes instead of minReplication (=1). There are 0 datanode(s) running and no node(s) are excluded in this operation.
    at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1726)
    at org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.chooseTargetForNewBlock(FSDirWriteFileOp.java:265)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2567)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:829)
    ...
put: File /user/hadoop/inputs/README.txt._COPYING_ could only be replicated to 0 nodes instead of minReplication (=1). There are 0 datanode(s) running and no node(s) are excluded in this operation.
Reason: see the reference.
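The notes defer to an external reference for the cause. The message itself says no DataNodes are running, so a first diagnostic step (my addition, not taken from that reference) is to check whether any DataNodes have registered with the NameNode and, if not, look at the DataNode log on slave1:

hdfs dfsadmin -report    # 'Live datanodes (0)' confirms none have registered
ssh hadoop@slave1 jps    # is a DataNode process running at all?
ssh hadoop@slave1 'tail -n 50 /opt/hadoop/hadoop-2.8.5/logs/hadoop-hadoop-datanode-*.log'    # default log location; typical causes (e.g. clusterID mismatch after re-formatting) show up here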