Eclipse 3.3 (Windows 7) connecting to a remote Hadoop cluster (Red Hat Enterprise Linux 5) and testing a program
Machine name    IP             Role
NameNode        192.168.1.1    NameNode, master, JobTracker
DataNode1       192.168.1.2    DataNode, slave, TaskTracker
DataNode2       192.168.1.3    DataNode, slave, TaskTracker
Each machine should have at least 1 GB of RAM; 2 GB is better. After installing Linux 5, you can skip starting the graphical interface to save memory.
Installation steps:
Step 1: Install Red Hat Enterprise Linux 5
Install Linux from the installation media. After installation, set the machine name: $ hostname <machine name> (see the note at the end of this step for making the change permanent).
Add the machine names and their IPs to /etc/hosts:
127.0.0.1 localhost
192.168.1.1 NameNode
192.168.1.2 DataNode1
192.168.1.3 DataNode2
Edit /etc/inittab:
change id:5:initdefault: to id:3:initdefault:
After a reboot the OS will no longer enter the graphical interface.
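Note: the hostname command above only changes the name for the current session. To make it survive a reboot on RHEL 5, also set HOSTNAME in /etc/sysconfig/network; a minimal sketch of that file (values assumed for the NameNode):
NETWORKING=yes
HOSTNAME=NameNode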
Step 2: Start the SSH service
# service sshd start
You can test it from Windows with SSH Secure Shell Client.
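A quick sanity check that sshd is running and enabled at boot (standard RHEL 5 service commands):
# service sshd status
# chkconfig sshd on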
Step 3: Disable the firewall (on all machines)
# chkconfig --level 2345 iptables off
Note: this step is very important. If the firewall is not disabled, the NameNode will fail to find the DataNodes.
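chkconfig only affects subsequent boots; to stop the firewall that is already running without rebooting:
# service iptables stop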
Step 4: Set up passwordless SSH login
(1) Enable passwordless login to the local machine on the NameNode:
$ ssh-keygen -t rsa
Just press Enter at each prompt; when it finishes, two files are generated under ~/.ssh/: id_rsa and id_rsa.pub.
$ ssh-keygen -t dsa
Just press Enter at each prompt; when it finishes, two files are generated under ~/.ssh/: id_dsa and id_dsa.pub.
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys    (append the generated key to the keychain)
$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys    (append the generated key to the keychain)
$ chmod 600 ~/.ssh/authorized_keys
(2) Enable passwordless login from the NameNode to the other DataNodes:
Append the NameNode's authorized_keys file to each DataNode's authorized_keys (using node 192.168.1.2 as an example):
a. Copy the NameNode's authorized_keys file:
$ scp ~/.ssh/authorized_keys mark@192.168.1.2:/home/mark/
b. Log in to 192.168.1.2 and run: $ cat authorized_keys >> ~/.ssh/authorized_keys
Repeat the same steps on the other DataNodes.
Note: if the NameNode still cannot log in to a DataNode after this configuration, fix the permissions of the DataNode's authorized_keys file (very important!):
$ chmod 600 ~/.ssh/authorized_keys
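To verify the setup, ssh from the NameNode to itself and to each DataNode; no password prompt should appear (host names as defined in /etc/hosts above):
$ ssh localhost hostname
$ ssh DataNode1 hostname
$ ssh DataNode2 hostname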
Step 5: Install JDK 1.6
Download it from http://java.sun.com/javase/downloads/widget/jdk6.jsp and install it directly. In this example the installation path is /usr/java/jdk1.6.0_31.
After installation, add the following lines to /etc/profile:
export JAVA_HOME=/usr/java/jdk1.6.0_31
export JRE_HOME=/usr/java/jdk1.6.0_31/jre
export CLASSPATH=.:$JAVA_HOME/lib:$JRE_HOME/lib:$CLASSPATH
export PATH=$JAVA_HOME/bin:$JRE_HOME/bin:$PATH
Note: the Java environment should be identical on every machine. If the installation is interrupted, switch to root and install again.
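To load the new variables in the current shell and confirm the JDK is visible:
$ source /etc/profile
$ java -version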
Step 6: Install Hadoop
Download hadoop-0.20.2.tar.gz.
Extract it: $ tar -zxvf hadoop-0.20.2.tar.gz
Add the Hadoop installation path to /etc/profile:
export HADOOP_HOME=/home/mark/hadoop-0.20.2
export PATH=$HADOOP_HOME/bin:$PATH
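After re-sourcing /etc/profile, a quick check that the hadoop script is on the PATH:
$ source /etc/profile
$ hadoop version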
Step 7: Configure Hadoop
The main Hadoop configuration files are under hadoop-0.20.2/conf.
(1) Set the Java environment in conf/hadoop-env.sh (the same on the NameNode and the DataNodes):
$ gedit hadoop-env.sh
and add (or uncomment) the line:
export JAVA_HOME=/usr/java/jdk1.6.0_31
(2) Configure the conf/masters and conf/slaves files (only on the NameNode):
masters:
192.168.1.1
slaves:
192.168.1.2
192.168.1.3
(3) Configure conf/core-site.xml, conf/hdfs-site.xml, and conf/mapred-site.xml (minimal configuration; the DataNode configuration is the same):
core-site.xml:
<configuration>
<!-- global properties -->
<property>
<name>hadoop.tmp.dir</name>
<value>/home/mark/tmp</value>
<description>Abase for other temporary directories.</description>
</property>
<!--file system properties -->
<property>
<name>fs.default.name</name>
<value>hdfs://192.168.1.1:9000</value>
</property>
</configuration>
hdfs-site.xml: (dfs.replication defaults to 3; if it is not changed and there are fewer than three DataNodes, errors will be reported)
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
mapred-site.xml:
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>192.168.1.1:9001</value>
</property>
</configuration>
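Since the DataNodes use the same hadoop-env.sh and *-site.xml files, one simple way to distribute them is to copy those files from the NameNode (paths assumed to match this example's install location):
$ scp /home/mark/hadoop-0.20.2/conf/{hadoop-env.sh,core-site.xml,hdfs-site.xml,mapred-site.xml} mark@192.168.1.2:/home/mark/hadoop-0.20.2/conf/
$ scp /home/mark/hadoop-0.20.2/conf/{hadoop-env.sh,core-site.xml,hdfs-site.xml,mapred-site.xml} mark@192.168.1.3:/home/mark/hadoop-0.20.2/conf/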
Step 8: Run Hadoop
Go into hadoop-0.20.2/bin and format the file system first: $ hadoop namenode -format
Start Hadoop: $ start-all.sh
or start the two layers separately:
$ ./start-dfs.sh
$ ./start-mapred.sh
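These scripts are run on the NameNode; they start the DataNode/TaskTracker daemons over SSH. If a daemon does not come up, check the logs under $HADOOP_HOME/logs on the affected machine (file names follow the hadoop-<user>-<daemon>-<host>.log pattern), for example:
$ tail -f $HADOOP_HOME/logs/hadoop-mark-namenode-*.log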
Check the running processes with the jps command. On the NameNode the output looks like this:
[mark@namenode ~]$ jps
8872 JobTracker
8650 NameNode
15183 Jps
8780 SecondaryNameNode
[mark@namenode ~]$
On a DataNode:
[mark@DataNode1 ~]$ jps
7346 DataNode
28263 Jps
7444 TaskTracker
[mark@DataNode1 ~]$
Check the cluster status: $ hadoop dfsadmin -report
[mark@namenode ~]$ hadoop dfsadmin -report
Configured Capacity: 222387527680 (207.11 GB)
Present Capacity: 201404645376 (187.57 GB)
DFS Remaining: 201404182528 (187.57 GB)
DFS Used: 462848 (452 KB)
DFS Used%: 0%
Under replicated blocks: 2
Blocks with corrupt replicas: 0
Missing blocks: 0
-------------------------------------------------
Datanodes available: 3 (3 total, 0 dead)

Name: 192.168.1.2:50010
Decommission Status : Normal
Configured Capacity: 60261593088 (56.12 GB)
DFS Used: 167936 (164 KB)
Non DFS Used: 6507544576 (6.06 GB)
DFS Remaining: 53753880576 (50.06 GB)
DFS Used%: 0%
DFS Remaining%: 89.2%
Last contact: Fri Mar 30 10:18:12 CST 2012

Name: 192.168.1.3:50010
Decommission Status : Normal
Configured Capacity: 101864341504 (94.87 GB)
DFS Used: 143360 (140 KB)
Non DFS Used: 7971401728 (7.42 GB)
DFS Remaining: 93892796416 (87.44 GB)
DFS Used%: 0%
DFS Remaining%: 92.17%
Last contact: Fri Mar 30 10:18:12 CST 2012
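The cluster can also be checked from a browser: in this Hadoop version the NameNode web UI listens on port 50070 and the JobTracker on port 50030 by default, i.e. http://192.168.1.1:50070 and http://192.168.1.1:50030.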
Step 9: Run the wordcount.java program
(1) First create two input files, file01 and file02, on the local disk:
$ echo "Hello World Bye World" > file01
$ echo "Hello Hadoop Goodbye Hadoop" > file02
(2) Create an input directory in HDFS: $ hadoop fs -mkdir input
(3) Copy file01 and file02 into HDFS:
$ hadoop fs -copyFromLocal /home/mark/file0* input
(4) Run wordcount:
$ hadoop jar hadoop-0.20.2-examples.jar wordcount input output
(5) When it finishes, check the results:
$ hadoop fs -cat output/part-r-00000
Bye 1
Goodbye 1
Hadoop 2
Hello 2
World 2
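To copy the whole result directory back to the local disk (the local target path here is just an example):
$ hadoop fs -get output /home/mark/wordcount-output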
1. Install the Hadoop plugin in Eclipse
Eclipse 3.3 is used here; the installation file is eclipse-jee-europa-winter-win32.zip. According to reports online, other versions may have problems. The Hadoop distribution ships with an Eclipse plugin at \contrib\eclipse-plugin\hadoop-0.20.2-eclipse-plugin.jar; copy this file into Eclipse's plugins directory. Once the plugin is installed successfully, a DFS Locations node appears in the Project Explorer.
A Hadoop Map/Reduce entry also appears under Window -> Preferences; select it and, in the form on the right, point it at the root directory of the downloaded Hadoop distribution.
If you can see both of these, the plugin was installed successfully.
2. Configure the Hadoop connection parameters
As shown in the figure, open the Map/Reduce Locations view and click the elephant icon in its upper-right corner.
Fill in the parameters in the dialog that pops up, as shown in the figure below:
Location name: anything you like
In the Map/Reduce Master box:
Host: the JobTracker's IP
Port: the JobTracker's port, 9001 here
These two values are the IP and port from mapred.job.tracker in mapred-site.xml.
In the DFS Master box:
Host: the NameNode's IP
Port: the NameNode's port, 9000 here
These two values are the IP and port from fs.default.name in core-site.xml.
User name: the user name used to connect to Hadoop
The remaining fields can be left empty. Click Finish, and a record appears in the view.
Restart Eclipse, then edit the connection record you just created. In the previous step we filled in the General tab; now edit the Advanced parameters tab.
hadoop.job.ugi: enter mark,root,Users here; mark is the user name that was used when installing the Hadoop cluster.
Then click Finish and the connection is established.
The figure shows what a successful connection looks like.
3. Write a wordcount program and test it in Eclipse
Create a Map/Reduce project in Eclipse, as shown in the figure.
Then add the Java class MyMap.java to the project, as follows:
package org;

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Mapper: splits each input line into tokens and emits (word, 1) for every token.
public class MyMap extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word;

    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word = new Text();
            word.set(tokenizer.nextToken());
            context.write(word, one);
        }
    }
}
Add the class MyReduce.java as follows:
package org;

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Reducer (also used as the combiner): sums the counts emitted for each word.
public class MyReduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
Add the class MyDriver as follows:
package org;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

// Driver: configures the job (mapper, combiner, reducer, key/value types,
// input/output formats) and submits it, waiting until it completes.
public class MyDriver {
    /**
     * @param args args[0] is the input path, args[1] is the output path
     */
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "Hello Hadoop");
        job.setJarByClass(MyDriver.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setMapperClass(MyMap.class);
        job.setCombinerClass(MyReduce.class);
        job.setReducerClass(MyReduce.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.waitForCompletion(true);
    }
}
4. Go into the c:\windows\system32\drivers\etc directory and add the following line to the hosts file:
192.168.1.1 NameNode
The IP is the master's IP, and NameNode is the master's machine name.
5. Then set the run arguments of the MyDriver class, i.e. the input and output arguments; you must specify the input and output directories.
input is the path where the input files are stored (in this example, make sure the input directory contains the text files to run wordcount on), and output is the path where the results processed by MapReduce are written; a concrete example of the two arguments is sketched below.
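For example, assuming the same HDFS layout as in step 9 above (user mark, directories input and output), the two program arguments might look like this; adjust the paths to your own HDFS user directory:
hdfs://192.168.1.1:9000/user/mark/input hdfs://192.168.1.1:9000/user/mark/output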
6. Then choose Run on Hadoop.
The console prints the following output:
12/03/30 09:28:08 WARN conf.Configuration: DEPRECATED: hadoop-site.xml found in the classpath. Usage of hadoop-site.xml is deprecated. Instead use core-site.xml, mapred-site.xml and hdfs-site.xml to override properties of core-default.xml, mapred-default.xml and hdfs-default.xml respectively
12/03/30 09:28:08 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
12/03/30 09:28:08 INFO input.FileInputFormat: Total input paths to process : 2
12/03/30 09:28:09 INFO mapred.JobClient: Running job: job_201203281633_0006
12/03/30 09:28:10 INFO mapred.JobClient:  map 0% reduce 0%
12/03/30 09:28:19 INFO mapred.JobClient:  map 100% reduce 0%
12/03/30 09:28:31 INFO mapred.JobClient:  map 100% reduce 100%
12/03/30 09:28:33 INFO mapred.JobClient: Job complete: job_201203281633_0006
12/03/30 09:28:33 INFO mapred.JobClient: Counters: 18
12/03/30 09:28:33 INFO mapred.JobClient:   Job Counters
12/03/30 09:28:33 INFO mapred.JobClient:     Launched reduce tasks=1
12/03/30 09:28:33 INFO mapred.JobClient:     Rack-local map tasks=1
12/03/30 09:28:33 INFO mapred.JobClient:     Launched map tasks=2
12/03/30 09:28:33 INFO mapred.JobClient:     Data-local map tasks=1
12/03/30 09:28:33 INFO mapred.JobClient:   FileSystemCounters
12/03/30 09:28:33 INFO mapred.JobClient:     FILE_BYTES_READ=79
12/03/30 09:28:33 INFO mapred.JobClient:     HDFS_BYTES_READ=50
12/03/30 09:28:33 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=228
12/03/30 09:28:33 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=41
12/03/30 09:28:33 INFO mapred.JobClient:   Map-Reduce Framework
12/03/30 09:28:33 INFO mapred.JobClient:     Reduce input groups=5
12/03/30 09:28:33 INFO mapred.JobClient:     Combine output records=6
12/03/30 09:28:33 INFO mapred.JobClient:     Map input records=2
12/03/30 09:28:33 INFO mapred.JobClient:     Reduce shuffle bytes=85
12/03/30 09:28:33 INFO mapred.JobClient:     Reduce output records=5
12/03/30 09:28:33 INFO mapred.JobClient:     Spilled Records=12
12/03/30 09:28:33 INFO mapred.JobClient:     Map output bytes=82
12/03/30 09:28:33 INFO mapred.JobClient:     Combine input records=8
12/03/30 09:28:33 INFO mapred.JobClient:     Map output records=8
12/03/30 09:28:33 INFO mapred.JobClient:     Reduce input records=6
7. Finally, look in the output directory to check the program's results.