Implementing WordCount with MapReduce, and Optimizing It

Date: 2019-11-25

WordCount: counts how many times each word occurs in a text file.

Define a Mapper class that extends org.apache.hadoop.mapreduce.Mapper and overrides the map() method:

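// Imports used by this fragment (place once at the top of the source file):
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public static class TokenizerMapper extends
			Mapper<LongWritable, Text, Text, IntWritable> {
		// A static constant shared by all calls, so map() does not create a new object each time
		private final static IntWritable one = new IntWritable(1);
		// A reusable mutable field: each map() call only assigns a new value via Text.set()
		private Text word = new Text();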

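		@Override
		public void map(LongWritable key, Text value, Context context)
				throws IOException, InterruptedException {
			// Split on runs of whitespace; splitting on a single space emits empty tokens for repeated spaces
			String[] words = value.toString().split("\\s+");
			for (String item : words) {
				if (item.isEmpty()) {
					continue; // a blank line or leading whitespace yields an empty token
				}
				word.set(item);
				context.write(word, one);
			}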
		}
	}
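To make the map step concrete, here is a minimal plain-Java sketch that runs without Hadoop. The class name MapStepSketch and its map() helper are hypothetical and only mimic what a mapper emits: one (word, 1) pair per token. As an assumption of this sketch, it splits on runs of whitespace and skips empty tokens; splitting on a single space would also emit an empty string for every repeated space.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class MapStepSketch {
    /** Emits one (word, 1) pair per whitespace-separated token of the line. */
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String token : line.split("\\s+")) {
            if (!token.isEmpty()) { // a blank line or leading whitespace gives an empty token
                out.add(new SimpleEntry<>(token, 1));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(map("hello tom"));   // [hello=1, tom=1]
        System.out.println(map("hello  tom"));  // [hello=1, tom=1], the double space adds nothing
        // For comparison, a single-space split keeps the empty token in the middle
        System.out.println("hello  tom".split(" ").length); // 3
    }
}
```

The pairs are not summed here; in the real job that happens later, after the shuffle has grouped them by word.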

Define a Reducer class that extends org.apache.hadoop.mapreduce.Reducer and overrides the reduce() method:

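// Imports used by this fragment (place once at the top of the source file):
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public static class IntSumReducer extends
			Reducer<Text, IntWritable, Text, IntWritable> {
		// A reusable mutable field: each reduce() call only assigns a new value via IntWritable.set()
		private IntWritable result = new IntWritable();

		@Override
		public void reduce(Text key, Iterable<IntWritable> values,
				Context context) throws IOException, InterruptedException {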
			int sum = 0;
			for (IntWritable val : values) {
				sum += val.get();
			}
			result.set(sum);
			context.write(key, result);
		}
	}

Testing WordCount

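// Imports used by this fragment (place once at the top of the source file):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public static void main(String[] args) throws Exception {
		Configuration conf = new Configuration();
		Job job = Job.getInstance(conf);
		job.setJarByClass(WordCount.class); // main class of the job
		job.setMapperClass(TokenizerMapper.class); // Mapper class
		// Use a combiner to reduce the amount of data transferred through the shuffle
		job.setCombinerClass(IntSumReducer.class); // Combiner class
		job.setReducerClass(IntSumReducer.class); // Reducer class
		job.setMapOutputKeyClass(Text.class); // key type of the map output
		job.setMapOutputValueClass(IntWritable.class); // value type of the map output
		job.setOutputKeyClass(Text.class); // key type of the reduce output
		job.setOutputValueClass(IntWritable.class); // value type of the reduce output
		// Input path of the job (taken from the main() argument args)
		FileInputFormat.addInputPath(job, new Path(args[0]));
		// Output path of the job (taken from the main() argument args); it must not exist yet
		FileOutputFormat.setOutputPath(job, new Path(args[1]));

		// Submit the job, wait for it to finish, and report failure through the exit code
		System.exit(job.waitForCompletion(true) ? 0 : 1);
	}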

Input (file words):

hello tom
hello jerry
hello kitty
hello world
hello tom

Output:

hello	5
jerry	1
kitty	1
tom	2
world	1
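The output above can be cross-checked without a cluster. The following is a minimal plain-Java sketch, not the Hadoop job itself: the class name LocalWordCount and its count() helper are hypothetical, and a TreeMap stands in for the shuffle's sorting of keys (for lowercase ASCII words like these, String order should match the byte order Hadoop uses for Text keys).

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class LocalWordCount {
    /** Counts every word across all lines; the TreeMap keeps the words in sorted order. */
    static Map<String, Integer> count(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            for (String token : line.split("\\s+")) {
                if (!token.isEmpty()) {
                    // Equivalent to emitting (token, 1) and summing the ones per key
                    counts.merge(token, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // The five sample input lines from the words file
        List<String> words = Arrays.asList(
                "hello tom", "hello jerry", "hello kitty", "hello world", "hello tom");
        for (Map.Entry<String, Integer> e : count(words).entrySet()) {
            System.out.println(e.getKey() + "\t" + e.getValue()); // tab-separated, like the job output
        }
    }
}
```

Running main() prints the same five lines as the job output: hello 5, jerry 1, kitty 1, tom 2, world 1.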

Creating fewer objects means fewer garbage collections, which generally makes the job run faster.

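Using a combiner to reduce the amount of data transferred during the shuffle is one of the key points in tuning a MapReduce job. IntSumReducer can double as the combiner because addition is associative and commutative, so summing partial sums yields the same totals.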
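A rough way to see the combiner's effect is to count the records one map task would hand to the shuffle. The sketch below is plain Java with hypothetical names (CombinerSketch, shuffledRecords). It is idealized: it assumes the five sample lines all land in a single map task and that the combiner sees the task's entire output at once, whereas Hadoop may run the combiner zero, one, or several times.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class CombinerSketch {
    /** Number of (word, count) records a single map task would send to the shuffle. */
    static int shuffledRecords(List<String> split, boolean useCombiner) {
        int raw = 0;                                    // one record per token
        Map<String, Integer> combined = new TreeMap<>(); // one record per distinct word
        for (String line : split) {
            for (String token : line.split("\\s+")) {
                if (token.isEmpty()) {
                    continue;
                }
                raw++;
                combined.merge(token, 1, Integer::sum); // local pre-aggregation, as a combiner does
            }
        }
        return useCombiner ? combined.size() : raw;
    }

    public static void main(String[] args) {
        // Assumption: all five sample lines are processed by one map task
        List<String> split = Arrays.asList(
                "hello tom", "hello jerry", "hello kitty", "hello world", "hello tom");
        System.out.println(shuffledRecords(split, false)); // 10
        System.out.println(shuffledRecords(split, true));  // 5
    }
}
```

On this tiny input the saving is only 10 records versus 5, but the gap grows with how often words repeat within a map task's input.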