MapReduce is a programming framework for distributed computation. Its core job is to combine the business logic written by the user with the framework's built-in default components into a complete distributed program, which then runs concurrently on a Hadoop cluster. MapReduce follows a "divide and conquer" strategy: a large dataset stored in the distributed file system is cut into many independent splits, and those splits can be processed in parallel by multiple map tasks.
The four major components of Hadoop:
(1) HDFS: the distributed storage system;
(2) MapReduce: the distributed computation system;
(3) YARN: Hadoop's resource scheduling system;
(4) Common: the underlying support layer for the three components above, mainly providing basic utilities, an RPC framework, and so on.
The MapReduce component ships with some official example programs, the best known of which are the wordcount and pi programs. Their code lives in the examples jar (hadoop-mapreduce-examples-*.jar), which sits under the Hadoop installation directory at:
/share/hadoop/mapreduce
Below we walk through these two example programs one by one.
Before testing, turn off the firewall, then start Zookeeper and the Hadoop cluster, in this order:
./start-dfs.sh
./start-yarn.sh
Once everything has started, check that all the expected processes are up. See my earlier posts on setting up the cluster for details.
1. The pi example program
(1) Execute the command with arguments
[hadoop@slave01 mapreduce]$ hadoop jar hadoop-mapreduce-examples-2.7.6.jar pi 5 5
Number of Maps  = 5
Samples per Map = 5
Wrote input for Map #0
Wrote input for Map #1
Wrote input for Map #2
Wrote input for Map #3
Wrote input for Map #4
Starting Job
... ... (part of the output omitted) ... ...
18/06/27 16:22:56 INFO mapreduce.Job:  map 0% reduce 0%
18/06/27 16:28:12 INFO mapreduce.Job:  map 73% reduce 0%
18/06/27 16:28:13 INFO mapreduce.Job:  map 100% reduce 0%
18/06/27 16:29:26 INFO mapreduce.Job:  map 100% reduce 100%
18/06/27 16:29:29 INFO mapreduce.Job: Job job_1530087649012_0001 completed successfully
18/06/27 16:29:30 INFO mapreduce.Job: Counters: 49
	File System Counters
		FILE: Number of bytes read=116
		FILE: Number of bytes written=738477
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
		HDFS: Number of bytes read=1320
		HDFS: Number of bytes written=215
		HDFS: Number of read operations=23
		HDFS: Number of large read operations=0
		HDFS: Number of write operations=3
	Job Counters
		Launched map tasks=5
		Launched reduce tasks=1
		Data-local map tasks=5
		Total time spent by all maps in occupied slots (ms)=1625795
		Total time spent by all reduces in occupied slots (ms)=48952
		Total time spent by all map tasks (ms)=1625795
		Total time spent by all reduce tasks (ms)=48952
		Total vcore-milliseconds taken by all map tasks=1625795
		Total vcore-milliseconds taken by all reduce tasks=48952
		Total megabyte-milliseconds taken by all map tasks=1664814080
		Total megabyte-milliseconds taken by all reduce tasks=50126848
	Map-Reduce Framework
		Map input records=5
		Map output records=10
		Map output bytes=90
		Map output materialized bytes=140
		Input split bytes=730
		Combine input records=0
		Combine output records=0
		Reduce input groups=2
		Reduce shuffle bytes=140
		Reduce input records=10
		Reduce output records=0
		Spilled Records=20
		Shuffled Maps =5
		Failed Shuffles=0
		Merged Map outputs=5
		GC time elapsed (ms)=107561
		CPU time spent (ms)=32240
		Physical memory (bytes) snapshot=500453376
		Virtual memory (bytes) snapshot=12460331008
		Total committed heap usage (bytes)=631316480
	Shuffle Errors
		BAD_ID=0
		CONNECTION=0
		IO_ERROR=0
		WRONG_LENGTH=0
		WRONG_MAP=0
		WRONG_REDUCE=0
	File Input Format Counters
		Bytes Read=590
	File Output Format Counters
		Bytes Written=97
Job Finished in 452.843 seconds
Estimated value of Pi is 3.68000000000000000000
Meaning of the command arguments:
The first 5 is the number of map tasks to run;
the second 5 is the number of samples each map task throws;
the product of the two is the total number of tosses (the pi program estimates the value by dart-throwing).
From the run above we obtained a value of Pi: 3.680000. You can of course vary the arguments to see how they affect the result; for instance, with the arguments changed to 10 and 10, the result comes out as 3.200000. Evidently, the larger the arguments, the more accurate the result tends to be.
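The idea behind the pi program can be sketched in a few lines of Python. This is a simplified, single-machine Monte Carlo version (random darts thrown at the unit square) rather than the actual Java example, which uses a Halton quasi-random sequence; but the map/reduce split is the same: each "map task" counts darts landing inside the quarter circle, and the "reduce" step sums those counts and scales by 4.

```python
import random

def map_task(samples):
    """One 'map task': throw `samples` random darts at the unit square
    and count how many land inside the inscribed quarter circle."""
    inside = 0
    for _ in range(samples):
        x, y = random.random(), random.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return inside

def estimate_pi(num_maps, samples_per_map):
    """'Reduce' step: sum the per-map hit counts and estimate pi.
    The area ratio quarter-circle/square is pi/4, hence the factor 4."""
    total = num_maps * samples_per_map  # total tosses, e.g. 5 * 5 = 25
    inside = sum(map_task(samples_per_map) for _ in range(num_maps))
    return 4.0 * inside / total

random.seed(0)
print(estimate_pi(5, 5))       # very coarse: only 25 samples
print(estimate_pi(100, 1000))  # 100,000 samples, much closer to 3.14159
```

This also shows why larger arguments give better accuracy: the estimate's error shrinks as the total number of samples grows.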
(2) Check the running job
The run time varies, so while the job executes we can check its progress through the web UI by visiting:
slave01:8088
The page looks like this:
As shown above, when the Progress bar completes, the computation is finished; you can also click through to see the details, which I won't demonstrate here.
2. The wordcount example program
(1) Prepare the data and upload it to HDFS
Simply put, wordcount counts words. Here we create a new txt file and type in some words to make counting easy:
[hadoop@slave01 mapreduce]$ touch wordcount.txt
[hadoop@slave01 mapreduce]$ vim wordcount.txt
Type in the following words and save the file:
hello word ! you can help me ? yes , I can How do you do ?
Upload it to HDFS: first create a directory on HDFS, then put the txt file into that directory. Below is one way to create it; the hadoop fs -mkdir form also works. Use either one, and mind the paths:
[hadoop@slave01 bin]$ hdfs dfs -mkdir -p /wordcount
[hadoop@slave01 bin]$ hdfs dfs -put ../share/hadoop/mapreduce/wordcount.txt /wordcount
[hadoop@slave01 bin]$
We can browse the HDFS file system by visiting slave01:50070:
The upload succeeded.
(2) Run the program
Execute the command below, minding the paths:
[hadoop@slave01 bin]$ yarn jar ../share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.6.jar wordcount /wordcount /word_output
18/06/27 17:34:24 INFO client.RMProxy: Connecting to ResourceManager at slave01/127.0.0.1:8032
18/06/27 17:34:30 INFO input.FileInputFormat: Total input paths to process : 1
18/06/27 17:34:30 INFO mapreduce.JobSubmitter: number of splits:1
18/06/27 17:34:31 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1530087649012_0003
18/06/27 17:34:32 INFO impl.YarnClientImpl: Submitted application application_1530087649012_0003
18/06/27 17:34:33 INFO mapreduce.Job: The url to track the job: http://slave01:8088/proxy/application_1530087649012_0003/
18/06/27 17:34:33 INFO mapreduce.Job: Running job: job_1530087649012_0003
18/06/27 17:34:52 INFO mapreduce.Job: Job job_1530087649012_0003 running in uber mode : false
18/06/27 17:34:52 INFO mapreduce.Job:  map 0% reduce 0%
18/06/27 17:35:02 INFO mapreduce.Job:  map 100% reduce 0%
18/06/27 17:35:31 INFO mapreduce.Job:  map 100% reduce 100%
18/06/27 17:35:32 INFO mapreduce.Job: Job job_1530087649012_0003 completed successfully
... ... (part of the output omitted) ... ...
	Shuffle Errors
		BAD_ID=0
		CONNECTION=0
		IO_ERROR=0
		WRONG_LENGTH=0
		WRONG_MAP=0
		WRONG_REDUCE=0
	File Input Format Counters
		Bytes Read=59
	File Output Format Counters
		Bytes Written=72
Meaning of the command arguments:
The first is the path to the jar; the second is the name of the example program to run, wordcount; the third is the HDFS path of the input file; the fourth is the directory for the output (it must not already exist).
The output above shows the run; as before, we can track its progress by visiting slave01:8088.
After the job finishes, you can see on the HDFS file system that the output directory has been created and contains the output files:
The result file can be viewed with the following command:
[hadoop@slave01 bin]$ hdfs dfs -text /word_output/part*
!	1
,	1
?	2
How	1
I	1
can	2
do	2
hello	1
help	1
me	1
word	1
yes	1
you	2
[hadoop@slave01 bin]$
As shown above, the word counts are complete; you can verify them against the input file.
That covers the walkthrough of these two built-in examples; as for the code itself, we can dig into it together some other time.
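The wordcount logic itself is easy to verify outside Hadoop. The sketch below (plain Python, not the actual Java example) mirrors the two phases: the map phase splits the input into tokens and emits a (word, 1) pair per token, and the shuffle/reduce phase groups the pairs by word and sums the ones. Run on the sample text above, it reproduces the counts in the result file, including the sorted-by-key output order.

```python
from collections import defaultdict

text = "hello word ! you can help me ? yes , I can How do you do ?"

# Map phase: tokenize the input and emit (word, 1) pairs.
pairs = [(word, 1) for word in text.split()]

# Shuffle + reduce phase: group pairs by key and sum the counts.
counts = defaultdict(int)
for word, n in pairs:
    counts[word] += n

# Hadoop's reducer output is sorted by key; do the same for display.
for word in sorted(counts):
    print(word, counts[word])
```

Note that, like the real example, this counts whitespace-separated tokens, which is why the punctuation marks "!", "," and "?" show up as words in the output.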