Author: Lijb
Big Data
What problems does big data solve?
Storage: a single machine hits its storage bottleneck (limited capacity, data is not safe) and suffers poor read/write efficiency (sequential I/O). Big data takes distributed storage as its guiding idea: by building a cluster, hardware can be scaled out horizontally to increase the system's storage capacity. Commonly used solutions for massive-scale storage today include Hadoop HDFS, FastDFS/GlusterFS (traditional file systems), MongoDB GridFS, S3, and so on.
Computation: the amount of data a single machine can process is limited, and the time required cannot be controlled. Big data introduces a new way of thinking: split a computation into n small tasks, distribute them to the compute nodes of a cluster, and let the nodes cooperate to finish the job. This model is called distributed computing. The mainstream computing models today are offline (batch) computing, near-real-time computing, and online real-time computing; offline computing is represented by Hadoop MapReduce, near-real-time computing by Spark's in-memory computation, and online real-time computing by Storm, Kafka Streams, and Spark Streaming.
Summary: both problems are solved with a distributed mindset. Because vertical (scale-up) hardware upgrades are expensive, companies generally prefer distributed horizontal scaling.
The birth of Hadoop
Hadoop was formally introduced by the Apache Software Foundation in the fall of 2005 as part of Nutch, a sub-project of Lucene. It was inspired by MapReduce and the Google File System (GFS), first developed at Google Labs. The storage and computation modules of Nutch were refactored; in 2006 a Yahoo team joined the Nutch project and worked to split those modules out of Nutch, which produced the Hadoop 1.x line. In 2010 Yahoo reworked Hadoop's computation module and further improved the management model of the MapReduce framework; the result became Hadoop 2.x. The biggest difference between Hadoop 1.x and 2.x is this refactored computation framework: Hadoop 2.x introduced YARN as the resource scheduler, which makes the MapReduce framework more scalable and its job management more reliable.
The big data ecosystem
The Hadoop ecosystem - 2006 ("material civilization"):
The computation layer - 2010 ("spiritual civilization")
Big data application scenarios?
Hadoop Distributed File System
HDFS architecture
The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. Hardware failure is the norm rather than the exception. An HDFS instance may consist of hundreds or thousands of server machines, each storing part of the file system's data. The fact that there are a huge number of components and that each component has a non-trivial probability of failure means that some component of HDFS is always non-functional. Therefore, detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS.
HDFS has a master/slave architecture. An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients. In addition, there are a number of DataNodes, usually one per node in the cluster, which manage storage attached to the nodes that they run on. HDFS exposes a file system namespace and allows user data to be stored in files. Internally, a file is split into one or more blocks and these blocks are stored in a set of DataNodes. The NameNode executes file system namespace operations like opening, closing, and renaming files and directories. It also determines the mapping of blocks to DataNodes. The DataNodes are responsible for serving read and write requests from the file system’s clients. The DataNodes also perform block creation, deletion, and replication upon instruction from the NameNode.
NameNode: the name (master) node. It manages the cluster's metadata (the mapping from blocks to DataNodes and the replica information of each block) as well as the DataNodes. The number of block replicas (the replication factor) is controlled by the dfs.replication parameter; in single-node mode it must be set to 1. The NameNode's client access entry point is configured through fs.defaultFS. The first time HDFS is set up you must run hdfs namenode -format, which creates the NameNode's initial metadata file, fsimage.
DataNode: stores block data, serves client read/write requests for blocks, carries out instructions from the NameNode, and reports its own status information back to the NameNode.
Block: the unit into which files are split, 128 MB by default; configurable through dfs.blocksize.
Rack: identifies the physical location of the host a DataNode runs on and is used to optimize storage and computation; the topology can be viewed with hdfs dfsadmin -printTopology.
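To see how blocks, the replication factor and rack placement surface through the client API, here is a minimal hedged sketch (not part of the original article) that prints the block locations of a file; the path /demo/access and the connection URI are assumptions reused from later sections.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.util.Arrays;

public class BlockInfo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://CentOS:9000");   // NameNode client entry point (assumed)
        FileSystem fs = FileSystem.get(conf);

        Path path = new Path("/demo/access");             // hypothetical file
        FileStatus status = fs.getFileStatus(path);
        System.out.println("block size: " + status.getBlockSize()
                + ", replication: " + status.getReplication());

        // Each BlockLocation shows which DataNodes (and racks) hold a replica of one block.
        for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println(block.getOffset() + "-" + (block.getOffset() + block.getLength())
                    + " hosts=" + Arrays.toString(block.getHosts())
                    + " racks=" + Arrays.toString(block.getTopologyPaths()));
        }
        fs.close();
    }
}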
Further reading: http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html
Setting up Hadoop HDFS (distributed storage)
Configure the hostname-to-IP mapping
[root@CentOS ~]# vi /etc/hosts
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
192.168.29.128 CentOS
Turn off the system firewall
[root@CentOS ~]# clear
[root@CentOS ~]# service iptables stop
iptables: Setting chains to policy ACCEPT: filter          [  OK  ]
iptables: Flushing firewall rules:                         [  OK  ]
iptables: Unloading modules:                               [  OK  ]
[root@CentOS ~]# chkconfig iptables off
[root@CentOS ~]# chkconfig --list | grep iptables
iptables   0:off  1:off  2:off  3:off  4:off  5:off  6:off
[root@CentOS ~]#
Configure passwordless SSH login to the local machine
[root@CentOS ~]# ssh-keygen -t rsa
Generating public/private rsa key pair.
Enter file in which to save the key (/root/.ssh/id_rsa):
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /root/.ssh/id_rsa.
Your public key has been saved in /root/.ssh/id_rsa.pub.
The key fingerprint is:
de:83:c8:81:77:d7:db:f2:79:da:97:8b:36:d5:78:01 root@CentOS
The key's randomart image is:
+--[ RSA 2048]----+
|                 |
|       E         |
|        .        |
|   . . .         |
|  . o S . . .o   |
|   o = + o..o    |
|  o o o o o..    |
|      .   =.+o   |
|         ..=++   |
+-----------------+
[root@CentOS ~]# ssh-copy-id CentOS
root@centos's password: ****
Now try logging into the machine, with "ssh 'CentOS'", and check in:

  .ssh/authorized_keys

to make sure we haven't added extra keys that you weren't expecting.
[root@CentOS ~]# ssh CentOS
Last login: Mon Oct 15 19:47:46 2018 from 192.168.29.1
[root@CentOS ~]# exit
logout
Connection to CentOS closed.
Set up the Java development environment
[root@CentOS ~]# rpm -ivh jdk-8u171-linux-x64.rpm
Preparing...                ########################################### [100%]
   1:jdk1.8                 ########################################### [100%]
Unpacking JAR files...
        tools.jar...
        plugin.jar...
        javaws.jar...
        deploy.jar...
        rt.jar...
        jsse.jar...
        charsets.jar...
        localedata.jar...
[root@CentOS ~]# vi .bashrc
JAVA_HOME=/usr/java/latest
PATH=$PATH:$JAVA_HOME/bin
CLASSPATH=.
export JAVA_HOME
export PATH
export CLASSPATH
[root@CentOS ~]# source .bashrc
[root@CentOS ~]# jps
1674 Jps
Install and configure Hadoop HDFS
[root@CentOS ~]# tar -zxf hadoop-2.6.0_x64.tar.gz -C /usr/
[root@CentOS ~]# vi .bashrc
HADOOP_HOME=/usr/hadoop-2.6.0
JAVA_HOME=/usr/java/latest
PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
CLASSPATH=.
export JAVA_HOME
export PATH
export CLASSPATH
export HADOOP_HOME
[root@CentOS ~]# source .bashrc
[root@CentOS ~]# hadoop classpath
/usr/hadoop-2.6.0/etc/hadoop:/usr/hadoop-2.6.0/share/hadoop/common/lib/*:/usr/hadoop-2.6.0/share/hadoop/common/*:/usr/hadoop-2.6.0/share/hadoop/hdfs:/usr/hadoop-2.6.0/share/hadoop/hdfs/lib/*:/usr/hadoop-2.6.0/share/hadoop/hdfs/*:/usr/hadoop-2.6.0/share/hadoop/yarn/lib/*:/usr/hadoop-2.6.0/share/hadoop/yarn/*:/usr/hadoop-2.6.0/share/hadoop/mapreduce/lib/*:/usr/hadoop-2.6.0/share/hadoop/mapreduce/*:/usr/hadoop-2.6.0/contrib/capacity-scheduler/*.jar
Edit the core-site.xml file
[root@CentOS ~]# vi /usr/hadoop-2.6.0/etc/hadoop/core-site.xml
<property>
    <name>fs.defaultFS</name>
    <value>hdfs://CentOS:9000</value>
</property>
<property>
    <name>hadoop.tmp.dir</name>
    <value>/usr/hadoop-2.6.0/hadoop-${user.name}</value>
</property>
[root@CentOS ~]# vi /usr/hadoop-2.6.0/etc/hadoop/hdfs-site.xml
<property>
    <name>dfs.replication</name>
    <value>1</value>
</property>
[root@CentOS ~]# vi /usr/hadoop-2.6.0/etc/hadoop/slaves
CentOS
Initialize (format) HDFS
[root@CentOS ~]# hdfs namenode -format
...
18/10/15 20:06:03 INFO namenode.NNConf: ACLs enabled? false
18/10/15 20:06:03 INFO namenode.NNConf: XAttrs enabled? true
18/10/15 20:06:03 INFO namenode.NNConf: Maximum size of an xattr: 16384
18/10/15 20:06:03 INFO namenode.FSImage: Allocated new BlockPoolId: BP-665637298-192.168.29.128-1539605163213
18/10/15 20:06:03 INFO common.Storage: Storage directory /usr/hadoop-2.6.0/hadoop-root/dfs/name has been successfully formatted.
18/10/15 20:06:03 INFO namenode.NNStorageRetentionManager: Going to retain 1 images with txid >= 0
18/10/15 20:06:03 INFO util.ExitUtil: Exiting with status 0
18/10/15 20:06:03 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at CentOS/192.168.29.128
************************************************************/
[root@CentOS ~]# ls /usr/hadoop-2.6.0/
bin  etc  hadoop-root  include  lib  libexec  LICENSE.txt  NOTICE.txt  README.txt  sbin  share
This command only needs to be run before HDFS is started for the first time; it does not need to be run again on subsequent restarts.
Start the HDFS service
[root@CentOS ~]# start-dfs.sh
Starting namenodes on [CentOS]
CentOS: starting namenode, logging to /usr/hadoop-2.6.0/logs/hadoop-root-namenode-CentOS.out
CentOS: starting datanode, logging to /usr/hadoop-2.6.0/logs/hadoop-root-datanode-CentOS.out
Starting secondary namenodes [0.0.0.0]
The authenticity of host '0.0.0.0 (0.0.0.0)' can't be established.
RSA key fingerprint is 7c:95:67:06:a7:d0:fc:bc:fc:4d:f2:93:c2:bf:e9:31.
Are you sure you want to continue connecting (yes/no)? yes
0.0.0.0: Warning: Permanently added '0.0.0.0' (RSA) to the list of known hosts.
0.0.0.0: starting secondarynamenode, logging to /usr/hadoop-2.6.0/logs/hadoop-root-secondarynamenode-CentOS.out
[root@CentOS ~]# jps
2132 SecondaryNameNode
1892 NameNode
2234 Jps
1998 DataNode
You can then access the NameNode web UI in a browser at http://192.168.29.128:50070
HDFS Shell
[root@CentOS ~]# hadoop fs -help | hdfs dfs -help
Usage: hadoop fs [generic options]
    [-appendToFile <localsrc> ... <dst>]
    [-cat [-ignoreCrc] <src> ...]
    [-chmod [-R] <MODE[,MODE]... | OCTALMODE> PATH...]
    [-copyFromLocal [-f] [-p] [-l] <localsrc> ... <dst>]
    [-copyToLocal [-p] [-ignoreCrc] [-crc] <src> ... <localdst>]
    [-cp [-f] [-p | -p[topax]] <src> ... <dst>]
    [-get [-p] [-ignoreCrc] [-crc] <src> ... <localdst>]
    [-help [cmd ...]]
    [-ls [-d] [-h] [-R] [<path> ...]]
    [-mkdir [-p] <path> ...]
    [-moveFromLocal <localsrc> ... <dst>]
    [-moveToLocal <src> <localdst>]
    [-mv <src> ... <dst>]
    [-put [-f] [-p] [-l] <localsrc> ... <dst>]
    [-rm [-f] [-r|-R] [-skipTrash] <src> ...]
    [-rmdir [--ignore-fail-on-non-empty] <dir> ...]
    [-tail [-f] <file>]
    [-text [-ignoreCrc] <src> ...]
    [-touchz <path> ...]
    [-usage [cmd ...]]
[root@CentOS ~]# hadoop fs -copyFromLocal /root/hadoop-2.6.0_x64.tar.gz /
[root@CentOS ~]# hdfs dfs -copyToLocal /hadoop-2.6.0_x64.tar.gz ~/
Reference: http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/FileSystemShell.html
Operating HDFS with the Java API
Set up the Windows development environment
Add the Maven dependencies
<dependency>
    <groupId>junit</groupId>
    <artifactId>junit</artifactId>
    <version>4.12</version>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-hdfs</artifactId>
    <version>2.6.0</version>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-common</artifactId>
    <version>2.6.0</version>
</dependency>
<dependency>
    <groupId>log4j</groupId>
    <artifactId>log4j</artifactId>
    <version>1.2.17</version>
</dependency>
Disable HDFS permission checking
org.apache.hadoop.security.AccessControlException: Permission denied: user=HIAPAD, access=WRITE, inode="/":root:supergroup:drwxr-xr-x
    at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkFsPermission(FSPermissionChecker.java:271)
    ...
Option 1: configure hdfs-site.xml
<property>
    <name>dfs.permissions.enabled</name>
    <value>false</value>
</property>
Option 2: add the JVM startup parameter -DHADOOP_USER_NAME=root
java xxx -DHADOOP_USER_NAME=root
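The same effect can be achieved from code, since the HDFS client reads the HADOOP_USER_NAME system property when it resolves the current user. A minimal hedged sketch (the class name is hypothetical and the connection URI is the one used elsewhere in this article); the property must be set before the first FileSystem call:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class AsRootExample {
    public static void main(String[] args) throws Exception {
        // Equivalent to passing -DHADOOP_USER_NAME=root on the JVM command line:
        // the client then identifies itself to HDFS as user "root".
        System.setProperty("HADOOP_USER_NAME", "root");

        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://CentOS:9000");
        FileSystem fs = FileSystem.get(conf);
        System.out.println(fs.getUri());
        fs.close();
    }
}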
Option 3: run chmod to change the directory read/write permissions
[root@CentOS ~]# hdfs dfs -chmod -R 777 /
Java API examples
private FileSystem fileSystem;
private Configuration conf;

@Before
public void before() throws IOException {
    conf = new Configuration();
    conf.set("fs.defaultFS", "hdfs://CentOS:9000");
    conf.set("dfs.replication", "1");
    conf.set("fs.trash.interval", "1");
    fileSystem = FileSystem.get(conf);
    assertNotNull(fileSystem);
}

@Test
public void testUpload01() throws IOException {
    InputStream is = new FileInputStream("C:\\Users\\HIAPAD\\Desktop\\买家须知.txt");
    Path path = new Path("/bb.txt");
    OutputStream os = fileSystem.create(path);
    IOUtils.copyBytes(is, os, 1024, true);
    /* Equivalent manual copy loop:
    byte[] bytes = new byte[1024];
    while (true) {
        int n = is.read(bytes);
        if (n == -1) break;
        os.write(bytes, 0, n);
    }
    os.close();
    is.close();
    */
}

@Test
public void testUpload02() throws IOException {
    Path src = new Path("file:///C:\\Users\\HIAPAD\\Desktop\\买家须知.txt");
    Path dist = new Path("/dd.txt");
    fileSystem.copyFromLocalFile(src, dist);
}

@Test
public void testDownload01() throws IOException {
    Path dist = new Path("file:///C:\\Users\\HIAPAD\\Desktop\\11.txt");
    Path src = new Path("/dd.txt");
    fileSystem.copyToLocalFile(src, dist);
    // fileSystem.copyToLocalFile(false, src, dist, true);
}

@Test
public void testDownload02() throws IOException {
    OutputStream os = new FileOutputStream("C:\\Users\\HIAPAD\\Desktop\\22.txt");
    InputStream is = fileSystem.open(new Path("/dd.txt"));
    IOUtils.copyBytes(is, os, 1024, true);
}

@Test
public void testDelete() throws IOException {
    Path path = new Path("/dd.txt");
    fileSystem.delete(path, true);
}

@Test
public void testListFiles() throws IOException {
    Path path = new Path("/");
    RemoteIterator<LocatedFileStatus> files = fileSystem.listFiles(path, true);
    while (files.hasNext()) {
        LocatedFileStatus file = files.next();
        System.out.println(file.getPath() + " " + file.getLen() + " " + file.isFile());
    }
}

@Test
public void testDeleteWithTrash() throws IOException {
    Trash trash = new Trash(conf);
    trash.moveToTrash(new Path("/aa.txt"));
}

@After
public void after() throws IOException {
    fileSystem.close();
}
Hadoop MapReduce
MapReduce is a programming model for parallel computation over large data sets. Its central concepts, "Map" and "Reduce", are borrowed from functional programming languages (which are well suited to shipping functions across a network), along with features taken from vector programming languages.
Hadoop's MapReduce framework makes full use of the memory, CPU, network, and a small amount of disk on the physical hosts that also serve as storage nodes to perform distributed computation over large data sets. The framework normally starts a NodeManager service on every physical host that runs a DataNode; the NodeManager manages the compute resources of the node it runs on. In addition, the system starts a single ResourceManager that coordinates resource scheduling across the whole computation.
The core idea of MapReduce is to split one large computation into many small tasks that run independently, each producing an intermediate result that is usually stored locally. When the first batch of tasks has finished, the system launches a second batch whose job is to pull the intermediate results of the first batch over the network, gather them locally, and perform the final aggregation.
In other words, when MapReduce is used for large-scale analysis, the system partitions the input data; each partition description is called an input split (essentially a logical range mapping over the target data). At run time the number of splits determines the parallelism of the Map phase, so the Map phase performs the local part of the computation, with one compute resource per Map task. Once all Map tasks have finished the local computation over their splits, they store their results on local disk. The system then launches a number of Reduce tasks, according to the configured reduce parallelism, to aggregate the Map outputs and write the final result to HDFS, MySQL, NoSQL stores, and so on.
MapReduce 2 architecture
ResourceManager: coordinates and manages the cluster's compute resources.
NodeManager: launches compute resources (Containers), e.g. MRAppMaster and YarnChild; the NM also connects to the RM and reports its own resource usage.
MRAppMaster: the application master of a job; it monitors the job's tasks and handles failover during the computation. There is exactly one per job.
YarnChild: the collective name for MapTask and ReduceTask; each one is a compute process.
YARN architecture diagram
Setting up the MR runtime environment
[root@CentOS ~]# cp /usr/hadoop-2.6.0/etc/hadoop/mapred-site.xml.template /usr/hadoop-2.6.0/etc/hadoop/mapred-site.xml
<property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
</property>
etc/hadoop/yarn-site.xml
<property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
</property>
<property>
    <name>yarn.resourcemanager.hostname</name>
    <value>CentOS</value>
</property>
Start YARN
[root@CentOS ~]# start-yarn.sh
[root@CentOS ~]# jps
5250 Jps
4962 ResourceManager
5043 NodeManager
3075 DataNode
3219 SecondaryNameNode
2959 NameNode
A first MapReduce HelloWorld
INFO 192.168.0.1 wx com.baizhi.service.IUserService#login
INFO 192.168.0.4 wx com.baizhi.service.IUserService#login
INFO 192.168.0.2 wx com.baizhi.service.IUserService#login
INFO 192.168.0.3 wx com.baizhi.service.IUserService#login
INFO 192.168.0.1 wx com.baizhi.service.IUserService#login
SQL
create table t_access(
    level  varchar(32),
    ip     varchar(64),
    app    varchar(64),
    server varchar(128)
)
select ip, sum(1) from t_access group by ip;

map(ip, 1)
reduce(ip, [1,1,1,..])
Mapper logic
public class AccessMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    /**
     * INFO 192.168.0.1 wx com.baizhi.service.IUserService#login
     * @param key     byte offset of the line within the file
     * @param value   one line of text
     * @param context
     * @throws IOException
     * @throws InterruptedException
     */
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String[] tokens = value.toString().split(" ");
        String ip = tokens[1];
        context.write(new Text(ip), new IntWritable(1));
    }
}
Reducer logic
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import java.io.IOException;
public class AccessReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        int total = 0;
        for (IntWritable value : values) {
            total += value.get();
        }
        context.write(key, new IntWritable(total));
    }
}
Submitting the job
public class CustomJobSubmitter extends Configured implements Tool {
    public int run(String[] args) throws Exception {
        // 1. Build the job
        Configuration conf = getConf();
        Job job = Job.getInstance(conf);
        // 2. Set the input/output data format classes
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        // 3. Set the input and output paths
        Path src = new Path("/demo/access");
        TextInputFormat.addInputPath(job, src);
        Path dst = new Path("/demo/res"); // the output directory must not exist yet
        TextOutputFormat.setOutputPath(job, dst);
        // 4. Set the Mapper and Reducer classes
        job.setMapperClass(AccessMapper.class);
        job.setReducerClass(AccessReducer.class);
        // 5. Set the Mapper and Reducer output key/value types
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // 6. Submit the job
        job.waitForCompletion(true);
        return 0;
    }

    public static void main(String[] args) throws Exception {
        ToolRunner.run(new CustomJobSubmitter(), args);
    }
}
Job submission
InputFormat/OutputFormat
splits a line into key and value at \t by default; the separator is configured with mapreduce.input.keyvaluelinerecordreader.key.value.separator (see the configuration sketch after this list). Split calculation rule: splits are computed per file, cutting each file by splitSize.
Map Reduce Shuffle
Understand the MR job-submission source-code flow
Resolve jar dependencies during MR job execution
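As referenced in the list above, here is a hedged sketch of configuring that key/value separator. The property is read by KeyValueTextInputFormat (the default TextInputFormat instead keys each record by its byte offset); the class name and the space separator chosen here are illustrative assumptions.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;

public class SeparatorConfig {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Default separator is \t; here it is changed to a single space.
        conf.set("mapreduce.input.keyvaluelinerecordreader.key.value.separator", " ");

        Job job = Job.getInstance(conf);
        // With KeyValueTextInputFormat the Mapper receives Text/Text pairs:
        // the part of the line before the first separator is the key, the rest is the value.
        job.setInputFormatClass(KeyValueTextInputFormat.class);
    }
}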
JobSubmitter(DFS|YARNRunner)
submitJobInternal(Job.this, cluster);
    checkSpecs(job);                          # check whether the output directory already exists
    JobID jobId = submitClient.getNewJobID(); # obtain the job id
    # build the staging directory
    Path submitJobDir = new Path(jobStagingArea, jobId.toString());
    # upload the code jar | third-party jars | configuration files
    copyAndConfigureFiles(job, submitJobDir);
    int maps = writeSplits(job, submitJobDir);# compute the input splits
    writeConf(conf, submitJobFile);           # upload the job context information (job.xml)
    status = submitClient.submitJob(          # submit the job to the ResourceManager
        jobId, submitJobDir.toString(), job.getCredentials());
Resolving jar dependencies while the program is running
For dependencies needed at submission time, configure HADOOP_CLASSPATH
HADOOP_CLASSPATH=/root/mysql-connector-java-xxxx.jar
HADOOP_HOME=/usr/hadoop-2.6.0
JAVA_HOME=/usr/java/latest
PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
CLASSPATH=.
export JAVA_HOME
export PATH
export CLASSPATH
export HADOOP_HOME
export HADOOP_CLASSPATH
This is usually needed early in job submission, for example when the client must connect to a third-party database in order to compute the input splits.
Extracting substrings with regular expressions
112.116.25.18 - [11/Aug/2017:09:57:24 +0800] "POST /json/startheartbeatactivity.action HTTP/1.1" 200 234 "http://wiki.wang-inc.com/pages/resumedraft.action?draftId=12058756&draftShareId=014b0741-df00-4fdc-90ca-4bf934f914d1" Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36 - 0.023 0.023 12.129.120.121:8090 200

String regex = "^(\\d{3}\\.\\d{3}\\.\\d{1,3}\\.\\d{1,3})\\s-\\s\\[(.*)\\]\\s\".*\"\\s(\\d+)\\s(\\d+)\\s.*";
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(input);
if (matcher.matches()) {
    String value = matcher.group(1); // content captured by the first () group
}
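A hedged sketch of how this could be wired into a Mapper for the access log above: the class name, output types and the choice to sum response bytes per IP are assumptions, and the regex is the one shown above with the first two octets relaxed to {1,3}.

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LogMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    private static final Pattern PATTERN = Pattern.compile(
            "^(\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3})\\s-\\s\\[(.*)\\]\\s\".*\"\\s(\\d+)\\s(\\d+)\\s.*");

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        Matcher matcher = PATTERN.matcher(value.toString());
        if (matcher.matches()) {
            String ip = matcher.group(1);                   // first capture group: client IP
            long bytes = Long.parseLong(matcher.group(4));  // fourth capture group: response size
            context.write(new Text(ip), new LongWritable(bytes));
        }
    }
}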
Custom WritableComparable
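The original notes list this topic without code; as a hedged illustration of the idea, a minimal composite key (the ip/day fields are assumptions) that Hadoop can serialize and sort might look like this:

import org.apache.hadoop.io.WritableComparable;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

// A composite key usable as a Mapper output key: Hadoop serializes it with
// write()/readFields() and sorts it during the shuffle with compareTo().
public class AccessKey implements WritableComparable<AccessKey> {
    private String ip = "";
    private String day = "";

    public AccessKey() { }                       // required no-arg constructor

    public AccessKey(String ip, String day) {
        this.ip = ip;
        this.day = day;
    }

    public void write(DataOutput out) throws IOException {
        out.writeUTF(ip);
        out.writeUTF(day);
    }

    public void readFields(DataInput in) throws IOException {
        ip = in.readUTF();
        day = in.readUTF();
    }

    public int compareTo(AccessKey other) {      // defines the sort order in the shuffle
        int c = ip.compareTo(other.ip);
        return c != 0 ? c : day.compareTo(other.day);
    }

    @Override
    public int hashCode() {                      // used by the default HashPartitioner
        return (ip + day).hashCode();
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof AccessKey)) return false;
        AccessKey k = (AccessKey) o;
        return ip.equals(k.ip) && day.equals(k.day);
    }

    @Override
    public String toString() {
        return ip + "\t" + day;
    }
}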
How to control the partitioning strategy of an MR program
public class CustomPartitioner<K, V> extends Partitioner<K, V> {

    /** Use {@link Object#hashCode()} to partition. */
    public int getPartition(K key, V value, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}

job.setPartitionerClass(CustomPartitioner.class);
// or
conf.setClass("mapreduce.job.partitioner.class", CustomPartitioner.class, Partitioner.class);
Why does data skew occur in MR computations?
Because a poorly chosen key leads to an uneven distribution of the data. Choose an appropriate key for the aggregation so that the data is distributed evenly across partitions; this generally requires the programmer to have some prior understanding of the data being analyzed.
Compressing map output can effectively reduce the network bandwidth consumed during the reduce-side shuffle, at the cost of extra CPU spent compressing and decompressing the data during the computation.
conf.setBoolean("mapreduce.map.output.compress", true);
conf.setClass("mapreduce.map.output.compress.codec", GzipCodec.class, CompressionCodec.class);
When running a local simulation this may throw a "not a gzip file" error, so it is recommended to test compression in a cluster environment.
job.setCombinerClass(CombinerReducer.class);
CombinerReducer is simply a class that extends Reducer; the combiner generally runs during the spill phase and when spill files are merged.
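In the access-log example above, the reducer's input and output types are identical (Text/IntWritable), so, as a hedged sketch, the existing AccessReducer could itself be registered as the combiner instead of writing a separate CombinerReducer class:

// Runs AccessReducer on each map task's local output during the spill/merge phases,
// so only pre-aggregated (ip, partialCount) pairs cross the network to the reducers.
// This is only safe because summing counts is associative and commutative.
job.setCombinerClass(AccessReducer.class);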
HDFS | YARN HA
Environment preparation
Install the CentOS hosts (physical nodes)
CentOSA    192.168.29.129
CentOSB    192.168.29.130
CentOSC    192.168.29.131
Basic configuration
Hostname-to-IP mapping
[root@CentOSX ~]# clear
[root@CentOSX ~]# vi /etc/hosts
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
192.168.128.133 CentOSA
192.168.128.134 CentOSB
192.168.128.135 CentOSC
Turn off the firewall
[root@CentOSX ~]# service iptables stop
iptables: Setting chains to policy ACCEPT: filter          [  OK  ]
iptables: Flushing firewall rules:                         [  OK  ]
iptables: Unloading modules:                               [  OK  ]
[root@CentOSX ~]# chkconfig iptables off
Passwordless SSH login
[root@CentOSX ~]# ssh-keygen -t rsa
[root@CentOSX ~]# ssh-copy-id CentOSA
[root@CentOSX ~]# ssh-copy-id CentOSB
[root@CentOSX ~]# ssh-copy-id CentOSC
Synchronize the clocks of all physical nodes
[root@CentOSA ~]# date -s '2018-09-16 11:28:00'
Sun Sep 16 11:28:00 CST 2018
[root@CentOSA ~]# clock -w
[root@CentOSA ~]# date
Sun Sep 16 11:28:13 CST 2018
Install the JDK and configure the JAVA_HOME environment variable
[root@CentOSX ~]# rpm -ivh jdk-8u171-linux-x64.rpm
[root@CentOSX ~]# vi .bashrc
JAVA_HOME=/usr/java/latest
CLASSPATH=.
PATH=$PATH:$JAVA_HOME/bin
export JAVA_HOME
export CLASSPATH
export PATH
[root@CentOSX ~]# source .bashrc
Install and start ZooKeeper
[root@CentOSX ~]# tar -zxf zookeeper-3.4.6.tar.gz -C /usr/
[root@CentOSX ~]# vi /usr/zookeeper-3.4.6/conf/zoo.cfg
tickTime=2000
dataDir=/root/zkdata
clientPort=2181
initLimit=5
syncLimit=2
server.1=CentOSA:2887:3887
server.2=CentOSB:2887:3887
server.3=CentOSC:2887:3887
[root@CentOSX ~]# mkdir /root/zkdata
[root@CentOSA ~]# echo 1 >> zkdata/myid
[root@CentOSB ~]# echo 2 >> zkdata/myid
[root@CentOSC ~]# echo 3 >> zkdata/myid
[root@CentOSX zookeeper-3.4.6]# ./bin/zkServer.sh start zoo.cfg
JMX enabled by default
Using config: /usr/zookeeper-3.4.6/bin/../conf/zoo.cfg
Starting zookeeper ... STARTED
[root@CentOSX zookeeper-3.4.6]# ./bin/zkServer.sh status zoo.cfg
Install and configure Hadoop
[root@CentOSX ~]# tar -zxf hadoop-2.6.0_x64.tar.gz -C /usr/
[root@CentOSX ~]# vi .bashrc
HADOOP_HOME=/usr/hadoop-2.6.0
JAVA_HOME=/usr/java/latest
CLASSPATH=.
PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
export JAVA_HOME
export CLASSPATH
export PATH
export HADOOP_HOME
[root@CentOSX ~]# source .bashrc
core-site.xml
<property>
    <name>fs.defaultFS</name>
    <value>hdfs://mycluster</value>
</property>
<property>
    <name>hadoop.tmp.dir</name>
    <value>/usr/hadoop-2.6.0/hadoop-${user.name}</value>
</property>
<property>
    <name>fs.trash.interval</name>
    <value>30</value>
</property>
<property>
    <name>net.topology.script.file.name</name>
    <value>/usr/hadoop-2.6.0/etc/hadoop/rack.sh</value>
</property>
Create the rack-awareness script; it determines a machine's physical location from its IP address
[root@CentOSX ~]# vi /usr/hadoop-2.6.0/etc/hadoop/rack.sh
while [ $# -gt 0 ] ; do
  nodeArg=$1
  exec</usr/hadoop-2.6.0/etc/hadoop/topology.data
  result=""
  while read line ; do
    ar=( $line )
    if [ "${ar[0]}" = "$nodeArg" ] ; then
      result="${ar[1]}"
    fi
  done
  shift
  if [ -z "$result" ] ; then
    echo -n "/default-rack"
  else
    echo -n "$result "
  fi
done
[root@CentOSX ~]# chmod u+x /usr/hadoop-2.6.0/etc/hadoop/rack.sh
[root@CentOSX ~]# vi /usr/hadoop-2.6.0/etc/hadoop/topology.data
192.168.128.133 /rack1
192.168.128.134 /rack1
192.168.128.135 /rack2
[root@CentOSX ~]# /usr/hadoop-2.6.0/etc/hadoop/rack.sh 192.168.128.133
/rack1
hdfs-site.xml
<property>
    <name>dfs.replication</name>
    <value>3</value>
</property>
<property>
    <name>dfs.ha.automatic-failover.enabled</name>
    <value>true</value>
</property>
<property>
    <name>ha.zookeeper.quorum</name>
    <value>CentOSA:2181,CentOSB:2181,CentOSC:2181</value>
</property>
<property>
    <name>dfs.nameservices</name>
    <value>mycluster</value>
</property>
<property>
    <name>dfs.ha.namenodes.mycluster</name>
    <value>nn1,nn2</value>
</property>
<property>
    <name>dfs.namenode.rpc-address.mycluster.nn1</name>
    <value>CentOSA:9000</value>
</property>
<property>
    <name>dfs.namenode.rpc-address.mycluster.nn2</name>
    <value>CentOSB:9000</value>
</property>
<property>
    <name>dfs.namenode.shared.edits.dir</name>
    <value>qjournal://CentOSA:8485;CentOSB:8485;CentOSC:8485/mycluster</value>
</property>
<property>
    <name>dfs.client.failover.proxy.provider.mycluster</name>
    <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
<property>
    <name>dfs.ha.fencing.methods</name>
    <value>sshfence</value>
</property>
<property>
    <name>dfs.ha.fencing.ssh.private-key-files</name>
    <value>/root/.ssh/id_rsa</value>
</property>
slaves
[root@CentOSX ~]# vi /usr/hadoop-2.6.0/etc/hadoop/slaves
CentOSA
CentOSB
CentOSC
Start HDFS
[root@CentOSX ~]# hadoop-daemon.sh start journalnode   (wait about 10 seconds before the next step)
[root@CentOSA ~]# hdfs namenode -format
[root@CentOSA ~]# hadoop-daemon.sh start namenode
[root@CentOSB ~]# hdfs namenode -bootstrapStandby      (downloads the active NameNode's metadata)
[root@CentOSB ~]# hadoop-daemon.sh start namenode
[root@CentOSA|B ~]# hdfs zkfc -formatZK                (registers the NameNode information; can be run on either CentOSA or CentOSB)
[root@CentOSA ~]# hadoop-daemon.sh start zkfc          (failover controller)
[root@CentOSB ~]# hadoop-daemon.sh start zkfc          (failover controller)
[root@CentOSX ~]# hadoop-daemon.sh start datanode
View the rack topology
[root@CentOSA ~]# hdfs dfsadmin -printTopology
Rack: /rack1
   192.168.29.129:50010 (CentOSA)
   192.168.29.130:50010 (CentOSB)

Rack: /rack2
   192.168.29.131:50010 (CentOSC)
Starting and stopping the cluster
[root@CentOSA ~]# start|stop-dfs.sh    # can be run on any node
If the NameNode fails to start during a restart because the JournalNodes initialize too slowly, run hadoop-daemon.sh start namenode on the NameNode that failed.
Building the YARN cluster
Edit mapred-site.xml
<property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
</property>
Edit yarn-site.xml
<property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
</property>
<property>
    <name>yarn.resourcemanager.ha.enabled</name>
    <value>true</value>
</property>
<property>
    <name>yarn.resourcemanager.cluster-id</name>
    <value>cluster1</value>
</property>
<property>
    <name>yarn.resourcemanager.ha.rm-ids</name>
    <value>rm1,rm2</value>
</property>
<property>
    <name>yarn.resourcemanager.hostname.rm1</name>
    <value>CentOSB</value>
</property>
<property>
    <name>yarn.resourcemanager.hostname.rm2</name>
    <value>CentOSC</value>
</property>
<property>
    <name>yarn.resourcemanager.zk-address</name>
    <value>CentOSA:2181,CentOSB:2181,CentOSC:2181</value>
</property>
Start YARN
[root@CentOSB ~]# yarn-daemon.sh start resourcemanager
[root@CentOSC ~]# yarn-daemon.sh start resourcemanager
[root@CentOSX ~]# yarn-daemon.sh start nodemanager
Check the ResourceManager HA state
[root@CentOSA ~]# yarn rmadmin -getServiceState rm1
active
[root@CentOSA ~]# yarn rmadmin -getServiceState rm2
standby