Project in Practice from 0 to 1: Hive (34) Big Data Project: E-commerce Data Warehouse (User Behavior Data Collection) (Part 2)

Chapter 4 Data Collection Module

4.1 Hadoop Installation

1) Cluster planning: [figure: cluster plan]

Note: use the offline installation method whenever possible.

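4.1.1 Project Experience: Multiple HDFS Storage Directories

If HDFS storage space is running low, the DataNode's disks need to be expanded.

1) Add disks to the DataNode node and mount them.

2) Configure multiple directories in hdfs-site.xml, and pay attention to the access permissions of the newly mounted disks.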

<property>
   <name>dfs.datanode.data.dir</name>
   <value>file:///${hadoop.tmp.dir}/dfs/data1,file:///hd2/dfs/data2,file:///hd3/dfs/data3,file:///hd4/dfs/data4</value>
</property>
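
The permission check mentioned in step 2 can be sketched in shell. This is a minimal sketch, not part of the original setup: the helper name `datanode_dirs` is hypothetical, and the script is assumed to run on the DataNode as the user that starts the DataNode process. The `${hadoop.tmp.dir}` entry is left out because that placeholder is expanded by Hadoop, not by the shell.

```shell
# Split a dfs.datanode.data.dir value into local paths, one per line.
# The file:// scheme prefix is stripped so the paths can be tested directly.
# (datanode_dirs is a hypothetical helper name, not a Hadoop command.)
datanode_dirs() {
  printf '%s\n' "$1" | tr ',' '\n' | sed -e 's#^file://##' -e '/^$/d'
}

# The three extra disks from the property above.
dirs="file:///hd2/dfs/data2,file:///hd3/dfs/data3,file:///hd4/dfs/data4"

# Report whether each directory exists and is writable by the current user.
for d in $(datanode_dirs "$dirs"); do
  if [ -d "$d" ] && [ -w "$d" ]; then
    echo "OK     $d"
  else
    echo "CHECK  $d (missing or not writable by $(id -un))"
  fi
done
```

Any directory reported as CHECK would need to be created or have its ownership adjusted before the DataNode is restarted.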

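4.1.2 Project Experience: Configuring LZO Compression Support

1) Hadoop itself does not support LZO compression, so the open-source hadoop-lzo component provided by Twitter is required. hadoop-lzo must be compiled against Hadoop and LZO; the steps are as follows.

2) Place the compiled hadoop-lzo-0.4.20.jar in hadoop-2.7.2/share/hadoop/common/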

[kgg@hadoop101 common]$ pwd
/opt/module/hadoop-2.7.2/share/hadoop/common
[kgg@hadoop101 common]$ ls
hadoop-lzo-0.4.20.jar

3) Sync hadoop-lzo-0.4.20.jar to hadoop102 and hadoop103

[kgg@hadoop101 common]$ xsync hadoop-lzo-0.4.20.jar

4) Add configuration to core-site.xml to support LZO compression

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>io.compression.codecs</name>
<value>
org.apache.hadoop.io.compress.GzipCodec,
org.apache.hadoop.io.compress.DefaultCodec,
org.apache.hadoop.io.compress.BZip2Codec,
org.apache.hadoop.io.compress.SnappyCodec,
com.hadoop.compression.lzo.LzoCodec,
com.hadoop.compression.lzo.LzopCodec
</value>
</property>
<property>
   <name>io.compression.codec.lzo.class</name>
   <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
</configuration>
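
To confirm that both LZO codec classes ended up in the codec list, something like the following could be used. It is only a sketch: `has_lzo_codecs` is a hypothetical helper, and a literal value stands in for the real configuration so that the snippet runs without Hadoop.

```shell
# Succeed only if both LZO codec classes appear in a comma-separated
# codec list read from stdin (whitespace and line breaks are ignored).
# (has_lzo_codecs is a hypothetical helper name.)
has_lzo_codecs() {
  local codecs
  codecs="$(tr -d ' \t\n' | tr ',' '\n')"
  printf '%s\n' "$codecs" | grep -qxF 'com.hadoop.compression.lzo.LzoCodec' &&
  printf '%s\n' "$codecs" | grep -qxF 'com.hadoop.compression.lzo.LzopCodec'
}

# On a cluster node the value would typically come from:
#   hdfs getconf -confKey io.compression.codecs | has_lzo_codecs
# Here a shortened literal value is used instead.
sample="org.apache.hadoop.io.compress.GzipCodec,
com.hadoop.compression.lzo.LzoCodec,
com.hadoop.compression.lzo.LzopCodec"

if printf '%s' "$sample" | has_lzo_codecs; then
  echo "LZO codecs registered"
else
  echo "LZO codecs missing"
fi
```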

5) Sync core-site.xml to hadoop102 and hadoop103

[kgg@hadoop101 hadoop]$ xsync core-site.xml

6) Start and check the cluster

[kgg@hadoop101 hadoop-2.7.2]$ sbin/start-dfs.sh
[kgg@hadoop102 hadoop-2.7.2]$ sbin/start-yarn.sh

7) Test

yarn jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar wordcount -Dmapreduce.output.fileoutputformat.compress=true -Dmapreduce.output.fileoutputformat.compress.codec=com.hadoop.compression.lzo.LzopCodec /input /output

8) Create an index for the LZO files

hadoop jar ./share/hadoop/common/hadoop-lzo-0.4.20.jar com.hadoop.compression.lzo.DistributedLzoIndexer /output
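
The indexer writes a `.index` file next to each `.lzo` file, and it is this index that lets MapReduce split an LZO file across several map tasks. A rough way to spot files that still lack an index is sketched below. The helper name `unindexed_lzo` and the sample file names are hypothetical.

```shell
# Print every *.lzo path from stdin that has no matching *.lzo.index path.
# (unindexed_lzo is a hypothetical helper name.)
unindexed_lzo() {
  local listing
  listing="$(cat)"
  printf '%s\n' "$listing" | grep '\.lzo$' | while IFS= read -r f; do
    printf '%s\n' "$listing" | grep -qxF "$f.index" || printf '%s\n' "$f"
  done
}

# On a cluster node the listing could come from, for example:
#   hadoop fs -ls /output | awk '{print $NF}' | unindexed_lzo
# Here hypothetical paths are used so the snippet runs without Hadoop.
printf '%s\n' \
  /output/part-r-00000.lzo \
  /output/part-r-00000.lzo.index \
  /output/part-r-00001.lzo | unindexed_lzo
# prints /output/part-r-00001.lzo
```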

4.1.3 Project Experience: Benchmarking

1) Test HDFS write performance. Test content: write ten 128 MB files to the HDFS cluster
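
As a quick sanity check, the expected data volume can be worked out before reading the job output. The sketch below assumes 1 MB = 1024 x 1024 bytes; `expected_bytes` is a hypothetical helper name.

```shell
# Expected bytes written by the test:
# number of files x file size in MB x 1024 x 1024.
# (expected_bytes is a hypothetical helper name.)
expected_bytes() {
  echo $(( $1 * $2 * 1024 * 1024 ))
}

expected_bytes 1 128    # 134217728, the size of one file
expected_bytes 10 128   # 1342177280, the total across the 10 files
```

The per-file figure matches the "134217728 bytes, 10 files" line in the log for this run, and the total is within a few dozen bytes of the "HDFS: Number of bytes written" counter, which also includes the small amount of job output.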

[kgg@hadoop101 mapreduce]$ hadoop jar /opt/module/hadoop-2.7.2/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-2.7.2-tests.jar TestDFSIO -write -nrFiles 10 -fileSize 128MB

19/05/02 11:44:26 INFO fs.TestDFSIO: TestDFSIO.1.8
19/05/02 11:44:26 INFO fs.TestDFSIO: nrFiles = 10
19/05/02 11:44:26 INFO fs.TestDFSIO: nrBytes (MB) = 128.0
19/05/02 11:44:26 INFO fs.TestDFSIO: bufferSize = 1000000
19/05/02 11:44:26 INFO fs.TestDFSIO: baseDir = /benchmarks/TestDFSIO
19/05/02 11:44:28 INFO fs.TestDFSIO: creating control file: 134217728 bytes, 10 files
19/05/02 11:44:30 INFO fs.TestDFSIO: created control files for: 10 files
19/05/02 11:44:30 INFO client.RMProxy: Connecting to ResourceManager at hadoop102/192.168.1.103:8032
19/05/02 11:44:31 INFO client.RMProxy: Connecting to ResourceManager at hadoop102/192.168.1.103:8032
19/05/02 11:44:32 INFO mapred.FileInputFormat: Total input paths to process : 10
19/05/02 11:44:32 INFO mapreduce.JobSubmitter: number of splits:10
19/05/02 11:44:33 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1556766549220_0003
19/05/02 11:44:34 INFO impl.YarnClientImpl: Submitted application application_1556766549220_0003
19/05/02 11:44:34 INFO mapreduce.Job: The url to track the job: http://hadoop102:8088/proxy/application_1556766549220_0003/
19/05/02 11:44:34 INFO mapreduce.Job: Running job: job_1556766549220_0003
19/05/02 11:44:47 INFO mapreduce.Job: Job job_1556766549220_0003 running in uber mode : false
19/05/02 11:44:47 INFO mapreduce.Job:  map 0% reduce 0%
19/05/02 11:45:05 INFO mapreduce.Job:  map 13% reduce 0%
19/05/02 11:45:06 INFO mapreduce.Job:  map 27% reduce 0%
19/05/02 11:45:08 INFO mapreduce.Job:  map 43% reduce 0%
19/05/02 11:45:09 INFO mapreduce.Job:  map 60% reduce 0%
19/05/02 11:45:10 INFO mapreduce.Job:  map 73% reduce 0%
19/05/02 11:45:15 INFO mapreduce.Job:  map 77% reduce 0%
19/05/02 11:45:18 INFO mapreduce.Job:  map 87% reduce 0%
19/05/02 11:45:19 INFO mapreduce.Job:  map 100% reduce 0%
19/05/02 11:45:21 INFO mapreduce.Job:  map 100% reduce 100%
19/05/02 11:45:22 INFO mapreduce.Job: Job job_1556766549220_0003 completed successfully
19/05/02 11:45:22 INFO mapreduce.Job: Counters: 51
        File System Counters
                FILE: Number of bytes read=856
                FILE: Number of bytes written=1304826
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=2350
                HDFS: Number of bytes written=1342177359
                HDFS: Number of read operations=43
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=12
        Job Counters
                Killed map tasks=1
                Launched map tasks=10
                Launched reduce tasks=1
                Data-local map tasks=8
                Rack-local map tasks=2
                Total time spent by all maps in occupied slots (ms)=263635
                Total time spent by all reduces in occupied slots (ms)=9698
                Total time spent by all map tasks (ms)=263635
                Total time spent by all reduce tasks (ms)=9698
                Total vcore-milliseconds taken by all map tasks=263635
                Total vcore-milliseconds taken by all reduce tasks=9698
                Total megabyte-milliseconds taken by all map tasks=269962240
                Total megabyte-milliseconds taken by all reduce tasks=9930752
        Map-Reduce Framework
                Map input records=10
                Map output records=50
                Map output bytes=750
                Map output materialized bytes=910
                Input split bytes=1230
                Combine input records