1. Architecture and processing flow diagram
2. Develop a log generator and output the logs through log4j
test\java
LoggerGenerator.java
import org.apache.log4j.Logger;
/**
* Simulates log generation
*/
public class LoggerGenerator {
private static Logger logger = Logger.getLogger(LoggerGenerator.class.getName());
public static void main(String[] args) throws Exception{
int index = 0;
while(true) {
Thread.sleep(1000);
logger.info("value : " + index++);
}
}
}
test\resources
log4j.properties
log4j.rootLogger=INFO,stdout,flume
log4j.appender.stdout = org.apache.log4j.ConsoleAppender
log4j.appender.stdout.target = System.out
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%d{yyyy-MM-dd HH:mm:ss,SSS} [%t] [%c] [%p] - %m%n
log4j.appender.flume = org.apache.flume.clients.log4jappender.Log4jAppender
log4j.appender.flume.Hostname = spark01
log4j.appender.flume.Port = 41414
log4j.appender.flume.UnsafeMode = true
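To make the stdout ConversionPattern above concrete, here is a minimal sketch that rebuilds one console line by hand. It does not use log4j; the class name PatternSketch, the fixed UTC time zone, and the sample timestamp are assumptions made only for illustration.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

/** Rebuilds by hand the line that the stdout ConversionPattern describes; log4j itself is not used. */
final class PatternSketch {

    /** Mirrors: %d{yyyy-MM-dd HH:mm:ss,SSS} [%t] [%c] [%p] - %m%n */
    static String format(Date when, String thread, String category, String level, String message) {
        SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss,SSS");
        fmt.setTimeZone(TimeZone.getTimeZone("UTC")); // assumption: fixed zone so the output is reproducible
        return fmt.format(when) + " [" + thread + "] [" + category + "] [" + level + "] - "
                + message + System.lineSeparator();
    }

    public static void main(String[] args) {
        // %t = thread name, %c = logger name, %p = level, %m = message
        System.out.print(format(new Date(0L), "main", "LoggerGenerator", "INFO", "value : 0"));
        // prints 1970-01-01 00:00:00,000 [main] [LoggerGenerator] [INFO] - value : 0
    }
}
```

When LoggerGenerator runs, the console lines should have this shape, with the real timestamp and an increasing index.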
3. Use Flume to collect the logs produced by log4j
The key step is writing the Flume agent configuration file:
cd $FLUME_HOME
cd conf
vim streaming.conf
agent1.sources =avro-source
agent1.channels=logger-channel
agent1.sinks=log-sink
#define source
agent1.sources.avro-source.type=avro
agent1.sources.avro-source.bind=0.0.0.0
agent1.sources.avro-source.port=41414
#define channel
agent1.channels.logger-channel.type=memory
#define sink
agent1.sinks.log-sink.type=logger
agent1.sources.avro-source.channels=logger-channel
agent1.sinks.log-sink.channel=logger-channel
Save the file, then start the agent with this configuration:
flume-ng agent \
--conf $FLUME_HOME/conf \
--conf-file $FLUME_HOME/conf/streaming.conf \
--name agent1 \
-Dflume.root.logger=INFO,console
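The agent above wires one source to one sink through a memory channel. As a rough mental model only, the channel behaves like a bounded in-memory buffer between the two sides. The sketch below is a toy in plain Java: ChannelSketch and its methods are hypothetical names, no Flume classes are used, and the capacity of 2 is arbitrary.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/** Toy model of source -> memory channel -> sink; it does not use any Flume classes. */
final class ChannelSketch {
    private final BlockingQueue<String> channel;

    ChannelSketch(int capacity) {
        this.channel = new ArrayBlockingQueue<>(capacity);
    }

    /** Source side: returns false when the buffer is full and the event is not accepted. */
    boolean put(String event) {
        return channel.offer(event);
    }

    /** Sink side: removes everything currently buffered, in arrival order. */
    List<String> drain() {
        List<String> out = new ArrayList<>();
        channel.drainTo(out);
        return out;
    }

    public static void main(String[] args) {
        ChannelSketch agent = new ChannelSketch(2);
        System.out.println(agent.put("value : 0")); // prints true
        System.out.println(agent.put("value : 1")); // prints true
        System.out.println(agent.put("value : 2")); // prints false, capacity of 2 reached
        System.out.println(agent.drain());          // prints [value : 0, value : 1]
    }
}
```

The point of the model is that the source and the sink never talk to each other directly, which is why the sink can later be swapped without touching the source definition.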
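4. Use the Kafka sink to send the data collected by Flume to Kafka
Note: the streaming.conf from step 3 uses a logger sink, which only prints events on the Flume console. For this step the sink has to be switched to Flume's Kafka sink (type org.apache.flume.sink.kafka.KafkaSink) writing to the topic created in 4.1, and the agent restarted.
Start Kafka in the background: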
cd /opt/kafka_2.11
bin/kafka-server-start.sh -daemon config/server.properties
4.1 Create a topic
> bin/kafka-topics.sh --create --zookeeper spark01:2181 --replication-factor 1 --partitions 1 --topic streamingtopic
4.2 List the topics
> bin/kafka-topics.sh --list --zookeeper spark01:2181
4.3 Consume messages
> bin/kafka-console-consumer.sh --zookeeper spark01:2181 --from-beginning --topic streamingtopic
4.4 Run LoggerGenerator.java
5. Use Spark Streaming to consume the Kafka data and compute statistics
package com.yys.spark
import org.apache.spark.SparkConf
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}
/**
* Spark Streaming integration with Kafka
*/
object KafkaStreamingApp {
def main(args: Array[String]): Unit = {
if(args.length != 4) {
System.err.println("Usage: KafkaStreamingApp <zkQuorum> <group> <topics> <numThreads>")
// Exit here; otherwise the pattern match on args below fails with a MatchError
System.exit(1)
}
val Array(zkQuorum, group, topics, numThreads) = args
val sparkConf = new SparkConf().setAppName("KafkaReceiverWordCount")
.setMaster("local[2]")
val ssc = new StreamingContext(sparkConf, Seconds(5))
val topicMap = topics.split(",").map((_, numThreads.toInt)).toMap
// TODO... how Spark Streaming connects to Kafka
val messages = KafkaUtils.createStream(ssc, zkQuorum, group,topicMap)
// TODO... test for yourself why the second element is taken
// Business logic
messages.map(_._2).count().print()
ssc.start()
ssc.awaitTermination()
}
}
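On the TODO about taking the second element: KafkaUtils.createStream yields (key, value) pairs, and the log line is the value, so messages.map(_._2) keeps just the payload before counting. The sketch below imitates one micro-batch in plain Java with no Spark or Kafka involved. BatchCountSketch is a hypothetical name, and the null keys are an assumption about a producer that sets no message key.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

/** Plain-Java stand-in for one micro-batch; no Spark or Kafka involved. */
final class BatchCountSketch {

    /** Equivalent of messages.map(_._2): keep only the value of each (key, value) pair. */
    static List<String> values(List<Map.Entry<String, String>> batch) {
        return batch.stream().map(Map.Entry::getValue).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Hypothetical batch; keys are null on the assumption that the producer set none.
        List<Map.Entry<String, String>> batch = Arrays.asList(
                new SimpleEntry<String, String>(null, "value : 0"),
                new SimpleEntry<String, String>(null, "value : 1"),
                new SimpleEntry<String, String>(null, "value : 2"));
        List<String> payloads = values(batch);
        System.out.println(payloads);        // prints [value : 0, value : 1, value : 2]
        System.out.println(payloads.size()); // prints 3, what count() would report for this batch
    }
}
```

With a 5-second batch interval and one log line per second, the real job should print counts of roughly 5 per batch.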
In IDEA, set the program arguments to: spark01:2181 test streamingtopic 1
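To show how the third and fourth program arguments become the topicMap passed to createStream, here is a small Java sketch of the same transformation. TopicMapSketch and parseTopicMap are hypothetical names; the logic only mirrors topics.split(",").map((_, numThreads.toInt)).toMap from KafkaStreamingApp.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper that mirrors the topicMap line in KafkaStreamingApp. */
final class TopicMapSketch {

    /** Maps every comma-separated topic name to the same receiver thread count. */
    static Map<String, Integer> parseTopicMap(String topics, int numThreads) {
        Map<String, Integer> topicMap = new LinkedHashMap<>();
        for (String topic : topics.split(",")) {
            topicMap.put(topic, numThreads);
        }
        return topicMap;
    }

    public static void main(String[] args) {
        // Same values as the IDEA program arguments: <zkQuorum> <group> <topics> <numThreads>
        String[] programArgs = {"spark01:2181", "test", "streamingtopic", "1"};
        Map<String, Integer> topicMap =
                parseTopicMap(programArgs[2], Integer.parseInt(programArgs[3]));
        System.out.println(topicMap); // prints {streamingtopic=1}
    }
}
```

Passing several topics separated by commas gives each of them the same thread count.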
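6. Local testing versus production use
For local testing, run LoggerGenerator in IDEA, then let Flume, Kafka, and Spark Streaming process the output.
Production works differently:
1) Package LoggerGenerator into a jar and run the class from it
2) Flume and Kafka are set up the same way as in the local test
3) The Spark Streaming code is also packaged into a jar and submitted to the cluster with spark-submit.
Choose the run mode to fit the environment: local/yarn/standalone/mesos. The hard-coded setMaster("local[2]") should be removed first, because a master set in code takes precedence over the one passed to spark-submit.
In production the overall stream-processing flow is the same; the difference lies in the complexity of the business logic.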