Open-Sourced: KSQL, a Streaming SQL Engine for Apache Kafka (2017-08-29)

Date: 2019-11-05

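Neha Narkhede, one of the creators of Kafka, published a post on the Confluent blog introducing KSQL, a new streaming SQL engine for Kafka. KSQL is intended to lower the barrier to stream processing by providing a simple and fully interactive SQL interface for processing data in Kafka. KSQL currently supports a range of streaming operations, including aggregations, joins, windowing, and sessionization.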

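Main differences from traditional SQL

KSQL differs considerably from the SQL used in relational databases. Traditional SQL statements are immediate, one-off operations: a query or an update runs against the data set as it exists at that moment. KSQL queries and updates, by contrast, run continuously, and the data set keeps growing as new data arrives. What KSQL performs is really a transformation, in other words stream processing.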
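The contrast between a one-off query and a continuous query can be sketched in a few lines of Python. This is only an illustration of the idea, not KSQL itself; the function names and the `status` field are hypothetical.

```python
from typing import Iterable, Iterator, List


def one_shot_count(rows: List[dict]) -> int:
    """Traditional style: evaluate once over the rows present right now."""
    return sum(1 for row in rows if row["status"] == "error")


def continuous_count(events: Iterable[dict]) -> Iterator[int]:
    """Streaming style: emit an updated result each time a matching event arrives."""
    count = 0
    for event in events:
        if event["status"] == "error":
            count += 1
            yield count


events = [
    {"status": "ok"},
    {"status": "error"},
    {"status": "ok"},
    {"status": "error"},
]

# The one-shot query only sees the data that existed when it ran.
snapshot_result = one_shot_count(events[:2])

# The continuous query keeps producing results as events keep arriving.
streaming_results = list(continuous_count(events))
```

The one-shot function returns a single number for a fixed snapshot, while the generator produces a new result for every matching event and never needs to be re-run.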

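Use cases for KSQL

1. Real-time monitoring

On one hand, KSQL can be used to define custom business-level metrics that are computed in real time. Low-level metrics cannot tell us how an application actually behaves, so defining metrics on the raw events the application generates gives a better picture of how it is running. On the other hand, KSQL can be used to define a standard of some kind for an application and to check whether its behavior in production meets expectations.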

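2. Security detection

KSQL converts event streams into time series of numerical values, which a visualization tool can then display in a UI. This makes it possible to detect many security threats, such as fraud and intrusion. KSQL offers a real-time, simple, and complete solution for this.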

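3. Online data integration

Most data processing goes through an ETL (Extract, Transform, Load) process, and such systems usually rely on scheduled batch jobs. The latency that batch jobs introduce is often unacceptable. With KSQL and Kafka connectors, batch data integration can be turned into online data integration. For example, a stream-table join can enrich the events in a stream with metadata stored in a table, or sensitive information can be filtered out before the data is sent to other systems.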
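The two integration steps mentioned above, enriching events from a table and removing sensitive fields, can be sketched in Python as follows. This is a rough model rather than KSQL code: the `USERS` lookup table, the `SENSITIVE_FIELDS` set, and all field names are assumptions made for the example.

```python
from typing import Dict, Iterable, Iterator

# Hypothetical reference table keyed by user id (stands in for a KSQL table).
USERS: Dict[str, dict] = {
    "u1": {"level": "Platinum"},
    "u2": {"level": "Silver"},
}

# Hypothetical names of fields that must not leave the pipeline.
SENSITIVE_FIELDS = {"card_number"}


def enrich_and_scrub(events: Iterable[dict], users: Dict[str, dict]) -> Iterator[dict]:
    """Process events one at a time, as they arrive."""
    for event in events:
        # Left join: keep the event even when the table has no matching row.
        user = users.get(event["userid"], {})
        # Drop sensitive fields before passing the event downstream.
        cleaned = {k: v for k, v in event.items() if k not in SENSITIVE_FIELDS}
        # Enrich the event with metadata from the table (None if no match).
        cleaned["level"] = user.get("level")
        yield cleaned


clicks = [
    {"userid": "u1", "page": "/checkout", "card_number": "4111-0000"},
    {"userid": "u9", "page": "/home", "card_number": "4222-0000"},
]
result = list(enrich_and_scrub(clicks, USERS))
```

Each event is handled as soon as it is read, so nothing waits for a scheduled batch job.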

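4. Application development

For complex applications, Kafka's native Streams API may be the better fit. For simple applications, or for people who prefer not to program in Java, KSQL is the better choice.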

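Core abstractions of KSQL

KSQL is built on Kafka's Streams API, so its two core concepts are the stream and the table. A stream is an unbounded sequence of structured data: new data can be appended continuously, but data already in the stream never changes, meaning it is neither modified nor deleted. A table is a view of a stream; put differently, it represents a collection of mutable data. It resembles a traditional database table, with the addition of some streaming semantics such as time windows, and the data in a table can change. KSQL brings streams and tables together, allowing a table that represents current state to be joined with a stream that represents events as they happen.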
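The relationship between an append-only stream and a mutable table can be illustrated with a small Python sketch. It assumes a stream of (key, value) change events in which a `None` value marks a deletion; that convention and the names used are choices made for this example.

```python
from typing import Dict, Iterable, List, Optional, Tuple

# (key, new value); a value of None is treated here as a delete marker.
ChangeEvent = Tuple[str, Optional[int]]


def materialize(stream: Iterable[ChangeEvent]) -> Dict[str, int]:
    """Fold an append-only stream of changes into the latest value per key."""
    table: Dict[str, int] = {}
    for key, value in stream:
        if value is None:
            table.pop(key, None)   # the row disappears from the table
        else:
            table[key] = value     # the row is inserted or overwritten
    return table


# The stream only grows; earlier events are never modified or removed.
stream: List[ChangeEvent] = [
    ("alice", 10),
    ("bob", 5),
    ("alice", 12),
    ("bob", None),
]

table = materialize(stream)
```

The stream still holds all four events, while the table holds only the current state derived from them.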

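KSQL architecture

KSQL runs as a standalone server. Multiple KSQL servers can form a cluster, and server instances can be added dynamically. The cluster is fault tolerant: if one server fails, the others take over its work. The KSQL command-line client sends queries to the cluster through a REST API, and can be used to inspect streams and tables, query data, and check the status of queries. Because it is built on the Streams API, KSQL inherits that API's elasticity, state management, and fault tolerance, and it also offers exactly-once semantics. The KSQL server embeds these capabilities and adds a distributed SQL engine, automatic bytecode generation to improve query performance, and a REST API for running queries and for administration.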

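Kafka + KSQL: upending the traditional database

Traditional relational databases are built around tables, and the log is merely an implementation detail. In an event-centric world the situation is reversed: the log becomes the core, and tables are essentially derived from it. New events are continually appended to the log, and the state of the tables changes accordingly. With Kafka as the central log and KSQL as the engine, we can create whatever materialized views we need, and those views are updated continuously.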

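The future of KSQL

At the time of the announcement, KSQL was in developer preview and the authors were still gathering feedback from the community. Planned work included more features, such as support for richer SQL syntax, to make KSQL production-ready.

A KSQL quick start guide and a demo are available. You can send feedback to the authors in the #KSQL channel on Slack, or report bugs on GitHub.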

KSQL - Streaming SQL for Apache Kafka

KSQL is now GA and officially supported by Confluent Inc. Get started with KSQL today.

KSQL is the streaming SQL engine for Apache Kafka. It provides a simple and completely interactive SQL interface for stream processing on Kafka; no need to write code in a programming language such as Java or Python. KSQL is distributed, scalable, reliable, and real-time. It supports a wide range of powerful stream processing operations including aggregations, joins, windowing, sessionization, and much more. You can find more KSQL tutorials and resources here if you are interested.

Click here to watch a screencast of the KSQL demo on YouTube.

Getting Started and Download

Stable Releases

Stable releases are published every four months and are officially supported by Confluent.

Download latest stable KSQL, which is included in Confluent Platform.
Follow the Quick Start.
Read the KSQL Documentation, notably the KSQL Tutorials and Examples, which include Docker-based variants.

Preview Releases

In addition to supported stable KSQL releases, we also provide preview releases. We encourage you to try them in development and testing environments and to take advantage of Confluent Community resources to get help and share feedback.

Download latest KSQL Preview.

Documentation

See KSQL documentation for the latest stable release.

Use Cases and Examples

Streaming ETL

Apache Kafka is a popular choice for powering data pipelines. KSQL makes it simple to transform data within the pipeline, readying messages to cleanly land in another system.

CREATE STREAM vip_actions AS
  SELECT userid, page, action
  FROM clickstream c
  LEFT JOIN users u ON c.userid = u.user_id
  WHERE u.level = 'Platinum';

Anomaly Detection

KSQL is a good fit for identifying patterns or anomalies in real-time data. By processing the stream as data arrives, you can identify and properly surface out-of-the-ordinary events with millisecond latency.

CREATE TABLE possible_fraud AS
  SELECT card_number, count(*)
  FROM authorization_attempts
  WINDOW TUMBLING (SIZE 5 SECONDS)
  GROUP BY card_number
  HAVING count(*) > 3;
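To make the semantics of the statement above concrete, here is a rough Python model of a tumbling-window count with a threshold. It is a sketch only: it processes a finite list in one pass, whereas KSQL updates its result continuously, and the timestamps and card identifiers are made up for the example.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

WINDOW_SIZE_MS = 5_000  # mirrors WINDOW TUMBLING (SIZE 5 SECONDS)
THRESHOLD = 3           # mirrors HAVING count(*) > 3


def possible_fraud(attempts: Iterable[Tuple[int, str]]) -> Dict[Tuple[int, str], int]:
    """Count attempts per (window start, card) and keep counts above the threshold."""
    counts: Counter = Counter()
    for timestamp_ms, card_number in attempts:
        # Tumbling windows are fixed-size and non-overlapping, so every
        # timestamp falls into exactly one window.
        window_start = timestamp_ms - (timestamp_ms % WINDOW_SIZE_MS)
        counts[(window_start, card_number)] += 1
    return {key: n for key, n in counts.items() if n > THRESHOLD}


# (event time in milliseconds, card identifier), all values hypothetical.
attempts = [
    (1_000, "card-A"), (1_500, "card-A"), (2_000, "card-A"), (4_900, "card-A"),
    (5_100, "card-A"),
    (1_200, "card-B"), (6_000, "card-B"),
]

flagged = possible_fraud(attempts)
```

Only card-A in the first window is flagged: it has four attempts inside a single five-second window, while every other (window, card) pair has one.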

Monitoring

Kafka's ability to provide scalable, ordered messages with stream processing makes it a common solution for log data monitoring and alerting. KSQL lends a familiar syntax for tracking, understanding, and managing alerts.

CREATE TABLE error_counts AS
  SELECT error_code, count(*)
  FROM monitoring_stream
  WINDOW TUMBLING (SIZE 1 MINUTE)
  WHERE type = 'ERROR'
  GROUP BY error_code;

Join the Community

You can get help, learn how to contribute to KSQL, and find the latest news by connecting with the Confluent community.

Ask a question in the #ksql channel in our public Confluent Community Slack. Account registration is free and self-service.
Join the Confluent Google group.

Contributing

Contributions to the code, examples, documentation, etc. are very much appreciated.

Report issues and bugs directly in this GitHub project.
Learn how to work with the KSQL source code, including building and testing KSQL as well as contributing code changes to KSQL by reading our Development and Contribution guidelines.
One good way to get started is by tackling a newbie issue.

License

The project is licensed under the Confluent Community License.

Apache, Apache Kafka, Kafka, and associated open source project names are trademarks of the Apache Software Foundation.