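Building a data analysis platform on Presto and Superset. Presto can act as a data warehouse: it connects to many kinds of relational databases and NoSQL stores, and its query performance is high. Superset provides a Presto connection, which makes data visualization and dashboard creation convenient.
Basic concepts
## Data warehouse
A data warehouse integrates data from various databases and organizes it by subject so that it is easy to analyze. It stores metadata, model information, and the data itself (with indexes, caches, partitions, pre-aggregation, and so on).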
- greenplum
- hive
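## OLAP
A family of data analysis operations such as pivoting, slicing, dicing, and drilling. OLAP can run against a data warehouse or even against plain file data.
- Mondrian: an open-source OLAP engine
- MOLAP: data lives in the DW, stored in a multidimensional format
- ROLAP: data lives in a relational database
- In the big data world, many SQL-on-Hadoop systems can be regarded as OLAP engines: Drill, Impala, Kylin, Phoenix, Druid, Greenplum, HAWQ, Pinot, Presto, Spark SQL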
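The OLAP operations named above can be illustrated on a tiny in-memory fact table. This is a minimal sketch in plain Python, not how any of the listed engines implement them; the rows, the dimension names, and the helper functions (`slice_`, `dice`, `roll_up`, `pivot`) are all invented for illustration.

```python
from collections import defaultdict

# Hypothetical fact rows: (country, year, product, visits). Invented data.
FACTS = [
    ("US", 2016, "web", 120),
    ("US", 2017, "web", 150),
    ("US", 2017, "mobile", 90),
    ("DE", 2016, "web", 40),
    ("DE", 2017, "mobile", 60),
]
DIMS = ("country", "year", "product")  # the last tuple field is the measure


def slice_(rows, dim, value):
    """Slice: fix one dimension to a single value."""
    i = DIMS.index(dim)
    return [r for r in rows if r[i] == value]


def dice(rows, **allowed):
    """Dice: keep rows whose dimensions fall inside the given value sets."""
    return [
        r for r in rows
        if all(r[DIMS.index(d)] in values for d, values in allowed.items())
    ]


def roll_up(rows, *dims):
    """Roll up (the inverse of drill-down): sum the measure per chosen dimensions."""
    idx = [DIMS.index(d) for d in dims]
    totals = defaultdict(int)
    for r in rows:
        totals[tuple(r[i] for i in idx)] += r[-1]
    return dict(totals)


def pivot(rows, row_dim, col_dim):
    """Pivot: lay one dimension out as rows and another as columns."""
    table = defaultdict(dict)
    for (row_key, col_key), total in roll_up(rows, row_dim, col_dim).items():
        table[row_key][col_key] = total
    return dict(table)
```

Drilling down corresponds to calling `roll_up` with more dimensions, for example going from `roll_up(FACTS, "country")` to `roll_up(FACTS, "country", "year")`.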
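## MDX
OLAP operations are usually expressed in MDX, which queries multidimensional databases. The OLAP server translates MDX into SQL queries.
## MPP: massively parallel processing
Compared with SQL-on-Hadoop, an MPP architecture does not depend on the Hadoop/Spark runtime; MPP systems have their own native distributed execution engine.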
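The idea behind distributed execution in an MPP system can be sketched as a partial aggregation on each worker followed by a final merge. This is a conceptual sketch only: the function names are hypothetical, the "workers" run sequentially in one process, and it is not a description of any real engine's internals.

```python
from collections import Counter


def partial_count(partition):
    """Worker step: count rows per key within one data partition."""
    return Counter(partition)


def final_merge(partials):
    """Coordinator step: merge the per-worker partial counts."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


# Invented data: each inner list stands for the rows held by one worker.
partitions = [["US", "DE", "US"], ["DE", "FR"], ["US"]]
result = final_merge(partial_count(p) for p in partitions)
```

Because each worker only returns a small table of counts, the coordinator merges summaries rather than raw rows.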
Presto w/ Hive and mysql
Presto is an analytical system with an MPP architecture. From the official introduction:
Presto is a tool designed to efficiently query vast amounts of data using distributed queries. ... Presto can be and has been extended to operate over different kinds of data sources including traditional relational databases and other data sources such as Cassandra. Presto was designed to handle data warehousing and analytics: data analysis, aggregating large amounts of data and producing reports. These workloads are often classified as Online Analytical Processing (OLAP).
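Like a data warehouse, Presto can join and analyze data from multiple sources, including common relational databases and big data stores.
Example: http://getindata.com/tutorial-presto-combine-data-hive-mysql-one-sql-like-query/
Deployment components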
- download hadoop 2.6 (deploy hdfs)
- hive 1.2.2 (deploy metastore service)
- mysql
- deploy presto w/ catalog hive and mysql
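The last step, registering the hive and mysql catalogs, comes down to placing one properties file per catalog under Presto's `etc/catalog` directory. The sketch below generates those two files from Python. The host names, ports, and credentials are placeholder assumptions, and the connector name `hive-hadoop2` depends on the Presto version in use, so check it against the documentation for your release.

```python
from pathlib import Path

# Assumed values: hosts, ports, user, and password are placeholders.
CATALOGS = {
    "hive": {
        "connector.name": "hive-hadoop2",  # version-dependent connector name
        "hive.metastore.uri": "thrift://localhost:9083",
    },
    "mysql": {
        "connector.name": "mysql",
        "connection-url": "jdbc:mysql://localhost:3306",
        "connection-user": "presto",
        "connection-password": "secret",
    },
}


def render_properties(props):
    """Render a dict as Java-style key=value lines."""
    return "".join(f"{key}={value}\n" for key, value in props.items())


def write_catalogs(etc_dir):
    """Write <etc_dir>/catalog/<name>.properties for every catalog."""
    catalog_dir = Path(etc_dir) / "catalog"
    catalog_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for name, props in CATALOGS.items():
        path = catalog_dir / f"{name}.properties"
        path.write_text(render_properties(props))
        paths.append(path)
    return paths
```

The file name determines the catalog name, which is why the query later refers to tables as `hive.…` and `mysql.…`.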
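Test data
In the example, Presto connects to MySQL and Hive at the same time. MySQL holds the structured user information and Hive holds the log data. The Hive data set is fairly large at 19.15 million rows; the MySQL table has 900+ rows.
Count visits per user country, the basis for each country's share of traffic: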
SELECT u.country, COUNT(*) AS cnt FROM hive.tutorial.stream s JOIN mysql.tutorial.user u ON s.userid = u.userid GROUP BY u.country
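The query returns a raw count per country, so the share still has to be derived from those counts. A minimal sketch of that last step, assuming the result rows have been fetched as `(country, cnt)` tuples; the function name and the sample numbers are made up and are not results from the tutorial data.

```python
def country_shares(rows):
    """Turn (country, cnt) rows into {country: percentage of all visits}."""
    total = sum(cnt for _, cnt in rows)
    if total == 0:
        return {}  # avoid dividing by zero on an empty result
    return {country: 100.0 * cnt / total for country, cnt in rows}


# Made-up counts for illustration only.
sample = [("US", 600), ("DE", 300), ("FR", 100)]
shares = country_shares(sample)
```

The same ratio could instead be computed inside the SQL query, which avoids the extra client-side pass.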
Superset
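An open-source BI system with a browser/server (B/S) architecture.
## Configuring Presto
presto://192.168.56.101:8080/hive/tutorial
## SQL Lab
Select Presto as the Database; queries can then join across all data sources in the Presto catalogs.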
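The connection string above follows the pattern `presto://host:port/catalog/schema`. As a small sketch, a hypothetical helper can assemble it from its parts, which makes it clear that the last two segments are the default catalog and schema rather than a file path. Superset typically also needs a Presto SQLAlchemy dialect installed (PyHive is a common choice); treat that as an assumption to verify for your Superset version.

```python
def presto_uri(host, port, catalog, schema=None):
    """Build a presto:// SQLAlchemy-style URI; the schema part is optional."""
    uri = f"presto://{host}:{port}/{catalog}"
    if schema:
        uri += f"/{schema}"
    return uri


uri = presto_uri("192.168.56.101", 8080, "hive", "tutorial")
```

Pointing the URI at `hive/tutorial` only sets the defaults; tables in other catalogs remain reachable through fully qualified names such as `mysql.tutorial.user`.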