今天的数据是动态的和被驱动的应用。一个商业应用驱动的的新时代的发展由动态产业如网络、社会、移动和物联网等与新数据类型和新的数据模型造成数据集。这些应用程序是迭代的,相关的数据模型一般是半结构化的,非模式化和不断进化的。半结构化数据模型能够被复杂/嵌套,非模式化,可以在每一行和不一样领域持续发展达到频繁添加和删除以知足业务需求。apache
本教程向您展现了如何查询本地动态数据集,如JSON和在几分钟内查询来自任何类型的数据派生的洞察力。被使用的数据集的例子是Yelp签到数据集,它具备如下结构:json
check-in { 'type': 'checkin', 'business_id': (encrypted business id), 'checkin_info': { '0-0': (number of checkins from 00:00 to 01:00 on all Sundays), '1-0': (number of checkins from 01:00 to 02:00 on all Sundays), ... '14-4': (number of checkins from 14:00 to 15:00 on all Thursdays), ... '23-6': (number of checkins from 23:00 to 00:00 on all Saturdays) }, # if there was no checkin for a hour-day block it will not be in the dataset }
值得重复的评论这个代码片断的底部:数组
# if there was no checkin for a hour-day block it will not be in the dataset.
你在checkin_info看到的元素名称是未知的前期而且每一行可能会有所不一样。数据,虽然简单,可是高度动态的数据。分析数据不须要先在相关平面结构表示这个数据集,由于您将在Hadoop使用任何其余SQL技术。网络
第1步:首先下载钻,若是你尚未这样作,到您的机器中函数
http://drill.apache.org/download/ tar -xvf apache-drill-0.9.0.tar
在本地桌面安装钻(嵌入模式)。你不须要Hadoop。oop
♣第二步:开始钻壳。code
bin/drill-embedded
第三步:开始使用SQL分析数据排序
首先,咱们先看一下数据集:教程
0: jdbc:drill:zk=local> SELECT * FROM dfs.`/home/zj/software/drill/yelp/yelp_academic_dataset_checkin.json` limit 2; +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------+------------------------+ | checkin_info | type | business_id | +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------+------------------------+ | {"3-4":1,"13-5":1,"6-6":1,"14-5":1,"14-6":1,"14-2":1,"14-3":1,"19-0":1,"11-5":1,"13-2":1,"11-6":2,"11-3":1,"12-6":1,"6-5":1,"5-5":1,"9-2":1,"9-5":1,"9-6":1,"5-2":1,"7-6":1,"7-5":1,"7-4":1,"17-5":1,"8-5":1,"10-2":1,"10-5":1,"10-6":1} | checkin | JwUE5GmEO-sH1FuwJgKBlQ | | {"6-6":2,"6-5":1,"7-6":1,"7-5":1,"8-5":2,"10-5":1,"9-3":1,"12-5":1,"15-3":1,"15-5":1,"15-6":1,"16-3":1,"10-0":1,"15-4":1,"10-4":1,"8-2":1} | checkin | uGykseHzyS5xAMWoN6YUqA | +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------+------------------------+
请注意:
本文档将钻例如输出的目的。在这种状况下钻输出不对齐。文档
你直接查询JSON文件中的数据。定义模式在hive仓库并非必须的。内部元素的名称 checkin_info列第一行和第二行之间是不一样的。
▲▲▲▲▲钻提供了一个名为KVGEN的函数(键值生成器)在处理复杂的数据时是有用的,其中包含任意地组成的动态图和未知的元素名称例如checkin_info。KVGEN转动态映射到一个数组的键值对,其中键表明动态元素名称。
让咱们在checkin_info元素应用KVGEN生成键-值对。
0: jdbc:drill:zk=local> SELECT KVGEN(checkin_info) checkins FROM dfs.`/users/nrentachintala/Downloads/yelp/yelp_academic_dataset_checkin.json` LIMIT 2; +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | checkins | +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | [{"key":"3-4","value":1},{"key":"13-5","value":1},{"key":"6-6","value":1},{"key":"14-5","value":1},{"key":"14-6","value":1},{"key":"14-2","value":1},{"key":"14-3","value":1},{"key":"19-0","value":1},{"key":"11-5","value":1},{"key":"13-2","value":1},{"key":"11-6","value":2},{"key":"11-3","value":1},{"key":"12-6","value":1},{"key":"6-5","value":1},{"key":"5-5","value":1},{"key":"9-2","value":1},{"key":"9-5","value":1},{"key":"9-6","value":1},{"key":"5-2","value":1},{"key":"7-6","value":1},{"key":"7-5","value":1},{"key":"7-4","value":1},{"key":"17-5","value":1},{"key":"8-5","value":1},{"key":"10-2","value":1},{"key":"10-5","value":1},{"key":"10-6","value":1}] | | [{"key":"6-6","value":2},{"key":"6-5","value":1},{"key":"7-6","value":1},{"key":"7-5","value":1},{"key":"8-5","value":2},{"key":"10-5","value":1},{"key":"9-3","value":1},{"key":"12-5","value":1},{"key":"15-3","value":1},{"key":"15-5","value":1},{"key":"15-6","value":1},{"key":"16-3","value":1},{"key":"10-0","value":1},{"key":"15-4","value":1},{"key":"10-4","value":1},{"key":"8-2","value":1}] | +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
钻提供了另外一个函数来操做复杂的数据称为“Flatten”,打破“KVGen”形成的键值对列表到单独的行进一步应用分析功能。
0: jdbc:drill:zk=local> SELECT FLATTEN(KVGEN(checkin_info)) checkins FROM dfs.`/users/nrentachintala/Downloads/yelp/yelp_academic_dataset_checkin.json` LIMIT 20; +--------------------------+ | checkins | +--------------------------+ | {"key":"3-4","value":1} | | {"key":"13-5","value":1} | | {"key":"6-6","value":1} | | {"key":"14-5","value":1} | | {"key":"14-6","value":1} | | {"key":"14-2","value":1} | | {"key":"14-3","value":1} | | {"key":"19-0","value":1} | | {"key":"11-5","value":1} | | {"key":"13-2","value":1} | | {"key":"11-6","value":2} | | {"key":"11-3","value":1} | | {"key":"12-6","value":1} | | {"key":"6-5","value":1} | | {"key":"5-5","value":1} | | {"key":"9-2","value":1} | | {"key":"9-5","value":1} | | {"key":"9-6","value":1} | | {"key":"5-2","value":1} | | {"key":"7-6","value":1} | +--------------------------+
能够从数据中快速的得到值,经过在数据集应用KVGEN和FLATTEN函数动态数据集在fly——no须要耗时的模式定义和数据存储中间格式。
夷为平地的数据输出,您可使用标准SQL功能,如过滤器、聚合、排序。 让咱们看看几个例子。
▲▲▲▲▲在Yelp数据集获取签到记录的总数
0: jdbc:drill:zk=local> SELECT SUM(checkintbl.checkins.`value`) AS TotalCheckins FROM ( . . . . . . . . . . . > SELECT FLATTEN(KVGEN(checkin_info)) checkins FROM dfs.`/users/nrentachintala/Downloads/yelp/yelp_academic_dataset_checkin.json` ) checkintbl . . . . . . . . . . . > ; +---------------+ | TotalCheckins | +---------------+ | 4713811 | +---------------+
▲▲▲▲▲获取专门为周日午夜的签到的数量
0: jdbc:drill:zk=local> SELECT SUM(checkintbl.checkins.`value`) AS SundayMidnightCheckins FROM ( . . . . . . . . . . . > SELECT FLATTEN(KVGEN(checkin_info)) checkins FROM dfs.`/users/nrentachintala/Downloads/yelp/yelp_academic_dataset_checkin.json` ) checkintbl WHERE checkintbl.checkins.key='23-0'; +------------------------+ | SundayMidnightCheckins | +------------------------+ | 8575 | +------------------------+
▲▲▲▲▲获取每星期几签到的数量
0: jdbc:drill:zk=local> SELECT `right`(checkintbl.checkins.key,1) WeekDay,sum(checkintbl.checkins.`value`) TotalCheckins from ( . . . . . . . . . . . > select flatten(kvgen(checkin_info)) checkins FROM dfs.`/users/nrentachintala/Downloads/yelp/yelp_academic_dataset_checkin.json` ) checkintbl GROUP BY `right`(checkintbl.checkins.key,1) ORDER BY TotalCheckins; +------------+---------------+ | WeekDay | TotalCheckins | +------------+---------------+ | 1 | 545626 | | 0 | 555038 | | 2 | 555747 | | 3 | 596296 | | 6 | 735830 | | 4 | 788073 | | 5 | 937201 | +------------+---------------+
▲▲▲▲▲获取一天中每小时的签到数量
0: jdbc:drill:zk=local> SELECT SUBSTR(checkintbl.checkins.key,1,strpos(checkintbl.checkins.key,'-')-1) AS HourOfTheDay ,SUM(checkintbl.checkins.`value`) TotalCheckins FROM ( . . . . . . . . . . . > SELECT FLATTEN(KVGEN(checkin_info)) checkins FROM dfs.`/users/nrentachintala/Downloads/yelp/yelp_academic_dataset_checkin.json` ) checkintbl GROUP BY SUBSTR(checkintbl.checkins.key,1,strpos(checkintbl.checkins.key,'-')-1) ORDER BY TotalCheckins; +--------------+---------------+ | HourOfTheDay | TotalCheckins | +--------------+---------------+ | 3 | 20357 | | 4 | 21076 | | 2 | 28116 | | 5 | 33842 | | 1 | 45467 | | 6 | 54174 | | 0 | 74127 | | 7 | 96329 | | 23 | 102009 | | 8 | 130091 | | 22 | 140338 | | 9 | 162913 | | 21 | 211949 | | 10 | 220687 | | 15 | 261384 | | 14 | 276188 | | 16 | 292547 | | 20 | 293783 | | 13 | 328373 | | 11 | 338675 | | 17 | 374186 | | 19 | 385381 | | 12 | 399797 | | 18 | 422022 | +--------------+---------------+
总结
在本教程中,您将冲浪结构化和半结构化数据没有任何前期或ETL模式管理。