Learning Druid (Part 4): Druid's Data Ingestion Formats

Author: Syn良子 Source: http://www.javashuo.com/article/p-yxqufjvn-hq.html Please credit the source when reposting.

Druid's data ingestion formats


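Druid can ingest denormalized data in JSON, CSV, or a delimited format such as TSV, and it also supports custom formats. Although most of the documentation uses JSON, configuring Druid to ingest the other delimited formats is not difficult.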

Currently supported data formats



JSON

{"timestamp": "2013-08-31T01:02:33Z", "page": "Gypsy Danger", "language" : "en", "user" : "nuclear", "unpatrolled" : "true", "newPage" : "true", "robot": "false", "anonymous": "false", "namespace":"article", "continent":"North America", "country":"United States", "region":"Bay Area", "city":"San Francisco", "added": 57, "deleted": 200, "delta": -143}

CSV

2013-08-31T01:02:33Z,"Gypsy Danger","en","nuclear","true","true","false","false","article","North America","United States","Bay Area","San Francisco",57,200,-143

TSV

2013-08-31T01:02:33Z    "Gypsy Danger"  "en"    "nuclear"   "true"  "true"  "false" "false" "article"   "North America" "United States" "Bay Area"  "San Francisco" 57  200 -143

Note that CSV and TSV data must not contain a header row. Keep this in mind whenever you prepare data for ingestion.
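Since files exported from other tools often do start with a header row, it can help to strip it before handing the data to Druid. The following is a minimal sketch of that pre-processing step, not part of Druid itself; the helper name strip_header and the three-column sample are assumptions made up for illustration.

```python
import csv
import io


def strip_header(text, expected_columns, delimiter=","):
    """Return the parsed rows of `text`, dropping a leading header row if present.

    Hypothetical helper: the first row is treated as a header only when it
    matches the expected column names exactly.
    """
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    if rows and rows[0] == list(expected_columns):
        rows = rows[1:]
    return rows


# Made-up sample with a header row that must not reach Druid.
sample = 'timestamp,page,added\n2013-08-31T01:02:33Z,"Gypsy Danger",57\n'
rows = strip_header(sample, ["timestamp", "page", "added"])
print(rows)
```

For TSV input the same helper can be called with delimiter="\t".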

Custom formats


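Druid supports custom data formats through regular-expression parsing and JavaScript. However, this approach is less efficient than writing your own Java parser or pre-processing the data with an external stream processing tool.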
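To make the idea of regex parsing concrete, here is a small sketch of what such a parser does conceptually: each capture group of a pattern is assigned, in order, to a column name. This is an illustration in Python, not Druid's own parser code; the line layout, PATTERN, COLUMNS, and parse_line are all hypothetical.

```python
import re

# Hypothetical line layout: "<timestamp> [<language>] <page> +<added>"
PATTERN = re.compile(r"^(\S+) \[(\w+)\] (.+) \+(\d+)$")
COLUMNS = ["timestamp", "language", "page", "added"]


def parse_line(line):
    """Map each capture group to a column name; return None if the line does not match."""
    match = PATTERN.match(line)
    if match is None:
        return None
    # Every captured value is still a string at this point.
    return dict(zip(COLUMNS, match.groups()))


event = parse_line("2013-08-31T01:02:33Z [en] Gypsy Danger +57")
print(event)
```

Running a pattern against every incoming line is part of why this route tends to be slower than a purpose-built parser.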

Configuring the ingestion schema


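What is a data schema? It is the metadata describing the data source that a Druid index (ingestion) task needs. It mainly describes the type of data to ingest, which columns the data consists of, which columns are metrics, which are dimensions, the time granularity, and so on.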

Taking the CSV format as an example:

"parseSpec": {
  "format" : "csv",
  "timestampSpec" : {
    "column" : "timestamp"
  },
  "columns" : ["timestamp","page","language","user","unpatrolled","newPage","robot","anonymous","namespace","continent","country","region","city","added","deleted","delta"],
  "dimensionsSpec" : {
    "dimensions" : ["page","language","user","unpatrolled","newPage","robot","anonymous","namespace","continent","country","region","city"]
  }
}

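parseSpec specifies the format of the data source; here format says it is CSV. timestampSpec states that the timestamp column is named timestamp, columns lists the field names of each row in order, and dimensionsSpec indicates which of those fields can serve as dimensions.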
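As a rough illustration of how these settings fit together, the sketch below applies the same parseSpec to the CSV sample row from earlier: values are paired with columns by position, then split into the timestamp, the dimensions, and whatever is left over. This only mimics the idea in Python and is not how Druid is implemented; apply_parse_spec and the shape of its result are assumptions.

```python
import csv
import io

PARSE_SPEC = {
    "format": "csv",
    "timestampSpec": {"column": "timestamp"},
    "columns": ["timestamp", "page", "language", "user", "unpatrolled", "newPage",
                "robot", "anonymous", "namespace", "continent", "country", "region",
                "city", "added", "deleted", "delta"],
    "dimensionsSpec": {
        "dimensions": ["page", "language", "user", "unpatrolled", "newPage", "robot",
                       "anonymous", "namespace", "continent", "country", "region", "city"]
    },
}

LINE = ('2013-08-31T01:02:33Z,"Gypsy Danger","en","nuclear","true","true","false",'
        '"false","article","North America","United States","Bay Area",'
        '"San Francisco",57,200,-143')


def apply_parse_spec(line, spec):
    """Hypothetical helper: split one CSV line according to the spec."""
    values = next(csv.reader(io.StringIO(line)))
    if len(values) != len(spec["columns"]):
        raise ValueError("row does not match the declared columns")
    record = dict(zip(spec["columns"], values))
    ts_column = spec["timestampSpec"]["column"]
    dimensions = spec["dimensionsSpec"]["dimensions"]
    return {
        "timestamp": record[ts_column],
        "dimensions": {name: record[name] for name in dimensions},
        # Columns that are neither the timestamp nor a dimension.
        "other": {name: value for name, value in record.items()
                  if name != ts_column and name not in dimensions},
    }


parsed = apply_parse_spec(LINE, PARSE_SPEC)
print(parsed["timestamp"])
print(parsed["other"])
```

In this sample the leftover columns are added, deleted, and delta, the numeric fields of each row.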

Reference: Druid data formats
