Exporting data from a script, and setting the run queue
bin/beeline -u 'url' --outputformat=tsv -e "set mapreduce.job.queuename=queue_1" -e "select * from search_log where date <= 20150525 and date >= 20150523" > test.txt
Converting milliseconds to a date
select from_unixtime(cast(createTime/1000 as bigint)) from video_information;
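What `from_unixtime(cast(createTime/1000 as bigint))` computes can be sketched in Python. This is a minimal illustration only: Hive formats in the session time zone, while the sketch pins UTC for reproducibility, and the sample timestamp is made up.

```python
from datetime import datetime, timezone

def from_unixtime_ms(ms):
    """Mimic from_unixtime(cast(ms/1000 as bigint)): truncate
    milliseconds to whole seconds, then format as a date string."""
    return datetime.fromtimestamp(ms // 1000, tz=timezone.utc).strftime("%Y-%m-%d %H:%M:%S")

# Hypothetical createTime value in milliseconds:
print(from_unixtime_ms(1432512000000))  # 2015-05-25 00:00:00 (UTC)
```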
Parsing columns whose values are JSON. In the example below, the field data holds JSON, and its type key indicates the log type; the query retrieves search logs.
Syntax: get_json_object(field, "$.field")
select * from video where date=20151215 and get_json_object(data, "$.type")="search" limit 1;
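The behavior of get_json_object on a top-level key can be sketched in Python (a stand-in supporting only "$.key" paths, not Hive's full JSONPath subset; the sample log line is hypothetical):

```python
import json

def get_json_object(value, path):
    """Minimal stand-in for Hive's get_json_object,
    handling only top-level "$.key" paths."""
    key = path.removeprefix("$.")
    try:
        obj = json.loads(value)
    except (TypeError, ValueError):
        return None  # Hive returns NULL for malformed JSON
    return obj.get(key)

data = '{"type": "search", "query": "hive"}'
print(get_json_object(data, "$.type"))  # search
```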
Parsing JSONArray-typed columns
The table has three fields (asrtext array, asraudiourl string, asrvendor string); a sample row:
asraudiourl | string | https://xxx |
asrtext | array | [{"text":"我是业主","confidence":1.0,"queryvendor":"1009","querydebug":"{\"recordId\":\"92e12fe7\",\"applicationId\":\"\",\"eof\":1,\"result\":{\"rec\":\"我 是 业主\",\"eof\":1}}","isfinal":true}] |
select asr, asraudiourl, asrvendor from aiservice.asr_info LATERAL VIEW explode(asrtext) asrTable AS asr where date=20170523 and asrvendor='AiSpeech' and asr.isfinal=true and asr.text="我是业主" limit 1;
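What LATERAL VIEW explode(asrtext) does can be sketched in Python: each element of the array column becomes its own row, duplicating the scalar columns, and the WHERE clause then filters on fields of the exploded struct. The second array element below is invented to show the filtering.

```python
# A hypothetical row shaped like the aiservice.asr_info sample above.
row = {
    "asrvendor": "AiSpeech",
    "asraudiourl": "https://xxx",
    "asrtext": [
        {"text": "我是业主", "confidence": 1.0, "isfinal": True},
        {"text": "我是业主吗", "confidence": 0.4, "isfinal": False},  # made-up partial result
    ],
}

# explode: one output row per array element, scalar columns duplicated.
exploded = [
    {"asrvendor": row["asrvendor"], "asraudiourl": row["asraudiourl"], "asr": asr}
    for asr in row["asrtext"]
]

# WHERE asr.isfinal=true AND asr.text="我是业主"
hits = [r for r in exploded if r["asr"]["isfinal"] and r["asr"]["text"] == "我是业主"]
print(len(hits))  # 1
```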
A distinct pitfall
count(distinct ...) over multiple columns requires every field to be non-null: with distinct x, y, any tuple containing a NULL is silently dropped, skewing the count. So we manually map NULL to a placeholder value:
select count(distinct requestid, CASE WHEN resid is null THEN "1" ELSE resid END)
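The pitfall and the CASE WHEN fix can be reproduced in Python: a distinct count over (requestid, resid) drops the tuple whose resid is NULL, while coalescing NULL to "1" first keeps it. The sample rows are hypothetical.

```python
rows = [
    ("req1", "res1"),
    ("req1", None),   # NULL resid: count(distinct requestid, resid) drops this tuple
    ("req2", "res2"),
]

# Hive-like behavior: multi-column distinct ignores tuples containing a NULL.
naive = len({(req, res) for req, res in rows if req is not None and res is not None})

# CASE WHEN resid IS NULL THEN "1" ELSE resid END keeps the NULL row.
fixed = len({(req, res if res is not None else "1") for req, res in rows})

print(naive, fixed)  # 2 3
```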
Submitting jobs: a Spark job on YARN, and a MapReduce job with its queue set via -Dmapreduce.job.queuename:
$SPARK_HOME/bin/spark-submit --class com.test.SimilarQuery --master yarn-cluster --num-executors 40 --driver-memory 4g --executor-memory 2g --executor-cores 1 similar-query-0.0.1-SNAPSHOT-jar-with-dependencies.jar 20150819 /user/similar-query
hadoop jar game-query-down-0.0.1-SNAPSHOT.jar QueryDownJob -Dmapreduce.job.queuename=sns_default arg1 arg2
Common InputFormat/OutputFormat classes:
TextInputFormat: the default; reads the file line by line, the key is the line's byte offset (LongWritable) and the value is the line content (Text)
KeyValueInputFormat: parses each line into a key/value pair, key + \t + value
SequenceFileInputFormat/SequenceFileOutputFormat: binary format; key/value types are user-defined, and input and output must stay consistent
TextOutputFormat: plain-text output, one key + \t + value per line
NullOutputFormat: no output; discards all output data
MapFileOutputFormat: writes results into a MapFile. Keys in a MapFile must be sorted, so the reducer must emit its keys in order
DBInputFormat/DBOutputFormat: read from / write to a relational database via JDBC
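The key + \t + value line shape shared by KeyValueInputFormat and TextOutputFormat can be sketched in Python (a minimal parser, not Hadoop's implementation):

```python
def parse_key_value(line, sep="\t"):
    """Split a TextOutputFormat-style line into (key, value).

    Like KeyValueInputFormat: the first separator divides key from value;
    a line with no separator yields the whole line as key and "" as value.
    """
    key, _, value = line.rstrip("\n").partition(sep)
    return key, value

print(parse_key_value("hello\tworld"))  # ('hello', 'world')
print(parse_key_value("no-separator"))  # ('no-separator', '')
```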