The summarization pattern is about obtaining an overall picture of the data. It comes in three main kinds:
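-- HSQL
SELECT MIN(num), MAX(num), COUNT(num) FROM table GROUP BY groupcol;
-- Pig
b = GROUP a BY groupcol;
-- COUNT_STAR also counts rows whose num is NULL; COUNT(a.num) would match COUNT(num) exactly
c = FOREACH b GENERATE group, MIN(a.num), MAX(a.num), COUNT_STAR(a);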
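To make the grouping semantics concrete, here is a minimal in-memory sketch in Python of the same min/max/count aggregation. The function name `summarize` and the sample rows are hypothetical; this only illustrates what the query computes, not how Hive or Pig execute it.

```python
def summarize(rows):
    """Return {group: (min, max, count)} for an iterable of (group, num) pairs."""
    stats = {}
    for group, num in rows:
        if group not in stats:
            # First record of the group seeds all three aggregates.
            stats[group] = (num, num, 1)
        else:
            lo, hi, n = stats[group]
            stats[group] = (min(lo, num), max(hi, num), n + 1)
    return stats


# Hypothetical sample data: (groupcol, num)
rows = [("a", 3), ("b", 7), ("a", 1), ("b", 9), ("a", 5)]
result = summarize(rows)
print(result)
```

Each group's state is a small (min, max, count) tuple that is updated one record at a time, so the full data set never has to be held in memory.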
The filtering pattern leaves the original records unchanged and extracts a subset of them. It is mainly applied in the following ways:
-- HSQL
SELECT * FROM table WHERE value < 3;
-- Pig
b = FILTER a BY value < 3;
-- HSQL
SELECT * FROM table ORDER BY col DESC LIMIT 10;
-- Pig
b = ORDER a BY col DESC;
c = LIMIT b 10;
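The top-ten query above can also be read as "keep only the ten largest records seen so far". A minimal Python sketch of that idea follows, using the standard-library `heapq.nlargest`; the helper name `top_n` and the sample values are hypothetical.

```python
import heapq


def top_n(rows, n=10, key=lambda r: r):
    """Return the n largest rows by key, in descending order."""
    return heapq.nlargest(n, rows, key=key)


# Hypothetical sample column values
values = [5, 42, 7, 19, 3, 88, 61, 2, 30, 11, 54, 76]
top = top_n(values)
print(top)
```

`heapq.nlargest` keeps at most `n` candidates at a time, so it avoids the full sort that `ORDER BY ... LIMIT` expresses literally.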
-- HSQL
SELECT DISTINCT * FROM table;
-- Pig
b = DISTINCT a;
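The data organization pattern reorganizes a data set; the emphasis is on amplifying the value of individual records to the level of the whole data set. The main design patterns are: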
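-- HSQL
-- Rarely used in relational databases; an RDBMS usually solves this kind of problem by joining the data first and then analyzing the result.
-- Pig
-- Pig has some support for hierarchical data structures, including nested bags and tuples.
a = LOAD '/data/a' USING PigStorage('|');
b = LOAD '/data/b' USING PigStorage(',');
group_c = COGROUP a BY $2, b BY $1;
analyzed = FOREACH group_c GENERATE udfs.analyze(group, $1, $2);
...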
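As a rough illustration of what a co-group produces, the Python sketch below builds one nested record per key, holding the matching rows from each input. The function `cogroup`, the field positions, and the sample post/comment rows are all hypothetical; they only mirror the shape of the Pig statement, not its execution.

```python
from collections import defaultdict


def cogroup(left, left_key, right, right_key):
    """Group two lists of tuples by key; each key maps to (left_bag, right_bag)."""
    groups = defaultdict(lambda: ([], []))
    for row in left:
        groups[row[left_key]][0].append(row)
    for row in right:
        groups[row[right_key]][1].append(row)
    return dict(groups)


# Hypothetical inputs: the key sits in field 2 of the first set and field 1 of the second
posts = [("p1", "hello", "u1"), ("p2", "world", "u2")]
comments = [("c1", "u1", "nice"), ("c2", "u1", "thanks")]
nested = cogroup(posts, 2, comments, 1)
print(nested)
```

Unlike a join, a key with no match on one side still appears, with an empty bag for that side.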
-- HSQL
SELECT * FROM table ORDER BY col DESC;
-- Pig
b = ORDER a BY col DESC;
-- HSQL
SELECT * FROM table ORDER BY RAND();
-- Pig
b = GROUP a BY RANDOM(); -- group on a random key
c = FOREACH b GENERATE FLATTEN(a); -- flatten the groups back into records
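Both snippets shuffle by attaching a random key to each record and letting the ordering of those keys decide the output order. A minimal Python sketch of that idea is shown here; the function name `shuffle_by_random_key` and the `seed` parameter are hypothetical additions for illustration.

```python
import random


def shuffle_by_random_key(rows, seed=None):
    """Return the rows reordered by a random key attached to each one."""
    rng = random.Random(seed)
    keyed = [(rng.random(), row) for row in rows]
    # Sorting on the random key alone plays the role of ORDER BY RAND().
    keyed.sort(key=lambda pair: pair[0])
    return [row for _, row in keyed]


rows = list(range(10))
shuffled = shuffle_by_random_key(rows, seed=1)
print(shuffled)
```

The output contains exactly the input records; only their order changes.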
The join pattern is a way of combining data from multiple sources. The main kinds are:
-- HSQL
SELECT table_name1.column_name FROM table_name1 LEFT JOIN table_name2 ON table_name1.column_name = table_name2.column_name;
-- Pig
C = JOIN A BY a1 LEFT OUTER, B BY b1; -- left outer join; RIGHT OUTER and FULL OUTER also work
C = JOIN A BY a1, B BY b1; -- inner join
-- Pig
-- Only inner joins and left outer joins support this replicated-join optimization.
-- Every data set except the first must fit in memory.
big = LOAD 'big_data' AS (b1,b2,b3);
tiny = LOAD 'tiny_data' AS (t1,t2,t3);
mini = LOAD 'mini_data' AS (m1,m2,m3);
C = JOIN big BY b1, tiny BY t1, mini BY m1 USING 'replicated';
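The memory requirement follows from how a replicated join works: the small data set is loaded into a lookup table, and the large one is streamed past it. The Python sketch below shows an inner join done that way for two inputs; the function `replicated_join` and the sample rows are hypothetical, and it is a simplified illustration rather than Pig's actual implementation.

```python
def replicated_join(big_rows, big_key, small_rows, small_key):
    """Inner-join a streamed data set against a small one held in memory."""
    # Build the in-memory lookup table from the small data set.
    lookup = {}
    for row in small_rows:
        lookup.setdefault(row[small_key], []).append(row)
    # Stream the big data set and emit one joined row per match.
    for row in big_rows:
        for match in lookup.get(row[big_key], []):
            yield row + match


# Hypothetical sample data; the join key is field 0 on both sides
big = [(1, "x"), (2, "y"), (3, "z"), (1, "w")]
tiny = [(1, "one"), (2, "two")]
joined = list(replicated_join(big, 0, tiny, 0))
print(joined)
```

Only the small side is ever fully materialized, which is why every data set except the first has to fit in memory.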
-- HSQL
SELECT * FROM a CROSS JOIN b;
-- Pig
c = CROSS a, b;