You can use Pig Latin's LOAD operator to load data into Apache Pig from a file system (HDFS or local).
A LOAD statement has two parts separated by the "=" operator. On the left-hand side is the name of the relation in which we want to store the data; on the right-hand side we define how the data is loaded. The syntax of the LOAD operator is given below.
Relation_name = LOAD 'Input file path' USING function as schema;
Where:
relation_name - the relation in which we want to store the data. There must be a space between the relation name and the "=" that follows, otherwise Pig reports an error.
Input file path - the HDFS directory where the file is stored (in MapReduce mode).
function - one of the load functions provided by Apache Pig (BinStorage, JsonLoader, PigStorage, TextLoader).
schema - the schema of the data, which can be defined as follows:
(column1 : data type, column2 : data type, column3 : data type);
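For example, a minimal sketch of loading a hypothetical comma-separated file (the path and fields here are invented for illustration):
grunt> student = LOAD 'hdfs://localhost:9000/pig_data/student.txt' USING PigStorage(',') as (id:int, name:chararray, city:chararray);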
The HDFS file we need to load:
[root@host ~]# hdfs dfs -cat hdfs://localhost:9000/sqoop/sqoop1/2018060407/part-m-00003
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/root/hadoop/hadoop-2.7.4/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/root/hive/apache-hive-2.1.1/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
600,null,2017-11-15 14:50:05.0,hunan changsha,0,91
650,null,2017-11-01 17:24:34.0,null,1,29
600,null,2017-11-15 14:50:05.0,hunan changsha,0,91
650,null,2017-11-01 17:24:34.0,null,1,29
600,null,2017-11-15 14:50:05.0,hunan changsha,0,91
650,null,2017-11-01 17:24:34.0,null,1,29
600,null,2017-11-15 14:50:05.0,hunan changsha,0,91
650,null,2017-11-01 17:24:34.0,null,1,29
Pig executes Pig Latin statements as follows:
1. Pig first validates the syntax and semantics of all statements.
2. When it encounters a DUMP or STORE command, Pig executes all of the preceding statements in order.
So some Pig commands are not executed automatically; they must be triggered by another command, and once triggered the whole chain runs to completion.
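A minimal sketch of this lazy execution (the file path is hypothetical); the first two statements are only parsed and validated, and nothing actually runs until the dump:
grunt> a = LOAD '/tmp/data.txt' USING PigStorage(',') as (id:int);   -- parsed and validated only
grunt> c = filter a by id > 10;                                      -- still nothing executes
grunt> dump c                                                        -- now the whole chain runs as a job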
Loading data
grunt> customer =LOAD 'hdfs://localhost:9000/sqoop/sqoop1/2018060407/part-m-00003' USING PigStorage(',') as ( roleid:Int,name:chararray,dateid:Datetime,addr:chararray,sex:Int,level:Int );
grunt> b =foreach customer generate roleid;   -- FOREACH transforms data based on the columns of a relation
grunt> dump b
..................................
2018-06-15 14:54:22,671 [main] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2018-06-15 14:54:22,672 [main] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
(600)
(650)
(600)
(650)
(600)
(650)
(600)
(650)
grunt> b =foreach customer generate roleid,sex;
grunt> dump b
...................
2018-06-15 15:14:20,495 [main] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
(600,0)
(650,1)
(600,0)
(650,1)
(600,0)
(650,1)
(600,0)
(650,1)
grunt> dump customer;
...........................
2018-06-15 14:59:53,355 [main] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2018-06-15 14:59:53,355 [main] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
(600,null,2017-11-15T14:50:05.000+08:00,hunan changsha,0,91)
(650,null,2017-11-01T17:24:34.000+08:00,null,1,29)
(600,null,2017-11-15T14:50:05.000+08:00,hunan changsha,0,91)
(650,null,2017-11-01T17:24:34.000+08:00,null,1,29)
(600,null,2017-11-15T14:50:05.000+08:00,hunan changsha,0,91)
(650,null,2017-11-01T17:24:34.000+08:00,null,1,29)
(600,null,2017-11-15T14:50:05.000+08:00,hunan changsha,0,91)
(650,null,2017-11-01T17:24:34.000+08:00,null,1,29)
Storing data
grunt> store customer into 'hdfs://localhost:9000/pig' USING PigStorage(',');
......................................
2018-06-15 15:23:24,887 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
2018-06-15 15:23:24,888 [main] INFO org.apache.pig.tools.pigstats.mapreduce.SimplePigStats - Script Statistics:
HadoopVersion PigVersion UserId StartedAt FinishedAt Features
2.7.4 0.17.0 root 2018-06-15 15:23:18 2018-06-15 15:23:24 UNKNOWN
Success!
Job Stats (time in seconds):
JobId Maps Reduces MaxMapTime MinMapTime AvgMapTime MedianMapTime MaxReduceTime MinReduceTime AvgReduceTime MedianReducetime Alias Feature Outputs
job_local1251771877_0005 1 0 n/a n/a n/a n/a 0 0 0 0 customer MAP_ONLY hdfs://localhost:9000/pig,
Input(s): Successfully read 8 records (28781246 bytes) from: "hdfs://localhost:9000/sqoop/sqoop1/2018060407/part-m-00003"
Output(s): Successfully stored 8 records (28779838 bytes) in: "hdfs://localhost:9000/pig"
Counters:
Total records written : 8
Total bytes written : 28779838
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0
Job DAG: job_local1251771877_0005
2018-06-15 15:23:24,888 [main] INFO org.apache.hadoop.metrics.jvm.JvmMetrics - Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
2018-06-15 15:23:24,890 [main] INFO org.apache.hadoop.metrics.jvm.JvmMetrics - Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
2018-06-15 15:23:24,890 [main] INFO org.apache.hadoop.metrics.jvm.JvmMetrics - Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
2018-06-15 15:23:24,892 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
Check HDFS:
[root@host ~]# hdfs dfs -ls -R /pig
-rw-r--r-- 1 root supergroup 0 2018-06-15 15:23 /pig/_SUCCESS
-rw-r--r-- 1 root supergroup 432 2018-06-15 15:23 /pig/part-m-00000
After running the STORE again with a dated output path ('hdfs://localhost:9000/pig/20180615'):
[root@host ~]# hdfs dfs -ls -R /pig/20180615/
-rw-r--r-- 1 root supergroup 0 2018-06-15 15:27 /pig/20180615/_SUCCESS
-rw-r--r-- 1 root supergroup 432 2018-06-15 15:27 /pig/20180615/part-m-00000
[root@host ~]# hdfs dfs -cat /pig/20180615/part-m-00000
600,null,2017-11-15T14:50:05.000+08:00,hunan changsha,0,91
650,null,2017-11-01T17:24:34.000+08:00,null,1,29
600,null,2017-11-15T14:50:05.000+08:00,hunan changsha,0,91
650,null,2017-11-01T17:24:34.000+08:00,null,1,29
600,null,2017-11-15T14:50:05.000+08:00,hunan changsha,0,91
650,null,2017-11-01T17:24:34.000+08:00,null,1,29
600,null,2017-11-15T14:50:05.000+08:00,hunan changsha,0,91
650,null,2017-11-01T17:24:34.000+08:00,null,1,29
The LOAD statement simply loads the data into the specified relation in Apache Pig. To verify that a LOAD statement executed, you must use the diagnostic operators. Pig Latin provides four different diagnostic operators:
The DUMP operator runs the Pig Latin statements and displays the results on screen; it is generally used for debugging. It has already been demonstrated above.
The DESCRIBE operator shows the schema of a relation.
grunt> describe b
b: {roleid: int,sex: int}
grunt> describe customer
customer: {roleid: int,name: chararray,dateid: datetime,addr: chararray,sex: int,level: int}
The EXPLAIN operator displays the logical, physical, and MapReduce execution plans of a relation.
grunt> explain b
2018-06-15 16:08:01,235 [main] WARN org.apache.pig.data.SchemaTupleBackend - SchemaTupleBackend has already been initialized
2018-06-15 16:08:01,236 [main] INFO org.apache.pig.newplan.logical.optimizer.LogicalPlanOptimizer - {RULES_ENABLED=[AddForEach, ColumnMapKeyPrune, ConstantCalculator, GroupByConstParallelSetter, LimitOptimizer, LoadTypeCastInserter, MergeFilter, MergeForEach, NestedLimitOptimizer, PartitionFilterOptimizer, PredicatePushdownOptimizer, PushDownForEachFlatten, PushUpFilter, SplitFilter, StreamTypeCastInserter]}
2018-06-15 16:08:01,237 [main] INFO org.apache.pig.newplan.logical.rules.ColumnPruneVisitor - Columns pruned for customer: $1, $2, $3, $5
#-----------------------------------------------
# New Logical Plan:
#-----------------------------------------------
b: (Name: LOStore Schema: roleid#453:int,sex#457:int)ColumnPrune:OutputUids=[453, 457]ColumnPrune:InputUids=[453, 457]
|
|---b: (Name: LOForEach Schema: roleid#453:int,sex#457:int)
| |
| (Name: LOGenerate[false,false] Schema: roleid#453:int,sex#457:int)
| | |
| | (Name: Cast Type: int Uid: 453)
| | |
| | |---roleid:(Name: Project Type: bytearray Uid: 453 Input: 0 Column: (*))
| | |
| | (Name: Cast Type: int Uid: 457)
| | |
| | |---sex:(Name: Project Type: bytearray Uid: 457 Input: 1 Column: (*))
| |
| |---(Name: LOInnerLoad[0] Schema: roleid#453:bytearray)
| |
| |---(Name: LOInnerLoad[1] Schema: sex#457:bytearray)
|
|---customer: (Name: LOLoad Schema: roleid#453:bytearray,sex#457:bytearray)ColumnPrune:OutputUids=[453, 457]ColumnPrune:InputUids=[453, 457]ColumnPrune:RequiredColumns=[0, 4]RequiredFields:[0, 4]
#-----------------------------------------------
# Physical Plan:
#-----------------------------------------------
b: Store(fakefile:org.apache.pig.builtin.PigStorage) - scope-98
|
|---b: New For Each(false,false)[bag] - scope-97
| |
| Cast[int] - scope-92
| |
| |---Project[bytearray][0] - scope-91
| |
| Cast[int] - scope-95
| |
| |---Project[bytearray][1] - scope-94
|
|---customer: Load(hdfs://localhost:9000/sqoop/sqoop1/2018060407/part-m-00003:PigStorage(',')) - scope-90
2018-06-15 16:08:01,239 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false
2018-06-15 16:08:01,240 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 1
2018-06-15 16:08:01,240 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 1
#--------------------------------------------------
# Map Reduce Plan
#--------------------------------------------------
MapReduce node scope-99
Map Plan
b: Store(fakefile:org.apache.pig.builtin.PigStorage) - scope-98
|
|---b: New For Each(false,false)[bag] - scope-97
| |
| Cast[int] - scope-92
| |
| |---Project[bytearray][0] - scope-91
| |
| Cast[int] - scope-95
| |
| |---Project[bytearray][1] - scope-94
|
|---customer: Load(hdfs://localhost:9000/sqoop/sqoop1/2018060407/part-m-00003:PigStorage(',')) - scope-90
--------
Global sort: false
----------------
The ILLUSTRATE operator steps through the execution of a sequence of Pig Latin statements one at a time.
grunt> illustrate b
....................................
2018-06-15 16:14:59,291 [main] INFO org.apache.pig.impl.util.SpillableMemoryManager - Selected heap (PS Old Gen) of size 699400192 to monitor. collectionUsageThreshold = 489580128, usageThreshold = 489580128
2018-06-15 16:14:59,292 [main] WARN org.apache.pig.data.SchemaTupleBackend - SchemaTupleBackend has already been initialized
2018-06-15 16:14:59,292 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapOnly$Map - Aliases being processed per job phase (AliasName[line,offset]): M: customer[1,10],customer[-1,-1],b[4,3] C: R:
-----------------------------------------------------------------------------------------------------------------------------------------
| customer | roleid:int | name:chararray | dateid:datetime | addr:chararray | sex:int | level:int |
-----------------------------------------------------------------------------------------------------------------------------------------
| | 600 | null | 2017-11-15T14:50:05.000+08:00 | hunan changsha | 0 | 91 |
-----------------------------------------------------------------------------------------------------------------------------------------
----------------------------------------
| b | roleid:int | sex:int |
----------------------------------------
| | 600 | 0 |
----------------------------------------
In Pig Latin, relation names, field names, and function names are case-sensitive; parameter names and keywords are not.
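A quick sketch of the case rules (the path and relation names are hypothetical):
grunt> A = LOAD '/tmp/data.txt' as (f1:int);
grunt> dump a    -- fails: relation names are case-sensitive, so 'a' is undefined
grunt> DUMP A    -- works: keywords such as DUMP are case-insensitive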
The GROUP operator groups the data in one or more relations; it collects records that have the same key.
grunt> dump b
....
(600,0)
(650,1)
(600,0)
(650,1)
(600,0)
(650,1)
(600,0)
(650,1)
grunt> group_id =group b by sex;
grunt> describe group_id
group_id: {group: int,b: {(roleid: int,sex: int)}}
grunt> dump group_id
...................................................
2018-06-15 16:27:20,311 [main] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
(0,{(600,0),(600,0),(600,0),(600,0)})
(1,{(650,1),(650,1),(650,1),(650,1)})
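A common next step, sketched here, is to aggregate each group with the built-in COUNT function; with the data above this should yield (0,4) and (1,4):
grunt> sex_counts = foreach group_id generate group as sex, COUNT(b) as cnt;
grunt> dump sex_counts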
You can group by multiple columns:
grunt> group_idsex =group b by (sex,roleid);
grunt> describe group_idsex
group_idsex: {group: (sex: int,roleid: int),b: {(roleid: int,sex: int)}}
grunt> dump group_idsex
((0,600),{(600,0),(600,0),(600,0),(600,0)})
((1,650),{(650,1),(650,1),(650,1),(650,1)})
You can also group all records into a single group with GROUP ALL:
grunt> group_all =group b ALL;
grunt> describe group_all;
group_all: {group: chararray,b: {(roleid: int,sex: int)}}
grunt> dump group_all;
...........
(all,{(650,1),(600,0),(650,1),(600,0),(650,1),(600,0),(650,1),(600,0)})
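GROUP ... ALL is typically used for global aggregates; for example, this sketch counts all records and should print (8) for the data above:
grunt> total = foreach group_all generate COUNT(b);
grunt> dump total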
The COGROUP operator works the same way as the GROUP operator. The only difference between the two is that GROUP is normally used with a single relation, while COGROUP is used in statements involving two or more relations.
grunt> distinctcustid =distinct b;
grunt> describe distinctcustid
distinctcustid: {roleid: int,sex: int}
grunt> dump distinctcustid
...........
(600,0)
(650,1)
grunt> cogroup1 =cogroup b by sex,distinctcustid by sex;
grunt> describe cogroup1;
cogroup1: {group: int,b: {(roleid: int,sex: int)},distinctcustid: {(roleid: int,sex: int)}}
grunt> dump cogroup1
(0,{(600,0),(600,0),(600,0),(600,0)},{(600,0)})
(1,{(650,1),(650,1),(650,1),(650,1)},{(650,1)})
grunt> cogroup2 =cogroup customer by sex,distinctcustid by sex;
grunt> describe cogroup2
cogroup2: {group: int,customer: {(roleid: int,name: chararray,dateid: datetime,addr: chararray,sex: int,level: int)},distinctcustid: {(roleid: int,sex: int)}}
grunt> dump cogroup2
............................
(0,{(600,null,2017-11-15T14:50:05.000+08:00,hunan changsha,0,91),(600,null,2017-11-15T14:50:05.000+08:00,hunan changsha,0,91),(600,null,2017-11-15T14:50:05.000+08:00,hunan changsha,0,91),(600,null,2017-11-15T14:50:05.000+08:00,hunan changsha,0,91)},{(600,0)})
(1,{(650,null,2017-11-01T17:24:34.000+08:00,null,1,29),(650,null,2017-11-01T17:24:34.000+08:00,null,1,29),(650,null,2017-11-01T17:24:34.000+08:00,null,1,29),(650,null,2017-11-01T17:24:34.000+08:00,null,1,29)},{(650,1)})
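COGROUP also accepts an INNER keyword per input; as a sketch, the following drops any group whose bag from an INNER input would be empty (with this data no bag is empty, so the output is unchanged):
grunt> cogroup_inner = cogroup customer by sex inner, distinctcustid by sex inner;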
The JOIN operator combines records from two or more relations. When performing a join, we declare one tuple (or a group of fields) from each relation as the key. When the keys match, the two tuples match; otherwise the records are dropped. Joins can be of the following types:
Self-join joins a table with itself as if it were two relations, temporarily renaming at least one of them. In Apache Pig, to perform a self-join we usually load the same data multiple times under different aliases (names).
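A minimal self-join sketch, reloading the same file under two hypothetical aliases (cust_a and cust_b):
grunt> cust_a = LOAD 'hdfs://localhost:9000/sqoop/sqoop1/2018060407/part-m-00003' USING PigStorage(',') as (roleid:int, name:chararray, dateid:datetime, addr:chararray, sex:int, level:int);
grunt> cust_b = LOAD 'hdfs://localhost:9000/sqoop/sqoop1/2018060407/part-m-00003' USING PigStorage(',') as (roleid:int, name:chararray, dateid:datetime, addr:chararray, sex:int, level:int);
grunt> selfjoin = join cust_a by roleid, cust_b by roleid;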
Inner join is the most frequently used; it is also called an equijoin. An inner join returns rows when there is a match in both tables. It creates a new relation by combining the column values of two relations (say A and B) based on the join predicate: the query compares each row of A with each row of B to find all pairs of rows that satisfy the predicate, and for each matched pair the column values of A and B are combined into a result row.
grunt> join1 =join distinctcustid by roleid,b by roleid;
grunt> describe join1;
join1: {distinctcustid::roleid: int,distinctcustid::sex: int,b::roleid: int,b::sex: int}
grunt> dump join1
......
(600,0,600,0)
(600,0,600,0)
(600,0,600,0)
(600,0,600,0)
(650,1,650,1)
(650,1,650,1)
(650,1,650,1)
(650,1,650,1)
The left outer join operation returns all rows from the left table, even when there is no match in the right relation.
grunt> joinleft =join distinctcustid by roleid left,b by roleid;
grunt> describe joinleft
joinleft: {distinctcustid::roleid: int,distinctcustid::sex: int,b::roleid: int,b::sex: int}
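With this data every left-side key has a match, so joinleft returns the same rows as the inner join above. To actually see unmatched rows, here is a sketch (aliases hypothetical) that first filters the right side; left rows without a match keep their own values and get nulls in the right-side fields:
grunt> b0 = filter b by sex == 0;
grunt> joinleft2 = join distinctcustid by roleid left, b0 by roleid;
grunt> dump joinleft2   -- the (650,1) tuple should appear with nulls in the b0 columns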
The right outer join operation returns all rows from the right table, even if there are no matches in the left table.
grunt> joinright =join distinctcustid by roleid right,b by roleid;
grunt> describe joinright
joinright: {distinctcustid::roleid: int,distinctcustid::sex: int,b::roleid: int,b::sex: int}
The full outer join operation returns rows when there is a match in either relation.
grunt> joinfull =join distinctcustid by roleid full,b by roleid;
grunt> describe joinfull
joinfull: {distinctcustid::roleid: int,distinctcustid::sex: int,b::roleid: int,b::sex: int}
We can also perform a JOIN using multiple keys; the order of the keys must be consistent on both sides.
grunt> joinbykeys =join distinctcustid by (roleid,sex),b by (roleid,sex);
grunt> describe joinbykeys
joinbykeys: {distinctcustid::roleid: int,distinctcustid::sex: int,b::roleid: int,b::sex: int}
The CROSS operator computes the cross product (Cartesian product) of two or more relations.
grunt> crosstest =cross distinctcustid,b;
grunt> describe crosstest
crosstest: {distinctcustid::roleid: int,distinctcustid::sex: int,b::roleid: int,b::sex: int}
Pig Latin's UNION operator merges the contents of two relations. To perform a UNION on two relations, their columns and domains must be identical.
grunt> customer1 =LOAD 'hdfs://localhost:9000/sqoop/sqoop1/2018060407/part-m-00002' USING PigStorage(',') as ( roleid:Int,name:chararray,dateid:Datetime,addr:chararray,sex:Int,level:Int );
grunt> customer =LOAD 'hdfs://localhost:9000/sqoop/sqoop1/2018060407/part-m-00003' USING PigStorage(',') as ( roleid:Int,name:chararray,dateid:Datetime,addr:chararray,sex:Int,level:Int );
grunt> union1 =union customer1,customer;
grunt> describe union1
union1: {roleid: int,name: chararray,dateid: datetime,addr: chararray,sex: int,level: int}
grunt> dump union1
..............................
(600,null,2017-11-15T14:50:05.000+08:00,hunan changsha,0,91)
(650,null,2017-11-01T17:24:34.000+08:00,null,1,29)
(600,null,2017-11-15T14:50:05.000+08:00,hunan changsha,0,91)
(650,null,2017-11-01T17:24:34.000+08:00,null,1,29)
(600,null,2017-11-15T14:50:05.000+08:00,hunan changsha,0,91)
(650,null,2017-11-01T17:24:34.000+08:00,null,1,29)
(600,null,2017-11-15T14:50:05.000+08:00,hunan changsha,0,91)
(650,null,2017-11-01T17:24:34.000+08:00,null,1,29)
(400,null,2017-11-15T14:49:56.000+08:00,anhui hefei,0,4)
(400,null,2017-11-15T14:49:56.000+08:00,anhui hefei,0,4)
(400,null,2017-11-15T14:49:56.000+08:00,anhui hefei,0,4)
(400,null,2017-11-15T14:49:56.000+08:00,anhui hefei,0,4)
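When the two schemas are not positionally identical, Pig also offers UNION ONSCHEMA, which matches columns by field name (both inputs must have declared schemas); a sketch:
grunt> union2 = union onschema customer1,customer;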
The SPLIT operator partitions a relation into two or more relations.
grunt> split union1 into customer1 if(sex==1),customer0 if(sex==0);
grunt> dump customer0;
(600,null,2017-11-15T14:50:05.000+08:00,hunan changsha,0,91)
(600,null,2017-11-15T14:50:05.000+08:00,hunan changsha,0,91)
(600,null,2017-11-15T14:50:05.000+08:00,hunan changsha,0,91)
(600,null,2017-11-15T14:50:05.000+08:00,hunan changsha,0,91)
(400,null,2017-11-15T14:49:56.000+08:00,anhui hefei,0,4)
(400,null,2017-11-15T14:49:56.000+08:00,anhui hefei,0,4)
(400,null,2017-11-15T14:49:56.000+08:00,anhui hefei,0,4)
(400,null,2017-11-15T14:49:56.000+08:00,anhui hefei,0,4)
grunt> dump customer1
(650,null,2017-11-01T17:24:34.000+08:00,null,1,29)
(650,null,2017-11-01T17:24:34.000+08:00,null,1,29)
(650,null,2017-11-01T17:24:34.000+08:00,null,1,29)
(650,null,2017-11-01T17:24:34.000+08:00,null,1,29)
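SPLIT also supports an OTHERWISE branch that catches tuples matching none of the conditions; a sketch with hypothetical aliases:
grunt> split union1 into males if sex==1,females if sex==0,others otherwise;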
grunt> cust1 =distinct customer0;
grunt> dump cust1
(400,null,2017-11-15T14:49:56.000+08:00,anhui hefei,0,4)
(600,null,2017-11-15T14:50:05.000+08:00,hunan changsha,0,91)
The FILTER operator selects the desired tuples from a relation based on a condition.
grunt> uniondis =distinct union1;
grunt> dump uniondis
.....
(400,null,2017-11-15T14:49:56.000+08:00,anhui hefei,0,4)
(600,null,2017-11-15T14:50:05.000+08:00,hunan changsha,0,91)
(650,null,2017-11-01T17:24:34.000+08:00,null,1,29)
grunt> filter_level =filter uniondis by (level<50);
grunt> dump filter_level
.................
(400,null,2017-11-15T14:49:56.000+08:00,anhui hefei,0,4)
(650,null,2017-11-01T17:24:34.000+08:00,null,1,29)
grunt> filter_level2 =filter uniondis by (level>=50);
grunt> dump filter_level2
........................
(600,null,2017-11-15T14:50:05.000+08:00,hunan changsha,0,91)
The DISTINCT operator removes redundant (duplicate) tuples from a relation; examples appear above.
The FOREACH operator generates specified data transformations based on the column data.
grunt> foreach1 =foreach uniondis generate(name,sex,level);
grunt> dump foreach1
...........
((null,0,4))
((null,0,91))
((null,1,29))
Note that wrapping the projection in parentheses makes GENERATE emit a single tuple-typed column, which is why each result above is wrapped in an extra pair of parentheses. Without the parentheses you get three separate fields:
grunt> foreach1 =foreach uniondis generate name,sex,level;
grunt> dump foreach1
.............
(null,0,4)
(null,0,91)
(null,1,29)
The ORDER BY operator displays the contents of a relation in sorted order based on one or more fields.
grunt> orderby1 =order uniondis by level desc;
grunt> dump orderby1
.....
(600,null,2017-11-15T14:50:05.000+08:00,hunan changsha,0,91)
(650,null,2017-11-01T17:24:34.000+08:00,null,1,29)
(400,null,2017-11-15T14:49:56.000+08:00,anhui hefei,0,4)
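ORDER BY can sort on several fields, each with its own direction; a sketch:
grunt> orderby2 =order uniondis by sex asc, level desc;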
The LIMIT operator returns a limited number of tuples from a relation.
grunt> limit1 =limit uniondis 2;
grunt> dump limit1
......
(400,null,2017-11-15T14:49:56.000+08:00,anhui hefei,0,4)
(600,null,2017-11-15T14:50:05.000+08:00,hunan changsha,0,91)
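Note that LIMIT on an unordered relation returns an arbitrary subset of tuples. For a deterministic top-N, apply LIMIT to an ordered relation, e.g. the two highest levels via the orderby1 relation above:
grunt> top2 =limit orderby1 2;
grunt> dump top2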
--------------------------------------------------------------------------------------------
grunt> B = foreach customer generate $0 as id,$4 as sex,level;
grunt> dump B
.........
(600,0,91)
(650,1,29)
(600,0,91)
(650,1,29)
(600,0,91)
(650,1,29)
(600,0,91)
(650,1,29)
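GENERATE also accepts expressions, mixing positional and named references; a sketch with a hypothetical alias that doubles level:
grunt> C = foreach customer generate $0 as id, level * 2 as double_level;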
----------------------------------------------------------------------------------------------------------
Pig allows you to transform data in many ways. As a starting point, become familiar with these operators:
Use the FILTER operator to work with tuples or rows of data. Use the FOREACH operator to work with columns of data.
Use the GROUP operator to group data in a single relation. Use the COGROUP, inner JOIN, and outer JOIN operators to group or join data in two or more relations.
Use the UNION operator to merge the contents of two or more relations. Use the SPLIT operator to partition the contents of a relation into multiple relations.
Pig Latin provides operators that can help you debug your Pig Latin statements:
Use the DUMP operator to display results to your terminal screen.
Use the DESCRIBE operator to review the schema of a relation.
Use the EXPLAIN operator to view the logical, physical, or map reduce execution plans to compute a relation.
Use the ILLUSTRATE operator to view the step-by-step execution of a series of statements.
Pig provides shortcuts for the frequently used debugging operators (DUMP, DESCRIBE, EXPLAIN, ILLUSTRATE). These shortcuts can be used in the Grunt shell or within Pig scripts. The following shortcuts are supported:
\d alias - shortcut for the DUMP operator. If the alias is omitted, the last defined alias is used.
\de alias - shortcut for the DESCRIBE operator. If the alias is omitted, the last defined alias is used.
\e alias - shortcut for the EXPLAIN operator. If the alias is omitted, the last defined alias is used.
\i alias - shortcut for the ILLUSTRATE operator. If the alias is omitted, the last defined alias is used.
\q - quit the Grunt shell.