Mahbubul Majumder, PhD
Dec 4, 2014
Hadoop Distributed File System (HDFS)
Fortunately we have tools that do not require low-level programming
Pig, Hive, Impala
Pig commands
LOAD
FILTER
GROUP
JOIN
ORDER
FOREACH
STORE
$ pig
grunt>
Write your Pig latin instructions from the grunt shell
Example: Pig Latin commands from the grunt shell
grunt> myDat = LOAD 'myDat.txt' AS (pclass:chararray, survived:int);
grunt> describe myDat;
Quit the grunt shell
grunt> quit;
grunt> clear
grunt> pwd
hdfs://quickstart.cloudera:8020/user/cloudera
grunt> cat output/part-r-00000
1st 200
2nd 119
3rd 181
grunt> copyToLocal output/part-r-00000 output.txt
grunt> mkdir pigdata
grunt> cd pigdata
grunt> pwd
hdfs://quickstart.cloudera:8020/user/cloudera/pigdata
grunt> ls
hdfs://quickstart.cloudera:8020/user/cloudera/pigdata/pigdata <dir>
grunt> rm pigdata
Create the file myDat.txt and put it in HDFS
hadoop fs -put myDat.txt hdfs/myDat.txt
grunt> cat hdfs/myDat.txt
1st 0
1st 0
1st 1
1st 1
2nd 0
2nd 0
2nd 1
3rd 0
3rd 1
grunt> dat = LOAD 'hdfs/myDat.txt';
grunt> dat = LOAD 'hdfs/myDat.csv' USING PigStorage(',');
grunt> dat = LOAD 'hdfs/myDat.txt' AS
(pclass:chararray, survived:int);
grunt> STORE dat INTO 'hdfs/output.txt';
grunt> DESCRIBE dat;
dat: {pclass: chararray,survived: int}
grunt> DUMP dat;
(1st,0)
(1st,0)
(1st,1)
(1st,1)
(2nd,0)
(2nd,0)
(2nd,1)
(3rd,0)
(3rd,1)
grunt> d = LIMIT dat 2;
grunt> DUMP d;
Field: a single piece of data, e.g. John
Tuple: an ordered set of fields, e.g. (John, 20)
Bag: a collection of tuples, e.g.
(John, 20) (student, M)
(King, 30) (student, M)
(Sila, 10) (student, F)
A = LOAD 'data' AS
(t1:tuple(
t1a:chararray, t1b:int),
t2:tuple(
t2a:chararray,
t2b:chararray));
d = FOREACH dat GENERATE pclass;
DUMP d;
d = FOREACH dat GENERATE $1;
DUMP d;
d = FILTER dat BY (pclass == '3rd') AND (survived > 0);
DUMP d;
(3rd,1)
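The FILTER statement keeps only the rows that satisfy the predicate. As a rough sketch of the same selection logic in Python (using the toy rows from hdfs/myDat.txt shown earlier):

```python
# Rows mirroring hdfs/myDat.txt: (pclass, survived)
rows = [("1st", 0), ("1st", 0), ("1st", 1), ("1st", 1),
        ("2nd", 0), ("2nd", 0), ("2nd", 1),
        ("3rd", 0), ("3rd", 1)]

# Equivalent of: FILTER dat BY (pclass == '3rd') AND (survived > 0);
filtered = [r for r in rows if r[0] == "3rd" and r[1] > 0]
print(filtered)  # [('3rd', 1)]
```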
Pig provides some built-in functions to aggregate the data
You can also define your own functions (UDFs)
A Pig script can be run from the Linux shell
groups = GROUP dat BY pclass;
result = FOREACH groups GENERATE group, SUM(dat.survived) AS t;
DUMP result;
(1st,2)
(2nd,1)
(3rd,1)
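GROUP collects all rows sharing a key into a bag, and SUM then aggregates within each bag. The same group-then-sum logic, sketched in plain Python for intuition:

```python
from collections import defaultdict

# Rows mirroring hdfs/myDat.txt: (pclass, survived)
rows = [("1st", 0), ("1st", 0), ("1st", 1), ("1st", 1),
        ("2nd", 0), ("2nd", 0), ("2nd", 1),
        ("3rd", 0), ("3rd", 1)]

# Equivalent of: GROUP dat BY pclass; ... GENERATE group, SUM(dat.survived)
totals = defaultdict(int)
for pclass, survived in rows:
    totals[pclass] += survived

print(sorted(totals.items()))  # [('1st', 2), ('2nd', 1), ('3rd', 1)]
```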
dat = LOAD 'hdfs/age.txt' AS (pclass:chararray, age:double);
gdat = GROUP dat BY pclass;
topage = FOREACH gdat {
sorted = ORDER dat BY age DESC;
eldest = LIMIT sorted 3;
GENERATE group, eldest;
};
DUMP topage;
(1st,{(1st,35.5),(1st,35.0),(1st,20.2)})
(2nd,{(2nd,55.5),(2nd,45.5),(2nd,40.0)})
(3rd,{(3rd,65.2),(3rd,45.5),(3rd,43.3)})
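The nested FOREACH above sorts each group's bag and keeps the top three rows. A Python sketch of the same per-group sort-and-limit, using made-up ages consistent with the DUMP output above (age.txt itself is not shown in the slides):

```python
from collections import defaultdict

# Hypothetical rows standing in for hdfs/age.txt: (pclass, age)
rows = [("1st", 35.5), ("1st", 20.2), ("1st", 35.0), ("1st", 10.0),
        ("2nd", 55.5), ("2nd", 40.0), ("2nd", 45.5),
        ("3rd", 65.2), ("3rd", 43.3), ("3rd", 45.5), ("3rd", 12.1)]

groups = defaultdict(list)
for pclass, age in rows:
    groups[pclass].append(age)

# Inside each group: ORDER ... BY age DESC; LIMIT sorted 3;
topage = {g: sorted(ages, reverse=True)[:3] for g, ages in groups.items()}
print(topage["1st"])  # [35.5, 35.0, 20.2]
```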
joined = JOIN dat1 BY (field1, field2), dat2 BY (field1, field3);
joined = JOIN dat1 BY field1 RIGHT OUTER, dat2 BY field3;
UNION stacks two relations row-wise, like rbind in R
cats = UNION dat1, dat2;
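JOIN matches rows across relations on the join keys, while UNION simply stacks rows. A minimal Python sketch of both ideas (dat1, dat2 and their fields here are illustrative, not from the slides):

```python
# Toy relations: first element plays the role of the join field
dat1 = [("a", 1), ("b", 2)]
dat2 = [("a", "x"), ("c", "y")]

# Inner JOIN on the first field: keep pairs whose keys match
joined = [(k1, v1, v2) for (k1, v1) in dat1
          for (k2, v2) in dat2 if k1 == k2]
print(joined)  # [('a', 1, 'x')]

# UNION stacks the relations, like rbind in R
cats = dat1 + dat2
print(len(cats))  # 4
```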
Demonstration with the Titanic data
Apache Hadoop web site
http://hadoop.apache.org/
Apache Hadoop shell command guide
http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/FileSystemShell.html
Pig Latin Basics
http://pig.apache.org/docs/r0.14.0/basic.html