Mahbubul Majumder, PhD
Dec 4, 2014
Hadoop Distributed File System (HDFS)
Fortunately we have tools that do not require low-level programming
Pig, Hive, Impala
Pig operators (combined in the sketch after this list):
LOAD
FILTER
GROUP
JOIN
ORDER
FOREACH
STORE
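Together these operators cover a typical analysis pipeline. A minimal sketch, assuming the two-column myDat.txt file used later in these notes (tab-separated pclass and survived):

-- load, filter, aggregate, sort, and save survivor counts per class
myDat = LOAD 'myDat.txt' AS (pclass:chararray, survived:int);
surv  = FILTER myDat BY survived == 1;
grp   = GROUP surv BY pclass;
cnt   = FOREACH grp GENERATE group, COUNT(surv) AS n;
srt   = ORDER cnt BY n DESC;
STORE srt INTO 'survivor_counts';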
$ pig
grunt>
Write your Pig Latin instructions from the grunt shell
Example: Pig Latin command from the grunt shell
grunt> myDat = LOAD 'myDat.txt' AS (pclass:chararray, survived:int);
grunt> describe myDat;
To quit the grunt shell:
grunt> quit;
grunt> clear
grunt> pwd
hdfs://quickstart.cloudera:8020/user/cloudera
grunt> cat output/part-r-00000
1st 200
2nd 119
3rd 181
grunt> copyToLocal output/part-r-00000 output.txt
grunt> mkdir pigdata
grunt> cd pigdata
grunt> pwd
hdfs://quickstart.cloudera:8020/user/cloudera/pigdata
grunt> ls
hdfs://quickstart.cloudera:8020/user/cloudera/pigdata/pigdata <dir>
grunt> rm pigdata
Create the data file myDat.txt and put it in HDFS:
hadoop fs -put myDat.txt hdfs/myDat.txt
grunt> cat hdfs/myDat.txt
1st 0
1st 0
1st 1
1st 1
2nd 0
2nd 0
2nd 1
3rd 0
3rd 1
grunt> dat = LOAD 'hdfs/myDat.txt';
grunt> dat = LOAD 'hdfs/myDat.csv' USING PigStorage(',');
grunt> dat = LOAD 'hdfs/myDat.txt' AS
(pclass:chararray, survived:int);
grunt> STORE dat INTO 'hdfs/output.txt';
grunt> DESCRIBE dat;
dat: {pclass: chararray,survived: int}
grunt> DUMP dat;
(1st,0)
(1st,0)
(1st,1)
(1st,1)
(2nd,0)
(2nd,0)
(2nd,1)
(3rd,0)
(3rd,1)
grunt> d = LIMIT dat 2;
grunt> DUMP d;
Pig data model:
Field: a single piece of data, e.g. John
Tuple: an ordered set of fields, e.g. (John, 20)
Bag: a collection of tuples, e.g. {(John, 20), (King, 30)}
Example: the file 'data' contains two tuples per line:
(John, 20) (student, M)
(King, 30) (student, M)
(Sila, 10) (student, F)
Load it with a nested tuple schema:
A = LOAD 'data' AS
  (t1:tuple(t1a:chararray, t1b:int),
   t2:tuple(t2a:chararray, t2b:chararray));
d = FOREACH dat GENERATE pclass;
DUMP d;
d = FOREACH dat GENERATE $1;
DUMP d;
d = FILTER dat BY (pclass == '3rd') AND (survived > 0);
DUMP d;
(3rd,1)
Pig provides some built-in functions to aggregate the data
You can also define your own functions (UDFs)
A Pig script can be run from the Linux shell; both are sketched after the example below
groups = GROUP dat BY pclass;
result = FOREACH groups GENERATE group, SUM(dat.survived) AS t;
DUMP result;
(1st,2)
(2nd,1)
(3rd,1)
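To run the same statements as a batch job, save them in a script file and pass it to pig from the Linux shell (the script name here is hypothetical):

$ pig survived.pig

User-defined functions are registered before use; for example, Pig supports Python UDFs via Jython (myfuncs.py and its function are hypothetical names):

REGISTER 'myfuncs.py' USING jython AS myfuncs;
d = FOREACH dat GENERATE myfuncs.upperclass(pclass);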
dat = LOAD 'hdfs/age.txt' AS (pclass:chararray, age:double);
gdat = GROUP dat BY pclass;
topage = FOREACH gdat {
sorted = ORDER dat BY age DESC;
eldest = LIMIT sorted 3;
GENERATE group, eldest;
};
DUMP topage;
(1st,{(1st,35.5),(1st,35.0),(1st,20.2)})
(2nd,{(2nd,55.5),(2nd,45.5),(2nd,40.0)})
(3rd,{(3rd,65.2),(3rd,45.5),(3rd,43.3)})
joined = JOIN dat1 BY (field1, field2), dat2 BY (field1, field3);
joined = JOIN dat1 BY field1 RIGHT OUTER, dat2 BY field3;
UNION stacks the rows of two relations (like rbind in R):
cats = UNION dat1, dat2;
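As a concrete sketch using the two files already loaded in these notes, joining the class/survival data with the class/age data on pclass:

dat1 = LOAD 'hdfs/myDat.txt' AS (pclass:chararray, survived:int);
dat2 = LOAD 'hdfs/age.txt' AS (pclass:chararray, age:double);
joined = JOIN dat1 BY pclass, dat2 BY pclass;
DESCRIBE joined;
joined: {dat1::pclass: chararray,dat1::survived: int,dat2::pclass: chararray,dat2::age: double}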
Demonstration with Titanic data
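The live demo is not reproduced here; a minimal sketch of the kind of summary it computes, assuming a comma-separated titanic.csv with pclass, survived, and age columns (hypothetical file name):

titanic = LOAD 'titanic.csv' USING PigStorage(',') AS (pclass:chararray, survived:int, age:double);
g = GROUP titanic BY pclass;
summary = FOREACH g GENERATE group, COUNT(titanic) AS n, SUM(titanic.survived) AS nsurv, AVG(titanic.age) AS meanage;
DUMP summary;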
Apache Hadoop web site
http://hadoop.apache.org/
Apache Hadoop shell command guide
http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/FileSystemShell.html
Pig Latin Basics
http://pig.apache.org/docs/r0.14.0/basic.html