Manipulation of data under the Hadoop Distributed File System

Mahbubul Majumder, PhD
Dec 4, 2014

Working with HDFS

  • Hadoop Distributed File System (HDFS)

    • allows accessing and manipulating massive amounts of data in a parallel computing environment
    • one common mechanism is to write MapReduce jobs
    • the most common language is Java, though other languages, even R, can be used
    • this requires low-level programming and an understanding of how the process works
  • Fortunately we have tools that do not require low-level programming

    • examples include Pig, Hive, and Impala
    • these allow writing much higher-level data manipulation instructions, like the ones we have learned so far
    • today we are going to talk about Pig
    • we will show a demonstration writing some Pig code

Pig

  • As we discussed earlier, Pig works with HDFS using the MapReduce framework
    • you don't have to worry about writing a low-level MapReduce program
    • the textual language is called Pig Latin
    • it follows the grammar of data manipulation that we have learned
    • common verbs
LOAD
FILTER
GROUP
JOIN
ORDER
FOREACH
STORE
  • Pig will process your code, converting it into a sequence of MapReduce programs that optimize the task and utilize the parallel environment and the Hadoop Distributed File System
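  • For example, a complete Pig Latin script typically chains several of these verbs; the following is a minimal sketch (the file myData.txt and its columns are illustrative, not part of these notes)

-- count survivors per passenger class and store the sorted result
dat = LOAD 'myData.txt' AS (pclass:chararray, survived:int);
alive = FILTER dat BY survived == 1;
groups = GROUP alive BY pclass;
counts = FOREACH groups GENERATE group, COUNT(alive) AS n;
sorted = ORDER counts BY n DESC;
STORE sorted INTO 'survivor_counts';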

Launching and exiting Pig

  • To start Pig from the Linux shell, just use
$ pig
  • This will take you to the grunt shell, as below
grunt>
  • Write your Pig Latin instructions from the grunt shell

  • Example: Pig Latin commands from the grunt shell

grunt> myDat = LOAD 'myDat.txt' AS (pclass:chararray, survived:int);
grunt> describe myDat;
  • To exit from the grunt shell
grunt> quit;

Linux commands from the grunt shell

  • Pig allows certain Linux commands from the grunt shell to explore files in HDFS
grunt> clear
grunt> pwd
hdfs://quickstart.cloudera:8020/user/cloudera

grunt> cat output/part-r-00000 
1st  200
2nd  119
3rd  181
grunt> copyToLocal output/part-r-00000 output.txt

grunt> mkdir pigdata
grunt> cd pigdata
grunt> pwd
hdfs://quickstart.cloudera:8020/user/cloudera/pigdata

grunt> ls
hdfs://quickstart.cloudera:8020/user/cloudera/pigdata/pigdata  <dir>

grunt> rm pigdata
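
  • Grunt can also pass commands through to the HDFS shell via fs; for example, to list the home directory (a standard grunt feature, not shown in the session above)
grunt> fs -ls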

Viewing files in HDFS

  • We create a tab-separated data file myDat.txt and put it in HDFS
hadoop fs -put myDat.txt hdfs/myDat.txt
  • Pig allows viewing that file from the grunt shell
grunt> cat hdfs/myDat.txt
1st  0
1st  0
1st  1
1st  1
2nd  0
2nd  0
2nd  1
3rd  0
3rd  1

Loading and saving data

  • To load a data file in HDFS from a folder called hdfs
grunt> dat = LOAD 'hdfs/myDat.txt';
  • By default Pig loads tab-separated files. If you want to load a CSV file, use
grunt> dat = LOAD 'hdfs/myDat.csv' USING  PigStorage(',');
  • To specify the column names while reading the data
grunt> dat = LOAD 'hdfs/myDat.txt' AS 
             (pclass:chararray, survived:int);
  • To save the data into HDFS; by default the file will be saved as a tab-separated file
grunt> STORE dat INTO 'hdfs/output.txt';
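  • A storage function works with STORE as well; a minimal sketch (the output folder name is illustrative) saving the same relation comma-separated
grunt> STORE dat INTO 'hdfs/output_csv' USING PigStorage(',');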

Exploring data

  • Obtain the description of the data
grunt> DESCRIBE dat;
dat: {pclass: chararray,survived: int}
  • Display data on the screen
grunt> DUMP dat;
(1st,0)
(1st,0)
(1st,1)
(1st,1)
(2nd,0)
(2nd,0)
(2nd,1)
(3rd,0)
(3rd,1)
  • You may not want to display the whole file! Display a subset of the data instead.
grunt> d = LIMIT dat 2;
grunt> DUMP d;

Pig data types

  • Field

    • a single element of data
    • think of a cell of a data table
  • Tuple

    • a collection of fields
    • think of a row of a data table
  • Bag

    • a collection of tuples
    • think of a data frame
    • a bag may contain more than one tuple in a row
  • A field example
John
  • A tuple example
(John, 20)
  • A bag example, where each row contains two tuples
(John, 20) (student, M)
(King, 30) (student, M)
(Sila, 10) (student, F)
  • A schema to load such a bag of paired tuples
A = LOAD 'data' AS 
    (t1:tuple(
      t1a:chararray, t1b:int),
     t2:tuple(
       t2a:chararray,
       t2b:chararray));
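
  • Fields inside these tuples can then be selected with the dot operator; a small sketch using the schema above
B = FOREACH A GENERATE t1.t1a, t2.t2b;
DUMP B;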
    

Selecting and filtering data

  • To select a specific column of the data
d = FOREACH dat GENERATE pclass;
DUMP d;

  • You can specify a column by number; the following code will display the second column
d = FOREACH dat GENERATE $1;
DUMP d;

  • To filter data by a column value
d = FILTER dat BY (pclass == '3rd') AND (survived > 0);
DUMP d;

(3rd,1)
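
  • These verbs compose naturally; for example, a short sketch that keeps only the survivors and then projects the class column
s = FILTER dat BY survived == 1;
p = FOREACH s GENERATE pclass;
DUMP p;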
    

Built-in functions to aggregate

  • Pig provides some built-in functions to aggregate the data

    • SUM
    • AVG
    • MIN
    • MAX
    • COUNT
  • You can define your own functions

  • A Pig script can also be run from the Linux shell

    • write your Pig scripts and functions and save the file as, say, pscript.pig
    • run it using the following command
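$ pig pscript.pig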

Grouping and summarising

  • Group the data we loaded earlier by pclass
groups = GROUP dat BY pclass;

  • Now obtain the groupwise summary and display the result on screen
result = FOREACH groups GENERATE group, SUM(dat.survived) AS t;

DUMP result;
(1st,2)
(2nd,1)
(3rd,1)
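
  • The other aggregates work the same way; for instance, since survived is coded 0/1, AVG gives the survival rate per class
rates = FOREACH groups GENERATE group, AVG(dat.survived) AS rate;
DUMP rates;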
    

Nested operation

  • FOREACH is used to select columns; it can also contain other operations nested inside it
dat = LOAD 'hdfs/age.txt' AS (pclass:chararray, age:double);
gdat = GROUP dat BY pclass;

topage = FOREACH gdat {
  sorted = ORDER dat BY age DESC;
  eldest = LIMIT sorted 3;
  GENERATE group, eldest;
};

DUMP topage;

(1st,{(1st,35.5),(1st,35.0),(1st,20.2)})
(2nd,{(2nd,55.5),(2nd,45.5),(2nd,40.0)})
(3rd,{(3rd,65.2),(3rd,45.5),(3rd,43.3)})
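
  • Other operators such as FILTER can be nested the same way; a sketch (the 18.0 cutoff is purely illustrative) counting adult passengers in each class
adults = FOREACH gdat {
  older = FILTER dat BY age >= 18.0;
  GENERATE group, COUNT(older) AS n_adults;
};
DUMP adults;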
    
    

Joining and concatenating data

  • Pig joins data tables as an inner join by default
joined = JOIN dat1 BY (field1, field2), dat2 BY (field1, field3);

  • The result below contains all records from the relation specified on the right, but only matching records from the one specified on the left
joined = JOIN dat1 BY field1 RIGHT OUTER, dat2 BY field3;

  • To concatenate two data tables use the following command, which works like rbind in R
cats = UNION dat1, dat2;
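
  • Note that, unlike SQL's UNION, Pig's UNION keeps duplicate rows; follow it with DISTINCT if duplicates should be dropped
u = DISTINCT cats;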
    

Case study: Titanic data

Demonstration with Titanic data

Reading assignment and references