Manipulation of data under the Hadoop Distributed File System

Mahbubul Majumder, PhD
Dec 4, 2014

Working with HDFS

  • Hadoop Distributed File System (HDFS)

    • allows accessing and manipulating massive amounts of data in a parallel computing environment
    • one common mechanism is to write MapReduce jobs
    • the most common language is Java, though other languages, even R, can be used
    • this requires low-level programming and an understanding of how the process works
  • Fortunately we have tools that do not require low-level programming

    • examples include Pig, Hive, and Impala
    • these allow writing much higher-level data manipulation instructions, like the ones we have learned so far
    • today we are going to talk about Pig
    • we will show a demonstration writing some Pig code

Pig

  • As we discussed earlier, Pig works with HDFS using the MapReduce framework
    • you don't have to worry about writing a low-level MapReduce program
    • the textual language is called Pig Latin
    • it follows the grammar of data manipulation that we have learned
    • common verbs
LOAD
FILTER
GROUP
JOIN
ORDER
FOREACH
STORE
  • Pig will process your code, converting it into a sequence of MapReduce programs that optimize the task and utilize the parallel environment and the Hadoop Distributed File System
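  • For example, a complete Pig Latin script typically chains several of these verbs; the following is a minimal sketch (the file myData.txt and its columns are illustrative, not part of these notes)

-- count survivors per passenger class and store the sorted result
dat = LOAD 'myData.txt' AS (pclass:chararray, survived:int);
alive = FILTER dat BY survived == 1;
groups = GROUP alive BY pclass;
counts = FOREACH groups GENERATE group, COUNT(alive) AS n;
sorted = ORDER counts BY n DESC;
STORE sorted INTO 'survivor_counts';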

Launching and exiting Pig

  • To start Pig from the Linux shell, just use
$ pig
  • This will take you to the grunt shell, as below
grunt>
  • Write your Pig Latin instructions from the grunt shell

  • Example: Pig Latin commands from the grunt shell

grunt> myDat = LOAD 'myDat.txt' AS (pclass:chararray, survived:int);
grunt> describe myDat;
  • To exit from the grunt shell
grunt> quit;

Linux commands from the grunt shell

  • Pig allows certain Linux commands from the grunt shell to explore files in HDFS
grunt> clear
grunt> pwd
hdfs://quickstart.cloudera:8020/user/cloudera

grunt> cat output/part-r-00000 
1st  200
2nd  119
3rd  181
grunt> copyToLocal output/part-r-00000 output.txt

grunt> mkdir pigdata
grunt> cd pigdata
grunt> pwd
hdfs://quickstart.cloudera:8020/user/cloudera/pigdata

grunt> ls
hdfs://quickstart.cloudera:8020/user/cloudera/pigdata/pigdata  <dir>

grunt> rm pigdata
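
  • Grunt can also pass commands through to the HDFS shell via fs; for example, to list the home directory (a standard grunt feature, not shown in the session above)
grunt> fs -ls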

Viewing files in HDFS

  • We create a tab-separated data file myDat.txt and put it in HDFS
hadoop fs -put myDat.txt hdfs/myDat.txt
  • Pig allows viewing that file from the grunt shell
grunt> cat hdfs/myDat.txt
1st  0
1st  0
1st  1
1st  1
2nd  0
2nd  0
2nd  1
3rd  0
3rd  1

Loading and saving data

  • To load a data file in HDFS from a folder called hdfs
grunt> dat = LOAD 'hdfs/myDat.txt';
  • By default Pig loads tab-separated files. If you want to load a CSV file, use
grunt> dat = LOAD 'hdfs/myDat.csv' USING  PigStorage(',');
  • To specify the column names while reading the data
grunt> dat = LOAD 'hdfs/myDat.txt' AS 
             (pclass:chararray, survived:int);
  • To save the data into HDFS; by default the file will be saved as a tab-separated file
grunt> STORE dat INTO 'hdfs/output.txt';
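  • A storage function works with STORE as well; a minimal sketch (the output folder name is illustrative) saving the same relation comma-separated
grunt> STORE dat INTO 'hdfs/output_csv' USING PigStorage(',');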

Exploring data

  • Obtain the description of the data
grunt> DESCRIBE dat;
dat: {pclass: chararray,survived: int}
  • Display data on the screen
grunt> DUMP dat;
(1st,0)
(1st,0)
(1st,1)
(1st,1)
(2nd,0)
(2nd,0)
(2nd,1)
(3rd,0)
(3rd,1)
  • You may not want to display the whole file! Display a subset of the data instead.
grunt> d = LIMIT dat 2;
grunt> DUMP d;

Pig data types

  • Field

    • a single element of data
    • think of a cell of a data table
  • Tuple

    • a collection of fields
    • think of a row of a data table
  • Bag

    • a collection of tuples
    • think of a data frame
    • a bag may contain more than one tuple in a row
  • A field example
John
  • A tuple example
(John, 20)
  • A bag example, where each row contains two tuples
(John, 20) (student, M)
(King, 30) (student, M)
(Sila, 10) (student, F)
  • A schema to load such a bag of paired tuples
A = LOAD 'data' AS 
    (t1:tuple(
      t1a:chararray, t1b:int),
     t2:tuple(
       t2a:chararray,
       t2b:chararray));
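
  • Fields inside these tuples can then be selected with the dot operator; a small sketch using the schema above
B = FOREACH A GENERATE t1.t1a, t2.t2b;
DUMP B;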
    

Selecting and filtering data

  • To select a specific column of the data
d = FOREACH dat GENERATE pclass;
DUMP d;

  • You can specify a column by number; the following code will display the second column
d = FOREACH dat GENERATE $1;
DUMP d;

  • To filter data by a column value
d = FILTER dat BY (pclass == '3rd') AND (survived > 0);
DUMP d;

(3rd,1)
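
  • These verbs compose naturally; for example, a short sketch that keeps only the survivors and then projects the class column
s = FILTER dat BY survived == 1;
p = FOREACH s GENERATE pclass;
DUMP p;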
    

Built-in functions to aggregate

  • Pig provides some built-in functions to aggregate the data

    • SUM
    • AVG
    • MIN
    • MAX
    • COUNT
  • You can define your own functions

  • A Pig script can also be run from the Linux shell

    • write your Pig scripts and functions and save the file as, say, pscript.pig
    • run it using the following command
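$ pig pscript.pig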

Grouping and summarising

  • Group the data we loaded earlier by pclass
groups = GROUP dat BY pclass;

  • Now obtain the groupwise summary and display the result on screen
result = FOREACH groups GENERATE group, SUM(dat.survived) AS t;

DUMP result;
(1st,2)
(2nd,1)
(3rd,1)
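
  • The other aggregates work the same way; for instance, since survived is coded 0/1, AVG gives the survival rate per class
rates = FOREACH groups GENERATE group, AVG(dat.survived) AS rate;
DUMP rates;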
    

Nested operation

  • FOREACH is used to select columns; it can also contain other operations nested inside it
dat = LOAD 'hdfs/age.txt' AS (pclass:chararray, age:double);
gdat = GROUP dat BY pclass;

topage = FOREACH gdat {
  sorted = ORDER dat BY age DESC;
  eldest = LIMIT sorted 3;
  GENERATE group, eldest;
};

DUMP topage;

(1st,{(1st,35.5),(1st,35.0),(1st,20.2)})
(2nd,{(2nd,55.5),(2nd,45.5),(2nd,40.0)})
(3rd,{(3rd,65.2),(3rd,45.5),(3rd,43.3)})
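
  • Other operators such as FILTER can be nested the same way; a sketch (the 18.0 cutoff is purely illustrative) counting adult passengers in each class
adults = FOREACH gdat {
  older = FILTER dat BY age >= 18.0;
  GENERATE group, COUNT(older) AS n_adults;
};
DUMP adults;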
    
    

Joining and concatenating data

  • Pig joins data tables as an inner join by default
joined = JOIN dat1 BY (field1, field2), dat2 BY (field1, field3);

  • The result below contains all records from the relation specified on the right, but only matching records from the one specified on the left
joined = JOIN dat1 BY field1 RIGHT OUTER, dat2 BY field3;

  • To concatenate two data tables use the following command, which works like rbind in R
cats = UNION dat1, dat2;
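
  • Note that, unlike SQL's UNION, Pig's UNION keeps duplicate rows; follow it with DISTINCT if duplicates should be dropped
u = DISTINCT cats;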
    

Case study: Titanic data

Demonstration with Titanic data

Reading assignment and references