Hadoop and MapReduce: a demonstration of wordcount

Mahbubul Majumder, PhD
Dec 2, 2014

Hadoop commands

  • Hadoop commands are usually written as
    • hadoop fs -<linux command>
    • not all Linux commands are supported, though
    • common commands are ls, mkdir, and rm. Visit our lecture listing some Linux commands.
    • for more Hadoop shell commands, view the guide
    • examples:
      • to list the Hadoop files (HDFS), use the following command
hadoop fs -ls
      • to create a directory called myHdfs, use
hadoop fs -mkdir myHdfs

Copy data to/from HDFS

  • The syntax for copying files to/from the Hadoop Distributed File System (HDFS) is
    hadoop fs -put <linux source> <hdfs destination>
    hadoop fs -get <hdfs source> <linux destination>

  • Example: you can copy a file from the Linux system to the newly created HDFS directory using the command

hadoop fs -put myLinuxFile.txt myHdfs/fileName.txt
  • Example: to copy a file from an HDFS directory to the Linux system, use the command
hadoop fs -get myHdfs/fileName.txt myLinuxFile.txt
  • By default, the commands above use Linux files in the current working directory.
    • to use a specific Linux directory, specify its path
    • the same holds for HDFS: specify the HDFS path to copy a file from there
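The put/get pair behaves like a plain copy between the two file systems. As a rough local sketch (plain shell, no Hadoop; the directory name myHdfs_local is a made-up stand-in for the HDFS side), the round trip looks like this:

```shell
# Local analogue of the HDFS put/get round trip; no Hadoop required.
# myHdfs_local is a hypothetical stand-in for the HDFS directory.
mkdir -p myHdfs_local
echo "some sample data" > myLinuxFile.txt
cp myLinuxFile.txt myHdfs_local/fileName.txt   # analogue of: hadoop fs -put
cp myHdfs_local/fileName.txt copiedBack.txt    # analogue of: hadoop fs -get
cmp myLinuxFile.txt copiedBack.txt             # files are identical after the round trip
```

The real commands differ in that -put and -get move data between the local disk and the distributed file system, but the source/destination argument order is the same as for cp.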

High Performance Computing at PKI

  • We have high performance computing facilities at PKI. View more details.

    • several clusters: crane, tusker, sandhill and red
    • you can request as much as 500GB of RAM
    • crane has 452 nodes (64GB RAM per node)
  • For our class we have a small 10-node cluster set up at PKI

    • let us run Hadoop from there
      • to log in, use
        ssh <username>@crane.unl.edu, which requires two-factor authentication, and then
        ssh <username>@10.138.11.29

Hadoop demo to be shown at PKI

Wordcount: a case study

  • We want to count the occurrences of each word in one or more files

    • write a Java MapReduce program
    • test it in a virtual machine before running it on the live system
  • We will use a prebuilt virtual machine provided by Cloudera

    • download the QuickStart virtual machine from Cloudera
      • if you download the VMware version, you should be able to run it using VMware Player
      • the current version is CDH 5.2.x, which includes R, MySQL, Hadoop, and Pig preinstalled
      • the download is about 3GB, and the machine requires at least 4GB of RAM (more is better)
      • it is a single-node machine running Linux (CentOS). View more details
      • user: cloudera and password: cloudera for all applications
  • Run your virtual machine using VMware Player. You are now ready to test your MapReduce code.

Demonstration of wordcount problem

  • First, create the HDFS directory wordcount/input
hadoop fs -mkdir wordcount
hadoop fs -mkdir wordcount/input
  • Now create some files; we will count the number of words in them
echo "Data Science Class and Data Science HW" > file0
echo "Nice Class Nice HW Nice Project and Nice program" > file1
  • Notice that the files are created in the Linux file system. We need to copy them to HDFS
hadoop fs -put file* wordcount/input
  • Here we used file* to match both files, file0 and file1. Make sure the files were copied properly. To check that, use
hadoop fs -ls wordcount/input

Demonstration of wordcount problem

  • Now create a directory in the Linux system where we will build the Java archive
mkdir wordcount_classes
  • Get the Java source code from here. Save it as WordCount.java in your current directory.
  • Build the Java archive (jar) after compiling. You can view your Hadoop classpath using the command hadoop classpath. The classpath is passed to the javac command via its -cp option.
javac -cp /usr/lib/hadoop/client-0.20/\* -d wordcount_classes WordCount.java

jar -cvf wordcount.jar -C wordcount_classes/ .
  • Now run the application. The following command will create an output directory in HDFS
hadoop jar wordcount.jar org.myorg.WordCount wordcount/input wordcount/output

Demonstration of wordcount problem

  • Now check the output of the program. Notice that a file named part-00000 has been created in the output directory
hadoop fs -cat wordcount/output/part-00000
Class   2
Data    2
HW      2
Nice    4
Project 1
Science 2
and     2
program 1
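As a sanity check, the same counts can be reproduced outside Hadoop with a plain shell pipeline that mimics the three MapReduce stages: map (split the input into one word per line), shuffle (sort groups identical words together), and reduce (count each group):

```shell
# Word count with plain shell tools; no Hadoop required.
# tr plays the mapper, sort the shuffle, uniq -c the reducer.
echo "Data Science Class and Data Science HW" > file0
echo "Nice Class Nice HW Nice Project and Nice program" > file1

export LC_ALL=C   # byte ordering, so uppercase words sort first as in the output above
cat file0 file1 | tr ' ' '\n' | sort | uniq -c | awk '{print $2, $1}'
```

The pipeline prints the same eight word/count pairs as the MapReduce job, which is a handy way to verify the job's output on small inputs.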
  • Now remove everything we created for this demonstration. Please do not do this with your actual work.
hadoop fs -rm -r wordcount
rm -r -f wordcount_classes
rm -f file* wordcount.jar
  • You can use other files and put them in the input directory. The same process will count all the words.

Reading assignment and references