Hadoop and Mapreduce: a demonstration of wordcount

Mahbubul Majumder, PhD
Dec 2, 2014

hadoop fs -ls

to create a directory called myHdfs use

hadoop fs -mkdir myHdfs

The syntax for copying files to/from Hadoop File System (HDFS) is
hadoop fs -put <linux source> <hdfs destination>
Example: you can copy a file from Linux system to newly created HDFS directory using command

hadoop fs -put myLinuxFile.txt myHdfs/fileName.txt

hadoop fs -get myHdfs/fileName.txt myLinuxFile.txt

By default the above commands will copy the Linux files in the current directory.
- to copy in a specific Linux directory, you can specify the path
- same for HDFS as well. specify the HDFS path to copy a file from there

We have high performance computing facilities at PKI. View more details.
- several clusters crane, tusker, sandhill and red
- you can request for as much as 500GB of RAM
- crane has 452 nodes (64GB RAM per node)
For our class we have a small 10 nodes cluster setup at PKI
- let us run hadoop from there
  - for login use
    ssh <username>@crane.unl.edu, it will require two way authentication and then
    ssh <username>@10.138.11.29

Hadoop demo to be shown at PKI

We want to count number of words in a file or a number of files
- write a JAVA mapreduce program
- test it in a virtual machine before running it in live system
We will use built in virtual machine provided by Cloudera
- download QuickStart virtual machine from cloudera
  - if you download VMware version, you should be able to run it using VMware Player
  - the current version is CDH 5.2x which includes R, MySQL, Hadoop, Pig installed
  - size of the file is about 3GB and it requires at least 4GB of RAM (the more is better)
  - it is a single node machine with Linux (CentOS). View more details
  - user: cloudera and password: cloudera for all applications
Run your virtual machine using VMWare Player. You are now ready test your mapreduce codes.

hadoop fs -mkdir wordcount
hadoop fs -mkdir wordcount/input

echo "Data Science Class and Data Science HW" > file0
echo "Nice Class Nice HW Nice Project and Nice program" > file1

Notice that the files are created in the Linux file system. We need to copy them to HDFS

hadoop fs -put file* wordcount/input

Here we used file* to indicate both files file0 and file1. Make sure the files are copied properly. To check that use

hadoop fs -ls wordcount/input

mkdir wordcount_classes

Get the JAVA source code from here. Save codes in WordCount.java in your current directory.
Build the java archive (jar) after compiling. You can view your hadoop class path using command hadoop classpath. The classpath is given as the option of javac command.

javac -cp /usr/lib/hadoop/client-0.20/\* -d wordcount_classes WordCount.java

jar -cvf wordcount.jar -C wordcount_classes/ .

Now run the application. The following command will create an output directory in HDFS

hadoop jar wordcount.jar org.myorg.WordCount wordcount/input wordcount/output

Now check the output of the program. Notice a file named part-00000 is created in output directory

hadoop fs -cat wordcount/output/part-00000

Class   2
Data    2
HW      2
Nice    4
Project 1
Science 2
and     2
program 1

Now remove everything what we created for this demonstration. Please do not do this for your actual work.

hadoop fs -rm -r wordcount
rm -r -f wordcount_classes
rm -f file* wordcount.jar

You can use other files and put them in the input directory . The same process will count all the words.