Mahbubul Majumder, PhD
Nov 25, 2014
We talked about big data at the beginning of the semester
Volume of the data may exceed the physical storage capacity
Variety of data does not have any bound
Velocity of incoming data can be very high
Storage and cost are big issues
The Value of data depends on how they are used
How can this problem be solved?
The limitation of processing power can be overcome, at some scale, by processing the data in parallel
How can this challenge of parallel data processing be handled?
Hadoop
Hadoop is a software platform
The core Hadoop system runs on a cluster of servers
A cluster consists of a set of servers that work together as if they were a single server
Each server in a cluster system is called a node
The picture shows a simple homemade cluster setup, taken from Wikipedia
Hadoop is a scalable system - to increase cluster capacity, add more nodes
Fault-tolerant system - data are replicated across nodes, so a failed node does not stop the work or lose data
Cost is very low without compromising the performance, since ordinary commodity hardware is used
This made it possible to handle a massive amount of data conveniently
Facebook uses Apache Hadoop to store copies of internal log and dimension data sources and uses it as a source for reporting/analytics and machine learning
eBay uses it for search optimization and research
Twitter uses Apache Hadoop to store and process tweets, log files, and many other types of data generated across Twitter
HCC (Holland Computing Center), University of Nebraska runs one medium-sized Hadoop cluster (1.6PB) to store and serve up physics data
Yahoo uses Hadoop to support research for Ad Systems and Web Search
For a detailed list you may like to visit the Hadoop users wiki
HDFS is based on the Google paper about the Google File System
The file system works like the Unix file system
But HDFS is not exactly a Unix file system
The name node controls everything
The data nodes store data and run jobs
The end user does not have to worry about these details
HDFS works like a Linux file system
But you have to use Hadoop-specific commands to talk to it
hadoop fs                  # with no arguments, lists the available HDFS file commands
hadoop fs -ls myHdfs       # list the contents of the HDFS directory myHdfs
Found 1 items
-rw-r--r-- training training 669879113 2014-10-11 23:50 myHdfs/test.csv
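The same files can also be reached from a Java program through the HDFS API; here is a minimal sketch (the class name ListHdfs is made up for illustration) that mirrors hadoop fs -ls:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: list the files in the HDFS directory myHdfs, like 'hadoop fs -ls myHdfs'
public class ListHdfs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // reads the cluster settings (core-site.xml)
        FileSystem fs = FileSystem.get(conf);       // handle to HDFS, not the local file system
        for (FileStatus file : fs.listStatus(new Path("myHdfs"))) {
            System.out.println(file.getPath() + "\t" + file.getLen() + " bytes");
        }
    }
}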
MapReduce is based on the 2nd Google paper.
map - filters and sorts the data
reduce - summarizes the data
Each job in the MapReduce process is processed on a data node
Suppose we want to count the number of words in an input file
This example shows in detail how a MapReduce process runs
Hadoop splits the job into smaller pieces
MapReduce code in Hadoop is usually written in Java
The map function just assigns the number 1 to each word
Hadoop then shuffles the output, grouping the pairs by word
The reduce function just adds up those numbers for each word
The picture shows a MapReduce program's workflow to count the number of words: the mapping, the shuffling by Hadoop, and the reducer task
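As a rough sketch of what that Java code looks like (assuming the Hadoop 2 mapreduce API; the class names WordCount, WordMapper and SumReducer and the command-line paths are made up for illustration):

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map task: emit (word, 1) for every word found in the input line
    public static class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reduce task: after Hadoop shuffles the pairs, add up the 1's for each word
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(WordMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));      // input directory in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));    // output directory in HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The compiled job would then be packaged as a jar and submitted with something like hadoop jar wordcount.jar WordCount input output.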
Pig is a high level tool for working with data in HDFS; the Pig interpreter works as a wrapper, turning Pig Latin scripts into MapReduce jobs
-- total number of survivors per passenger class in the titanic data
titanic = LOAD '/user/training/titanic' AS (pclass, survived);
groups = GROUP titanic BY pclass;
result = FOREACH groups GENERATE group, SUM(titanic.survived) AS t;
DUMP result;
Hive is another high level programming language to work with HDFS
Like Pig, the Hive interpreter works as a wrapper as well, translating queries into MapReduce jobs
The Hive language is similar to SQL; it is called HiveQL
SELECT pclass, sum(survived) as total_survived
FROM titanic
GROUP BY pclass
ORDER BY pclass;
HBase is the Hadoop database
There is no SQL-like language for HBase; it is a NoSQL database, accessed through its own API
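As a rough sketch of what that API looks like from Java (the table name titanic, column family cf, row key, and class name HBaseSketch are made up for illustration; this assumes the classic HTable client API):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

// Sketch: write one cell to an HBase table and read it back - no SQL involved
public class HBaseSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "titanic");              // hypothetical table

        Put put = new Put(Bytes.toBytes("passenger1"));          // row key
        put.add(Bytes.toBytes("cf"), Bytes.toBytes("pclass"),    // column family and column
                Bytes.toBytes("1"));
        table.put(put);

        Result row = table.get(new Get(Bytes.toBytes("passenger1")));
        System.out.println(Bytes.toString(
                row.getValue(Bytes.toBytes("cf"), Bytes.toBytes("pclass"))));
        table.close();
    }
}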
RHadoop is a collection of R packages for working with Hadoop from R; they can be found in their github repository
ravro - read and write files in avro format
plyrmr - higher level plyr-like data processing for structured data
rmr - for MapReduce functionality in R
rhdfs - for HDFS file management from within R
rhbase - for database management with HBase
Sqoop imports data from a relational database (here MySQL) into HDFS:
sqoop import \
--connect jdbc:mysql://localhost \
--username training --password training123 \
--warehouse-dir /myHdfs \
--table titanic
Apache Hadoop web site
http://hadoop.apache.org/
Google File System paper
http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/archive/gfs-sosp2003.pdf
Google MapReduce paper
http://static.googleusercontent.com/media/research.google.com/en/us/archive/mapreduce-osdi04.pdf