Mahbubul Majumder, PhD
Nov 25, 2014
We talked about big data at the beginning of the semester
Volume of the data may exceed the physical storage capacity
Variety of data does not have any bound
Velocity of incoming data can be very high
Storage and cost are big issues
The Value of data depends on how they are used
How can this problem be solved?
The limitation of processing power can be overcome, at some scale, by processing the data in parallel
How can this challenge of parallel data processing be handled?
Hadoop
Hadoop is a software platform
The core Hadoop system runs on a cluster of servers
A cluster consists of a set of servers that work together as if they were a single server
Each server in a cluster system is called a node
The picture shows a simple homemade cluster setup, taken from Wikipedia
Hadoop is a scalable system - to increase cluster capacity, add more nodes
Fault-tolerant system - data are replicated across nodes, so a failed node does not stop the work or lose data
Cost is very low without compromising the performance, since ordinary commodity hardware is used
This made it possible to handle a massive amount of data conveniently
Facebook uses Apache Hadoop to store copies of internal log and dimension data sources and uses it as a source for reporting/analytics and machine learning
eBay uses it for search optimization and research
Twitter uses Apache Hadoop to store and process tweets, log files, and many other types of data generated across Twitter
HCC (Holland Computing Center), University of Nebraska runs one medium-sized Hadoop cluster (1.6PB) to store and serve up physics data
Yahoo uses Hadoop to support research for Ad Systems and Web Search
For a detailed list you may like to visit the Hadoop users wiki
HDFS is based on the Google paper about the Google File System
The file system works like the Unix file system
But HDFS is not exactly a Unix file system
The name node controls everything
The data nodes store data and run jobs
The end user does not have to worry about these details
HDFS works like a Linux file system
But you have to use Hadoop-specific commands to talk to it
hadoop fs                  # with no arguments, lists the available HDFS file commands
hadoop fs -ls myHdfs       # list the contents of the HDFS directory myHdfs
Found 1 items
-rw-r--r-- training training 669879113 2014-10-11 23:50 myHdfs/test.csv
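The same files can also be reached from a Java program through the HDFS API; here is a minimal sketch (the class name ListHdfs is made up for illustration) that mirrors hadoop fs -ls:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: list the files in the HDFS directory myHdfs, like 'hadoop fs -ls myHdfs'
public class ListHdfs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // reads the cluster settings (core-site.xml)
        FileSystem fs = FileSystem.get(conf);       // handle to HDFS, not the local file system
        for (FileStatus file : fs.listStatus(new Path("myHdfs"))) {
            System.out.println(file.getPath() + "\t" + file.getLen() + " bytes");
        }
    }
}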
MapReduce is based on the 2nd Google paper.
map - filters and sorts the data
reduce - summarizes the data
Each job in the MapReduce process is processed on a data node
Suppose we want to count the number of words in an input file
This example shows in detail how a MapReduce process runs
Hadoop splits the job into smaller pieces
MapReduce code in Hadoop is usually written in Java
The map function just assigns the number 1 to each word
Hadoop then shuffles the output, grouping the pairs by word
The reduce function just adds up those numbers for each word
The picture shows a MapReduce program's workflow to count the number of words: the mapping, the shuffling by Hadoop, and the reducer task
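As a rough sketch of what that Java code looks like (assuming the Hadoop 2 mapreduce API; the class names WordCount, WordMapper and SumReducer and the command-line paths are made up for illustration):

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map task: emit (word, 1) for every word found in the input line
    public static class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reduce task: after Hadoop shuffles the pairs, add up the 1's for each word
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(WordMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));      // input directory in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));    // output directory in HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The compiled job would then be packaged as a jar and submitted with something like hadoop jar wordcount.jar WordCount input output.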
Pig is a high level tool for working with data in HDFS; the Pig interpreter works as a wrapper, turning Pig Latin scripts into MapReduce jobs
-- total number of survivors per passenger class in the titanic data
titanic = LOAD '/user/training/titanic' AS (pclass, survived);
groups = GROUP titanic BY pclass;
result = FOREACH groups GENERATE group, SUM(titanic.survived) AS t;
DUMP result;
Hive is another high level programming language to work with HDFS
Like Pig, the Hive interpreter works as a wrapper as well, translating queries into MapReduce jobs
The Hive language is similar to SQL; it is called HiveQL
SELECT pclass, sum(survived) as total_survived
FROM titanic
GROUP BY pclass
ORDER BY pclass;
HBase is the Hadoop database
There is no SQL-like language for HBase; it is a NoSQL database, accessed through its own API
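As a rough sketch of what that API looks like from Java (the table name titanic, column family cf, row key, and class name HBaseSketch are made up for illustration; this assumes the classic HTable client API):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

// Sketch: write one cell to an HBase table and read it back - no SQL involved
public class HBaseSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "titanic");              // hypothetical table

        Put put = new Put(Bytes.toBytes("passenger1"));          // row key
        put.add(Bytes.toBytes("cf"), Bytes.toBytes("pclass"),    // column family and column
                Bytes.toBytes("1"));
        table.put(put);

        Result row = table.get(new Get(Bytes.toBytes("passenger1")));
        System.out.println(Bytes.toString(
                row.getValue(Bytes.toBytes("cf"), Bytes.toBytes("pclass"))));
        table.close();
    }
}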
RHadoop is a collection of R packages for working with Hadoop from R; they can be found in their github repository
ravro - read and write files in avro format
plyrmr - higher level plyr-like data processing for structured data
rmr - for MapReduce functionality in R
rhdfs - for HDFS file management from within R
rhbase - for database management with HBase
Sqoop imports data from a relational database (here MySQL) into HDFS:
sqoop import \
--connect jdbc:mysql://localhost \
--username training --password training123 \
--warehouse-dir /myHdfs \
--table titanic
Apache Hadoop web site
http://hadoop.apache.org/
Google File System paper
http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/archive/gfs-sosp2003.pdf
Google MapReduce paper
http://static.googleusercontent.com/media/research.google.com/en/us/archive/mapreduce-osdi04.pdf