Big data technology

Mahbubul Majumder, PhD
Nov 25, 2014

Big data issues

  • We talked about big data at the beginning of the semester

    • The 3 Vs are the most commonly used characterization of the big data challenge
  • Volume of the data may exceed the physical storage capacity

    • Google receives over 2 million search queries every minute
    • transactional and sensor data are being stored every fraction of a second
  • Variety of data has no bound

    • YouTube and Facebook generate video, audio, image and text data
    • Over 200 million emails are sent every minute
  • Velocity of data can be overwhelming

    • Experiments at CERN generate colossal amounts of data.
      • Particles collide 600 million times per second.
      • Their Data Center processes about one petabyte of data every day.

Challenges with big data

  • Storage and cost are big issues

    • the CERN Large Hadron Collider throws away 99.999 percent of all the data it generates
      • even the data that is kept exceeds their local storage capacity
      • they built a worldwide grid system to store and process the data
  • The value of data depends on how it is used

    • we only have a finite amount of time to process it
    • processing power is not unbounded
  • How can this problem be solved?

Parallel computing

  • The limitation of processing power can be overcome to some extent

    • if a process takes a year to run, we can use 365 processors to run it in a day
      • the challenge is to be able to parallelize the task
      • another challenge is the failure of one or more processors
      • we need some sort of automation in this process
  • How can this parallel processing of data be performed in practice? (a minimal sketch follows)
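
A minimal sketch of the idea in plain Python (not Hadoop; the work() function, the chunking and the pool size are made up for illustration):

  # A toy sketch of splitting one long job into independent chunks and
  # running them on a pool of worker processes.
  from multiprocessing import Pool

  def work(chunk):
      # pretend each chunk is one day's share of a year-long computation
      return sum(x * x for x in chunk)

  if __name__ == "__main__":
      n, step = 10000000, 1000000
      chunks = [range(i, min(i + step, n)) for i in range(0, n, step)]
      with Pool(processes=4) as pool:          # e.g. 4 worker processes
          partials = pool.map(work, chunks)    # chunks run in parallel
      print(sum(partials))                     # combine the partial results

The sketch says nothing about the other two challenges: the chunking is done by hand, and if one worker fails the whole job fails. Hadoop automates both.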

What is Hadoop?

    __  __          __
   / / / /___ _____/ /___  ____  ____
  / /_/ / __ `/ __  / __ \/ __ \/ __ \
 / __  / /_/ / /_/ / /_/ / /_/ / /_/ /
/_/ /_/\__,_/\__,_/\____/\____/ .___/
                             /_/
  • It is a software platform

    • allows us to easily write and run data-related applications
    • facilitates processing and manipulating massive amounts of data
    • the processes are conveniently scalable
  • The core Hadoop system consists of

    • Hadoop Distributed File System (HDFS)
    • MapReduce: a system for parallel processing of massive amounts of data

Scalability of processes

  • A cluster consists of a set of servers that work together as if they were a single server

    • servers are usually connected through a Local Area Network (LAN)
    • they are controlled and scheduled by software
  • Each server in a cluster system is called a node

    • nodes are often configured with the same hardware and software (usually Linux)
    • nodes perform two jobs: storing and processing the data
    • this provides scalability in both storage and processing
  • The picture shows a simple homemade cluster setup, taken from Wikipedia


Why Hadoop

  • Hadoop provides a scalable process - to increase cluster capacity, add more nodes

  • Fault-tolerant system:

    • includes redundant nodes, automatically restarts failed jobs
    • each node processes a small amount of data
    • files are replicated on different nodes
  • Cost is very low without compromising performance

  • It has made it possible to handle massive amounts of data conveniently

Who uses Hadoop

  • Facebook uses Apache Hadoop to store copies of internal log and dimension data sources and uses it as a source for reporting/analytics and machine learning

  • eBay is using it for search optimization and research

  • Twitter uses Apache Hadoop to store and process tweets, log files, and many other types of data generated across Twitter

  • HCC (Holland Computing Center), University of Nebraska runs one medium-sized Hadoop cluster (1.6PB) to store and serve up physics data

  • Yahoo uses Hadoop to support research for Ad Systems and Web Search

  • For a more detailed list, you may visit the Hadoop users wiki

HDFS

  • Based on the Google paper about the Google File System

    • A scalable distributed file system for large distributed data-intensive applications
    • provides fault tolerance while running on inexpensive commodity hardware
    • delivers high aggregate performance
  • The file system works like the Unix file system

    • paths look similar (such as /myHdfs/testFile.txt)
    • file ownership and permissions are similar
  • But HDFS is not exactly a Unix file system

    • files can't be modified once written
    • to work on files, Hadoop-specific commands need to be used

How HDFS works

  • The name node controls everything

    • called the master daemon
    • contains the namespace metadata
    • monitors the slave nodes or data nodes
  • Data nodes store data and run jobs

    • called slave daemons
    • each data set is divided into blocks
    • blocks are replicated so that the failure of a data node does not damage the data (see the sketch after this list)
  • The end user does not have to worry about these details

    • files are accessed as if they were a single file
    • Hadoop-specific commands are used
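
A toy sketch of the block idea (128 MB blocks and 3 replicas are common HDFS defaults, but the round-robin placement below is invented purely for illustration; the real name node is rack-aware and much smarter):

  # Toy sketch of cutting a file into blocks and replicating each block.
  BLOCK_SIZE = 128 * 1024 * 1024          # bytes per block
  REPLICATION = 3                         # copies of every block
  DATA_NODES = ["node1", "node2", "node3", "node4", "node5"]

  def plan_blocks(file_size_bytes):
      """Return (block_id, [data nodes holding a replica]) for each block."""
      n_blocks = (file_size_bytes + BLOCK_SIZE - 1) // BLOCK_SIZE
      plan = []
      for b in range(n_blocks):
          nodes = [DATA_NODES[(b + r) % len(DATA_NODES)] for r in range(REPLICATION)]
          plan.append((b, nodes))
      return plan

  # a 1 GB file becomes 8 blocks, each stored on 3 different data nodes
  for block, nodes in plan_blocks(1024 * 1024 * 1024):
      print(block, nodes)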

HDFS and Linux

  • HDFS works like the Linux file system

    • Linux-like directory structure (such as myHdfs/myFile.txt)
    • you can list HDFS files as if they were Linux files
  • You have to use Hadoop-specific commands to talk to HDFS (see the examples below)

    • Hadoop file system commands start with hadoop fs
    • for example, to list all the files in an HDFS folder myHdfs:
      • hadoop fs -ls myHdfs
Found 1 items
-rw-r--r-- training training 669879113 2014-10-11 23:50 myHdfs/test.csv
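
A few more everyday file operations, wrapped here in a small Python helper as a sketch (it assumes the hadoop client is installed and on the PATH; the file and folder names are made up):

  # A small sketch that drives hadoop fs commands from Python; assumes a
  # local file myData.csv exists.
  import subprocess

  def hadoop_fs(*args):
      """Run one 'hadoop fs' subcommand and return its text output."""
      result = subprocess.run(["hadoop", "fs"] + list(args),
                              capture_output=True, text=True, check=True)
      return result.stdout

  hadoop_fs("-mkdir", "myHdfs")                  # create an HDFS folder
  hadoop_fs("-put", "myData.csv", "myHdfs/")     # copy a local file into HDFS
  print(hadoop_fs("-ls", "myHdfs"))              # list it, as in the example above
  print(hadoop_fs("-cat", "myHdfs/myData.csv"))  # print the file's contents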

MapReduce

  • Based on the second Google paper (about MapReduce).

    • each job is split into many small jobs
    • There are two procedures
      • map filters and sorts the data
      • reduce summarizes the data
      • reduce is not required; you can have a map-only process
    • this facilitates scalability and parallelization
  • Each job in the MapReduce process runs on a data node

    • jobs are simple and nodes perform similar jobs
    • combined together, the operation can be powerful and even complex
    • MapReduce programs need to be written with great care

MapReduce example

  • Suppose we want to count how many times each word appears in an input file

  • This example shows in detail how a MapReduce process runs

Hadoop and MapReduce

  • Hadoop splits the job into smaller pieces

    • the number of pieces depends on the input file size
  • MapReduce code in Hadoop is usually written in Java

    • but you can use R
    • or you can use any language with Hadoop streaming

Map function

(figure: the map step of the word count example)

The map function simply emits each word paired with the number 1
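
A minimal streaming-style map step in Python might look like the following sketch (the file name mapper.py is hypothetical; Hadoop streaming feeds the input text on standard input):

  #!/usr/bin/env python
  # mapper.py - word-count map step in Hadoop streaming style:
  # read text from standard input and emit "word<TAB>1" for every word.
  import sys

  for line in sys.stdin:
      for word in line.strip().split():
          print("%s\t%d" % (word, 1))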

Shuffle and sorting

(figure: shuffling of the map output by Hadoop)

  • Hadoop automatically shuffles and sorts the map output
  • If necessary, Hadoop will merge the output
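
Locally, the effect of the shuffle can be imitated by just sorting the map output so that identical words end up next to each other (a toy simulation with made-up data; on a cluster Hadoop does this for you between the map and reduce phases):

  # Toy simulation of the shuffle/sort phase: sorting the (word, 1) pairs
  # brings equal keys together, which is exactly what reduce expects.
  map_output = [("data", 1), ("big", 1), ("data", 1), ("hadoop", 1), ("big", 1)]

  for key, value in sorted(map_output):        # Hadoop sorts by key for you
      print("%s\t%d" % (key, value))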

Reduce function

(figure: the reduce step of the word count example)

The reduce function simply adds up those numbers for each word
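
A matching streaming-style reduce step sums the counts for each word; since the shuffle delivers the keys in sorted order, the reducer only has to watch for the word changing (reducer.py is again a hypothetical file name):

  #!/usr/bin/env python
  # reducer.py - word-count reduce step in Hadoop streaming style: input
  # arrives sorted by key, so accumulate counts until the word changes.
  import sys

  current_word, current_count = None, 0
  for line in sys.stdin:
      word, count = line.rstrip("\n").split("\t")
      if word != current_word:
          if current_word is not None:
              print("%s\t%d" % (current_word, current_count))
          current_word, current_count = word, 0
      current_count += int(count)

  if current_word is not None:                # don't forget the last word
      print("%s\t%d" % (current_word, current_count))

On a cluster, these two sketch scripts would typically be handed to the Hadoop streaming jar through its -mapper and -reducer options.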

MapReduce example

(figure: a MapReduce program's workflow to count the number of words)
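
Before submitting anything to a cluster, the two sketch scripts above can be chained together locally, with an ordinary sort standing in for Hadoop's shuffle (the input file sample.txt is made up; this is only a local test of the logic):

  # local_wordcount.py - chain the mapper and reducer sketches locally,
  # using an ordinary sort in place of Hadoop's shuffle phase.
  # Assumes mapper.py, reducer.py and a text file sample.txt exist here.
  import subprocess
  import sys

  with open("sample.txt", "rb") as f:
      mapped = subprocess.run([sys.executable, "mapper.py"], stdin=f,
                              capture_output=True, check=True).stdout

  shuffled = b"\n".join(sorted(mapped.splitlines())) + b"\n"   # stand-in shuffle

  reduced = subprocess.run([sys.executable, "reducer.py"], input=shuffled,
                           capture_output=True, check=True).stdout
  print(reduced.decode())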

Pig

  • Pig is a high-level programming language for working with data in HDFS
    • it is an alternative to writing low-level MapReduce programs
    • it feels similar to working with dplyr in R
titanic = LOAD '/user/training/titanic' AS (pclass, survived);
 groups = GROUP titanic BY pclass;
 result = FOREACH groups GENERATE group, SUM(titanic.survived) AS t;
DUMP result;
  • The Pig interpreter works as a wrapper
    • translates the user commands into MapReduce jobs
    • sends the instructions to the cluster

Hive

  • Hive is another high-level language for working with data in HDFS

    • it is also based on MapReduce
  • Like Pig, the Hive interpreter works as a wrapper

    • translates the user commands into MapReduce jobs
    • sends the instructions to the cluster
  • The Hive language is similar to SQL; it is called HiveQL

  SELECT pclass, sum(survived) as total_survived
    FROM titanic
GROUP BY pclass
ORDER BY pclass;

HBase

  • HBase is the Hadoop database

    • can handle tables of even petabyte size
    • tables can have thousands of columns
  • For HBase there is no SQL-like language, so it is called a NoSQL database

    • data is accessed through API-style connections rather than queries (see the sketch below)
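
For example, from Python one common route is the happybase package, which talks to HBase through its Thrift gateway; the sketch below assumes the HBase Thrift server is running on localhost, and the table and column names are made up:

  # Sketch of API-style access to HBase with the happybase package.
  import happybase

  connection = happybase.Connection("localhost")
  table = connection.table("users")

  # write one cell and read the row back; keys and values are plain bytes
  table.put(b"row1", {b"info:name": b"Alice"})
  print(table.row(b"row1"))

  # scan a few rows instead of issuing a SQL query
  for key, data in table.scan(limit=5):
      print(key, data)

  connection.close()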

Impala

  • Impala is another high-level query engine for working with HDFS and HBase
    • it does not rely on MapReduce tasks; it has its own execution technology
    • it is maintained by Cloudera
    • Cloudera claims that it is 10 times faster than Hive or any MapReduce2 jobs

R and Hadoop

  • You can use R to work with HDFS or HBase
    • more information about RHadoop can be found in their github repository
    • you need to install several packages
      • ravro - read and write files in avro format
      • plyrmr - higher level plyr-like data processing for structured data
      • rmr - for MapReduce functionality in R
      • rhdfs - for HDFS file management from within R
      • rhbase - for database management with HBase

Sqoop

  • Sqoop transports data from an RDBMS to HDFS
    • a sqoop import example:
 sqoop import \
    --connect jdbc:mysql://localhost \
    --username training --password training123 \
    --warehouse-dir /myHdfs \
    --table titanic
  • You can use sqoop export to send data from HDFS back to a database as well

Reading assignment and references