Working in Data Science Lab

Mahbubul Majumder, PhD
Nov 13, 2014

Computing platform for data scientists

  • What is an ideal computing environment for a data scientist?

    • is it Windows or non-Windows?
    • how well prepared are you for that environment?
  • High-performance computing platforms are often non-Windows

    • think of the supercomputing system at PKI
    • they make working with data much more convenient, especially unstructured data
    • open-source platforms allow integration of countless scholarly efforts

Computing environment at the data science lab

  • We have Windows workstations

    • each machine is equipped with R, RStudio and MySQL
    • you can work with MySQL from R using dplyr
  • You can connect to the high-performance supercomputing facilities at PKI

    • use PuTTY; two-factor authentication is required
    • MySQL, R and Hadoop are available for use
    • Hadoop and MapReduce will run on at least 10 nodes
    • you need to register to get access
  • For convenience, we will use virtual machines to learn to work on a non-Windows platform

    • Windows users have VMware Player
    • datascienceVM is a Linux virtual machine with R, MySQL and Hadoop installed
    • Hadoop and MapReduce will run on a single node

Linux distribution information

  • Common Linux distributions
    Ubuntu, CentOS, Red Hat Linux

  • To learn which distribution you are working on, use the following command

cat /etc/issue
CentOS release 6.2 (Final)
Kernel \r on an \m
  • Which user are you working as?
who
mmajumder console  Nov 11 19:07 
mmajumder ttys000  Nov 13 17:01 
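
The two checks above can also be combined in a small script. This is only a minimal sketch: `/etc/os-release` is not mentioned in these slides and exists only on newer distributions, so the script falls back to `/etc/issue`, and the variable names `distro` and `current_user` are illustrative.

```shell
#!/bin/sh
# Report the distribution: prefer /etc/os-release (an assumption, found on
# newer systems), fall back to /etc/issue (used above), else say "unknown".
if [ -r /etc/os-release ]; then
    distro=$(. /etc/os-release && printf '%s' "${PRETTY_NAME:-unknown}")
elif [ -r /etc/issue ]; then
    distro=$(head -n 1 /etc/issue)
else
    distro="unknown"
fi

# `who` lists every login session; `id -un` prints only the current user name.
current_user=$(id -un)

echo "Distribution: $distro"
echo "Current user: $current_user"
```

Unlike `who`, which prints one line per login session, `id -un` returns a single name, which is easier to use inside scripts.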

Linux Virtual Machine

  • Each machine in the lab has a Linux virtual machine (LVM)

    • R, MySQL and Hadoop are installed
    • you can copy the machine and run it on your home Windows machine
    • Windows users will need VMware Player to run it; on a Mac, use VirtualBox
  • Always shut down the virtual machine (VM) properly

    • do not click the close (×) button to close the VM window
    • the VM has a dedicated button in the right corner to shut it down
    • your programs may crash if not shut down properly
    • you can save your work and give the updated VM to someone else
  • For all applications in the datascienceVM

    • user: training
    • password: training123

Common Linux commands

command  function                                            example
ls       list files and directories in the current location  ls -l or ls myFileName
pwd      display the path of the current working directory   pwd (then press Enter)
cd       change the directory                                cd .. to go up one level, or cd myFolder
mkdir    create a directory                                  mkdir newFolder
rm       remove a file; be cautious, it cannot be undone     rm myTempFile
vi       open a file in the text editor                      vi newFile.txt
cat      view the content of a file without opening it       cat myFile.txt
cp       copy a file or folder to a different destination    cp sourceFile destinationPath
ps       display currently running processes                 ps -a
man      display the help for a command                      man cp
  • Example
pwd
/Users/mmajumder/Box Sync/Teaching/stat4410-8416-Data-Science/lectures/21-datascience-lab
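
Several of the commands listed above can be tried together in a short session. The sketch below is an illustration and not part of the original slides: it works inside a scratch directory created with `mktemp -d` so that `rm` cannot touch anything of value, and the names `workdir`, `contents` and `remaining` are made up for the example.

```shell
#!/bin/sh
# Work inside a throwaway directory so nothing of value can be removed.
workdir=$(mktemp -d)
cd "$workdir"

mkdir newFolder                          # create a directory
echo "hello" > myFile.txt                # create a small file to work with
cp myFile.txt newFolder/                 # copy the file into the new directory
cd newFolder                             # move into it
pwd                                      # print the current working directory
ls -l                                    # long listing: shows myFile.txt
contents=$(cat myFile.txt)               # capture the content of the file
echo "$contents"
rm myFile.txt                            # remove the copy; this cannot be undone
remaining=$(ls | wc -l | tr -d ' ')      # count the entries left in newFolder
cd ..                                    # note the space between cd and ..
```

After the last line the shell is back in the scratch directory, where the original `myFile.txt` still exists; only the copy inside `newFolder` was removed.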

Creating presentations using R

library(knitr)
  • R Markdown syntax is used for

    • the title of each slide
    • bullet points
    • double-column layouts
    • R code, or even Python or Linux shell code
    • pictures and tables
    • math equations
  • The complete source code of this presentation can be found in the GitHub repository
    https://github.com/mamajumder/html-presentation
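
To give a feel for the syntax, here is a minimal sketch that writes a one-slide R Markdown file from the shell and renders it if R is available. It is not taken from the repository above: the file name `slides.Rmd`, the slide content, and the `ioslides_presentation` output format are all assumptions made for illustration, and rendering additionally requires the rmarkdown package and pandoc.

```shell
#!/bin/sh
# Write a tiny R Markdown slide deck into a scratch directory.
deck_dir=$(mktemp -d)
deck="$deck_dir/slides.Rmd"

# The quoted 'EOF' keeps the shell from expanding $ and backticks below.
cat > "$deck" <<'EOF'
---
title: "Working in Data Science Lab"
output: ioslides_presentation
---

## Slide title

- a bullet point
- inline R code: the cars data has `r nrow(cars)` rows
- a math equation: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
EOF

# Render only if R is available; report a failure instead of aborting.
if command -v Rscript >/dev/null 2>&1; then
    Rscript -e 'rmarkdown::render(commandArgs(TRUE)[1])' "$deck" \
        || echo "render failed (are rmarkdown and pandoc installed?)"
else
    echo "Rscript not found; wrote $deck without rendering"
fi
```

Each `##` heading starts a new slide, so adding slides is a matter of appending more headings and bullet lists to the `.Rmd` file.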

Reading assignment and references