Manipulating XML and JSON data

Mahbubul Majumder, PhD
Oct 16, 2014

What is XML?

It means EXtensible Markup Language
- uses tags like HTML
- but not HTML tags, instead user defined tags
Unlike HTML, it is designed to describe data
Data are self explained in XML format
- software independent
- hardware independent
- very useful for transforming data in various platform
- not too difficult to learn
Data becomes verbose
- for good reason though

How HTML works with XML?

XML separates data from HTML codes, so one can concentrate on
- HTML style or CSS to effectively display data
- XML data are stored in a different place and HTML can easily display
Both works as a complement of each other
- HTML helps display data
- XML makes data easily transferable to any platform
- XML also uses tags and attributes to describe data
Some data structures can be very messy, but with XML we can control some
- The usual table form just can't do it
Scalable Vector Graphics (SVG) are written in XML
- some of the figures you see on the web are not pictures
- they are SVG and extremely scalable

XML data tree

Top node is root and bottom colored nodes are leaves
Each node is an XML tag
Each leaf is data value

xml-tree

XML data structure

The tags are user defined, not HTML tags

<?xml version="1.0"?>
<data>
 <person id='1'>
  <name>
    <first>Mahbubul</first>
    <last>Majumder</last>
  </name>
  <gender>Male</gender>
  <address>
    <street>6001 Dodge</street>
    <city>Omaha</city>
    <state>NE</state>
    <zip>68182</zip>
  </address>
 </person>
  ...
 <person id='n'>
  ...
 </person>
</data>

What does this code mean?
- first line indicates xml version
- <data> is the root node
- there are n child nodes with some hierarchy of subchild nodes
- each child has an attribute id
- </data> is end of the root node
XML tree structure may have any number of hierarchy in child nodes.

<root>
  <child>
    <subchild>
       .....
    </subchild>
  </child>
</root>

Tips data in XML format

To view tips data visit here

Each triangle on the left will expand the view of the nodes

We notice the tree structure

tips-data-xml

Reading XML data in R

R package XML provides functions to explore XML data

# install.packages("XML")
library(XML)

Very first step to work with XML file is to examine the file and view the structure
Then parse the XML data to R object. Here we will use tips data from ggobi web site

myUrl <- 'http://www.ggobi.org/book/data/tips.xml'
myDoc <- xmlParse(myUrl)

Notice that there are two types of variables, real and categorical variables. First we will extract the real variable names.

rvPath <- "//ggobidata/data/variables/realvariable"
rvList <- xpathApply(myDoc, rvPath, xmlAttrs)
rvNames <- as.vector(unlist(rvList))
rvNames

[1] "obs"     "totbill" "tip"     "size"

Reading XML data in R cont..

Now we will extract the categorical variable names.

cvPath <- "//ggobidata/data/variables/categoricalvariable"
cvList <- xpathApply(myDoc, cvPath, xmlAttrs)
cvNames <- as.vector(unlist(cvList))
cvNames

[1] "sex"    "smoker" "day"    "time"

Extracting data. Notice these are not attributes but values. Also split uses two spaces.

datPath <- "//ggobidata/data/records/record"
datValue <- xpathApply(myDoc, datPath, xmlValue)
datValue <- strsplit(gsub('\\n','',datValue), split="  ")
head(datValue,2)

[[1]]
[1] "1"     "16.99" "1.01"  "1"     "1"     "4"     "2"     "2"    

[[2]]
[1] "2"     "10.34" "1.66"  "2"     "1"     "4"     "2"     "3"

Reading XML data in R cont..

datValue is a list which contains each of the records of tips data. We need to convert it to a data frame. Also, notice how badly column names are arranged !

tipDat <- do.call(rbind.data.frame, datValue)
names(tipDat) <- c(rvNames[-4],cvNames,rvNames[4])
head(tipDat)

  obs totbill  tip sex smoker day time size
1   1   16.99 1.01   1      1   4    2    2
2   2   10.34 1.66   2      1   4    2    3
3   3   21.01  3.5   2      1   4    2    3
4   4   23.68 3.31   2      1   4    2    2
5   5   24.59 3.61   1      1   4    2    4
6   6   25.29 4.71   2      1   4    2    4

XML data structure could widely vary. The purpose of this presentation is to demonstrate how XML data could be explored. There could be smarter way of doing the same thing we did. But this gives us the insight of XML data structure.

Getting column names of tips data

Notice that the node, variables, is at depth 2 in the path and 2nd in the list at that depth. Also, notice how that can be accessed directly.

r <- xmlRoot(myDoc)
varInfo <- r[[1]][[2]]

varInfo contains values as below

<variables count="8">
  <realvariable name="obs"/>
  <realvariable name="totbill"/>
  <realvariable name="tip"/>
  <categoricalvariable name="sex">
  ...

We need to obtain all the name attributes which will be the column names

as.vector(unlist(xmlApply(varInfo, xmlAttrs)))

[1] "obs"     "totbill" "tip"     "sex"     "smoker"  "day"     "time"   
[8] "size"

Levels of categorical variables

cPath <- "//categoricalvariable/levels/level"
cLevels <- unlist(xpathApply(varInfo, cPath, xmlValue))
cLevels

 [1] "F"     "M"     "No"    "Yes"   "Thu"   "Fri"   "Sat"   "Sun"  
 [9] "Day"   "Night"

cPaths <- "//categoricalvariable/levels"
lvLength <- unlist(xpathApply(varInfo, cPaths, xmlSize))
lvLength

[1] 2 2 4 2

This means first 2 levels are for first variable (sex), next 2 for 2nd variable(smoker), next 4 are for third variable(day) and final 2 levels are for the last categorical variable(time).

Updating the levels of categorical variables

k <- 0
for (i in seq(cvNames)){
  indx <-  1:lvLength[i] + k
  k <- k + lvLength[i]
  levels(tipDat[, cvNames[i]]) <- cLevels[indx]
  }

head(tipDat)

  obs totbill  tip sex smoker day  time size
1   1   16.99 1.01   F     No Sun Night    2
2   2   10.34 1.66   M     No Sun Night    3
3   3   21.01  3.5   M     No Sun Night    3
4   4   23.68 3.31   M     No Sun Night    2
5   5   24.59 3.61   F     No Sun Night    4
6   6   25.29 4.71   M     No Sun Night    4

Homework
How can you do the same task without a loop?

Other useful functions in XML package

To get all the node sets under a node. By * we indicate whatever the node be

getNodeSet(varInfo,'//*/levels')

To access some children node of a node

xmlChildren(varInfo)[1:4]

What are these commands doing?

r <- xmlRoot(myDoc)
## xmlName(r)
## xmlSize(r)
## xmlAttrs(r)
xmlValue(r[[1]][[2]])

[1] "FMNoYesThuFriSatSunDayNight"

Please ? xmlToDataFrame and ? xmlToList

Saving data into XML format?

library(XML)
data <- head(women,4)

xmlDat <- xmlTree()
xmlDat$addTag("women", close=FALSE)
for (i in 1:nrow(data)) {
    xmlDat$addTag("vital", close=FALSE)
    for (j in names(data)) {
        xmlDat$addTag(j, data[i, j])
    }
    xmlDat$closeTag()
}
xmlDat$closeTag()

To view the data we use cat command

myXML <- cat(saveXML(xmlDat))

<?xml version="1.0"?>

<women>
  <vital>
    <height>58</height>
    <weight>115</weight>
  </vital>
  <vital>
    <height>59</height>
    <weight>117</weight>
  </vital>
  <vital>
    <height>60</height>
    <weight>120</weight>
  </vital>
  <vital>
    <height>61</height>
    <weight>123</weight>
  </vital>
</women>

What is JSON?

It stands for JavaScript Object Notation
- main goal is to store and access data for the web
- can be an alternative to XML
- much easier to understand
- very flexible to work with JavaScript
- it is also language independent and self describing data format
Differences with XML
- it does not use tags
- simpler than XML and easier to handle

JSON data notation

head(women)

  height weight
1     58    115
2     59    117
3     60    120
4     61    123
5     62    126
6     63    129

JSON format could be

{"women":[
    {"height":"58", "weight":"115"}, 
    {"height":"59", "height":"117"},
    {"height":"60", "height":"120"},
    {"height":"61", "weight":"123"}, 
    {"height":"62", "height":"126"},
    {"height":"63", "height":"129"}    
]}

Displaying JSON data using JavaScript

<html>
<body>

<h2>Display JSON data</h2>

<p id="p1"></p>

<script>
var myData = '{"name":"Mahbubul Majumder","dept":"Dept. of Mathematics","phone":"402 5542734"}'

var obj = JSON.parse(myData);

document.getElementById("p1").
  innerHTML = obj.name + "<br>" + 
  obj.dept + "<br>" +
  obj.phone;
</script>

</body>
</html>

Display JSON data

Getting JSON data into R

The package rjson is created for this

# install.packages("rjson")
library(rjson)

jData <- toJSON(head(women))
jData

[1] "{\"height\":[58,59,60,61,62,63],\"weight\":[115,117,120,123,126,129]}"

fromJSON(jData)

$height
[1] 58 59 60 61 62 63

$weight
[1] 115 117 120 123 126 129

Reading assignment and references

XML package information
http://www.omegahat.org/RSXML/shortIntro.pdf
A nice discussion on How to get XML data into R
http://stackoverflow.com/questions/21790059/reading-xml-data-in-r
To learn more about XML and how it works with examples please visit
http://www.w3schools.com/xml/default.asp
To learn about JSON and practice online please visit
http://www.w3schools.com/json/