Mahbubul Majumder, PhD
Oct 16, 2014
XML stands for eXtensible Markup Language
Unlike HTML, it is designed to describe data
Data are self-describing in XML format
Data become verbose
XML separates data from HTML codes, so one can concentrate on
XML and HTML complement each other
Some data structures can be very messy, but with XML we can control some
Scalable Vector Graphics (SVG) are written in XML
The top node is the root and the bottom colored nodes are leaves
Each node is an XML tag
Each leaf is a data value
<?xml version="1.0"?>
<data>
  <person id='1'>
    <name>
      <first>Mahbubul</first>
      <last>Majumder</last>
    </name>
    <gender>Male</gender>
    <address>
      <street>6001 Dodge</street>
      <city>Omaha</city>
      <state>NE</state>
      <zip>68182</zip>
    </address>
  </person>
  ...
  <person id='n'>
    ...
  </person>
</data>
What does this code mean?
<data> is the root node
n child nodes (<person>), each with some hierarchy of subchild nodes
</data> is the end of the root node
An XML tree structure may have any number of levels of hierarchy in its child nodes.
<root>
  <child>
    <subchild>
      .....
    </subchild>
  </child>
</root>
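As a minimal sketch of how this hierarchy looks from R, the snippet below parses a trimmed copy of the person example from a string and walks from the root down to a leaf. Parsing from text with `asText = TRUE` and the names `txt`, `doc` and `root` are choices made for this illustration, not part of the original slides.

```r
library(XML)

# A trimmed copy of the person example, held in a string for illustration
txt <- "<?xml version='1.0'?>
<data>
  <person id='1'>
    <name><first>Mahbubul</first><last>Majumder</last></name>
    <gender>Male</gender>
  </person>
</data>"

doc  <- xmlParse(txt, asText = TRUE)   # parse from text instead of a file or URL
root <- xmlRoot(doc)

xmlName(root)                          # name of the root node
person <- root[["person"]]             # first child node named "person"
xmlAttrs(person)                       # attributes of that node
xmlValue(person[["name"]][["first"]])  # leaf value reached by walking down the tree
```

Indexing by name, as in `root[["person"]]`, returns the first child with that name, so it is convenient only when the name is unique at that level.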
To view the tips data, visit http://www.ggobi.org/book/data/tips.xml
Each triangle on the left will expand the view of the nodes
We notice the tree structure
R package XML provides functions to explore XML data
# install.packages("XML")
library(XML)
xmlParse() reads XML data into an R object. Here we will use the tips data from the ggobi web site
myUrl <- 'http://www.ggobi.org/book/data/tips.xml'
myDoc <- xmlParse(myUrl)
rvPath <- "//ggobidata/data/variables/realvariable"
rvList <- xpathApply(myDoc, rvPath, xmlAttrs)
rvNames <- as.vector(unlist(rvList))
rvNames
[1] "obs" "totbill" "tip" "size"
cvPath <- "//ggobidata/data/variables/categoricalvariable"
cvList <- xpathApply(myDoc, cvPath, xmlAttrs)
cvNames <- as.vector(unlist(cvList))
cvNames
[1] "sex" "smoker" "day" "time"
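The same attribute lookup can be written a little more compactly. The sketch below uses a small stand-in for the variables block (written by hand for this illustration, not downloaded from the ggobi site) and lets `xpathSApply()` simplify the result to a vector, with `xmlGetAttr()` picking out the `name` attribute.

```r
library(XML)

# Small hand-written stand-in for the <variables> block of tips.xml
txt <- "<variables count='3'>
  <realvariable name='totbill'/>
  <realvariable name='tip'/>
  <categoricalvariable name='sex'/>
</variables>"
doc <- xmlParse(txt, asText = TRUE)

# Extra arguments after the function are passed on to it, here the attribute name
rvNames <- xpathSApply(doc, "//realvariable", xmlGetAttr, "name")
rvNames
```

This avoids the `as.vector(unlist(...))` step, at the cost of reading only one attribute per node.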
datPath <- "//ggobidata/data/records/record"
datValue <- xpathApply(myDoc, datPath, xmlValue)
datValue <- strsplit(gsub('\\n','',datValue), split=" ")
head(datValue,2)
[[1]]
[1] "1" "16.99" "1.01" "1" "1" "4" "2" "2"
[[2]]
[1] "2" "10.34" "1.66" "2" "1" "4" "2" "3"
datValue is a list which contains each of the records of the tips data. We need to convert it to a data frame. Also, notice how badly the column names are arranged!
tipDat <- do.call(rbind.data.frame, datValue)
names(tipDat) <- c(rvNames[-4],cvNames,rvNames[4])
head(tipDat)
obs totbill tip sex smoker day time size
1 1 16.99 1.01 1 1 4 2 2
2 2 10.34 1.66 2 1 4 2 3
3 3 21.01 3.5 2 1 4 2 3
4 4 23.68 3.31 2 1 4 2 2
5 5 24.59 3.61 1 1 4 2 4
6 6 25.29 4.71 2 1 4 2 4
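The values extracted from XML arrive as text, not numbers. As a minimal sketch of one way to convert the measured variables, the snippet below rebuilds the first two records shown above by hand; the objects `recs`, `dat` and `numCols` are names introduced here for illustration.

```r
# First two records, copied from the head(datValue, 2) output above
recs <- list(
  c("1", "16.99", "1.01", "1", "1", "4", "2", "2"),
  c("2", "10.34", "1.66", "2", "1", "4", "2", "3")
)

# Stack the records into a character matrix, then make a data frame
dat <- as.data.frame(do.call(rbind, recs), stringsAsFactors = FALSE)
names(dat) <- c("obs", "totbill", "tip", "sex", "smoker", "day", "time", "size")

# Convert the measured variables from text to numbers
numCols <- c("obs", "totbill", "tip", "size")
dat[numCols] <- lapply(dat[numCols], as.numeric)

str(dat)
```

The categorical columns are left as text here so that their codes can be relabelled separately.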
The variables node is at depth 2 in the path and is the 2nd node in the list at that depth. Notice how it can be accessed directly by index.
r <- xmlRoot(myDoc)
varInfo <- r[[1]][[2]]
varInfo contains the values below
<variables count="8">
<realvariable name="obs"/>
<realvariable name="totbill"/>
<realvariable name="tip"/>
<categoricalvariable name="sex">
...
as.vector(unlist(xmlApply(varInfo, xmlAttrs)))
[1] "obs" "totbill" "tip" "sex" "smoker" "day" "time"
[8] "size"
cPath <- "//categoricalvariable/levels/level"
cLevels <- unlist(xpathApply(varInfo, cPath, xmlValue))
cLevels
[1] "F" "M" "No" "Yes" "Thu" "Fri" "Sat" "Sun"
[9] "Day" "Night"
cPaths <- "//categoricalvariable/levels"
lvLength <- unlist(xpathApply(varInfo, cPaths, xmlSize))
lvLength
[1] 2 2 4 2
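k <- 0  # number of level labels already used
for (i in seq_along(cvNames)) {
  # positions in cLevels holding the labels of the i-th categorical variable
  indx <- 1:lvLength[i] + k
  k <- k + lvLength[i]
  levels(tipDat[, cvNames[i]]) <- cLevels[indx]
}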
head(tipDat)
obs totbill tip sex smoker day time size
1 1 16.99 1.01 F No Sun Night 2
2 2 10.34 1.66 M No Sun Night 3
3 3 21.01 3.5 M No Sun Night 3
4 4 23.68 3.31 M No Sun Night 2
5 5 24.59 3.61 F No Sun Night 4
6 6 25.29 4.71 M No Sun Night 4
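A possible alternative to the index bookkeeping in the loop, sketched on a small stand-alone example: split the flat vector of labels into one group per variable, then rebuild each column with `factor()`. The toy data frame `toy` and the names `lvList` and `codes` are hypothetical, and the sketch assumes the codes in each column are the integers 1, 2, ... in the same order as the labels.

```r
cvNames  <- c("sex", "smoker")
cLevels  <- c("F", "M", "No", "Yes")
lvLength <- c(2, 2)

# Toy coded data standing in for the categorical columns of tipDat
toy <- data.frame(sex = c("1", "2", "2"), smoker = c("1", "1", "2"),
                  stringsAsFactors = FALSE)

# One character vector of labels per categorical variable
lvList <- split(cLevels, rep(cvNames, times = lvLength))

for (v in cvNames) {
  codes <- as.integer(toy[[v]])
  # Map code 1 to the first label, code 2 to the second, and so on
  toy[[v]] <- factor(codes, levels = seq_along(lvList[[v]]), labels = lvList[[v]])
}
toy
```

Because the labels are attached by code rather than by position in the factor, this version does not depend on whether the columns were read in as factors or as character strings.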
With * we match any node, whatever its name may be
getNodeSet(varInfo, '//*/levels')
xmlChildren(varInfo)[1:4]
r <- xmlRoot(myDoc)
## xmlName(r)
## xmlSize(r)
## xmlAttrs(r)
xmlValue(r[[1]][[2]])
[1] "FMNoYesThuFriSatSunDayNight"
For more, see ?xmlToDataFrame and ?xmlToList
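As a small illustration of `xmlToList()`, the sketch below converts a trimmed copy of the person example into a nested R list. The string `txt` is written by hand for this purpose, and the comment about `.attrs` describes the default behaviour as I understand it from the package documentation.

```r
library(XML)

# Trimmed copy of the person example, for illustration only
txt <- "<data><person id='1'><name><first>Mahbubul</first><last>Majumder</last></name><gender>Male</gender></person></data>"
doc <- xmlParse(txt, asText = TRUE)

# Nested nodes become nested lists; leaves become character strings
lst <- xmlToList(doc)
str(lst)

lst$person$name$first   # a leaf value
lst$person$.attrs       # node attributes are kept under .attrs by default
```

A list like this is easy to explore interactively, but for record-like data a data frame is usually the more convenient target.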
library(XML)
data <- head(women,4)
xmlDat <- xmlTree()
xmlDat$addTag("women", close=FALSE)
for (i in 1:nrow(data)) {
  xmlDat$addTag("vital", close=FALSE)
  for (j in names(data)) {
    xmlDat$addTag(j, data[i, j])
  }
  xmlDat$closeTag()
}
xmlDat$closeTag()
# saveXML() returns the document as a string; cat() only prints and returns NULL
myXML <- saveXML(xmlDat)
cat(myXML)
<?xml version="1.0"?>
<women>
<vital>
<height>58</height>
<weight>115</weight>
</vital>
<vital>
<height>59</height>
<weight>117</weight>
</vital>
<vital>
<height>60</height>
<weight>120</weight>
</vital>
<vital>
<height>61</height>
<weight>123</weight>
</vital>
</women>
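To check that nothing is lost on the way, the XML can be read back into a data frame. The following is a minimal round-trip sketch, assuming the default one-row-per-child behaviour of `xmlToDataFrame()`; the names `txt`, `doc` and `back` are introduced here for illustration.

```r
library(XML)

data <- head(women, 4)

# Build the same tree as above
xmlDat <- xmlTree()
xmlDat$addTag("women", close = FALSE)
for (i in 1:nrow(data)) {
  xmlDat$addTag("vital", close = FALSE)
  for (j in names(data)) xmlDat$addTag(j, data[i, j])
  xmlDat$closeTag()
}
xmlDat$closeTag()

# Serialise to a string, parse it again, and rebuild the data frame
txt  <- saveXML(xmlDat)
doc  <- xmlParse(txt, asText = TRUE)
back <- xmlToDataFrame(doc, stringsAsFactors = FALSE)
back[] <- lapply(back, as.numeric)   # values come back as text
back
```

Each `vital` node becomes one row and each of its children one column, which matches the shape of the original data.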
JSON stands for JavaScript Object Notation
Differences with XML
head(women)
height weight
1 58 115
2 59 117
3 60 120
4 61 123
5 62 126
6 63 129
{"women":[
{"height":"58", "weight":"115"},
{"height":"59", "weight":"117"},
{"height":"60", "weight":"120"},
{"height":"61", "weight":"123"},
{"height":"62", "weight":"126"},
{"height":"63", "weight":"129"}
]}
<html>
<body>
<h2>Display JSON data</h2>
<p id="p1"></p>
<script>
var myData = '{"name":"Mahbubul Majumder","dept":"Dept. of Mathematics","phone":"402 5542734"}'
var obj = JSON.parse(myData);
document.getElementById("p1").
innerHTML = obj.name + "<br>" +
obj.dept + "<br>" +
obj.phone;
</script>
</body>
</html>
R package rjson is created for working with JSON data
# install.packages("rjson")
library(rjson)
jData <- toJSON(head(women))
jData
[1] "{\"height\":[58,59,60,61,62,63],\"weight\":[115,117,120,123,126,129]}"
fromJSON(jData)
$height
[1] 58 59 60 61 62 63
$weight
[1] 115 117 120 123 126 129
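Note that `toJSON()` on a data frame writes one array per column, while the hand-written JSON earlier stores one object per row. A minimal sketch of producing and reading the row-wise layout with rjson follows; the names `rows`, `jRows`, `parsed` and `back` are introduced here for illustration.

```r
library(rjson)

dat <- head(women, 2)

# Turn the data frame into a list with one named list per row
rows  <- lapply(seq_len(nrow(dat)), function(i) as.list(dat[i, ]))
jRows <- toJSON(list(women = rows))
jRows

# Read it back and rebuild the data frame row by row
parsed <- fromJSON(jRows)
back   <- do.call(rbind, lapply(parsed$women, as.data.frame))
back
```

The row-wise layout is more verbose because the keys are repeated for every record, but it is the form many web services return.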
XML package information
http://www.omegahat.org/RSXML/shortIntro.pdf
A nice discussion on How to get XML data into R
http://stackoverflow.com/questions/21790059/reading-xml-data-in-r
To learn more about XML and how it works with examples please visit
http://www.w3schools.com/xml/default.asp
To learn about JSON and practice online please visit
http://www.w3schools.com/json/