A parser is a software component that takes input data (frequently text) and builds a data structure (often some kind of parse tree, abstract syntax tree, or other hierarchical structure) that gives a structural representation of the input, checking for correct syntax in the process.
5.2 R package "xml2"
The package "xml2" is designed for one major purpose, namely, to parse XML and HTML content. Remember that HTML is closely related to XML: both describe a tree of tagged elements, although strictly speaking only XHTML is an XML dialect.
As of this writing, "xml2" has minimal functionality for writing content in XML. Hadley Wickham has mentioned that he plans to add more functions for writing XML. So it is possible that in the future, "xml2" integrates more writing-XML functionality. Having said that, we will focus exclusively on reading XML content.
We’ll cover four major types of tasks that we can perform with "xml2":
parsing (i.e. reading) XML / HTML content
obtaining descriptive information about parsed contents
navigating the tree structure (i.e. accessing its components)
querying and extracting data from parsed contents
5.2.1 Parsing Functions
There are two main parsing functions:
read_xml()
read_html()
For XML files in general, you should use read_xml(). For HTML files, it’s better to use read_html() because it is more robust and can handle HTML that is not well-formed, which is common in practice.
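The difference in strictness can be sketched with a small experiment. The snippet below is only an illustration: the string `messy` is a made-up piece of markup (not taken from any real page) in which the `<p>` and `<br>` tags are never closed. The strict parser is expected to reject it, while the HTML parser repairs it.

```r
library(xml2)

# Hypothetical markup: <p> and <br> are opened but never closed
messy <- "<html><body><p>First paragraph<br><p>Second paragraph</body></html>"

# read_xml() is strict: the mismatched tags raise an error,
# which we catch and replace with a short message
strict <- tryCatch(read_xml(messy), error = function(e) "parse error")
strict

# read_html() is lenient: it repairs the markup and returns a document
lenient <- read_html(messy)
class(lenient)
```

Wrapping read_xml() in tryCatch() is just one way to keep the script running after the failure; in practice you would simply pick read_html() for HTML input.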
The main input for these reading functions is either a string, an R connection, or a raw vector.
The string can be a path, a URL, or literal XML. URLs will be converted into connections using either base::url() or, if installed, curl::curl(). Local paths ending in .gz, .bz2, .xz, or .zip will be automatically uncompressed.
Both read_xml() and read_html() return an object of class "xml_document".
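As a rough sketch of the different kinds of input, the code below parses the same content three ways: from a literal string, from a raw vector, and from a local file. The `<note>` content and the temporary file are invented for this illustration.

```r
library(xml2)

# Hypothetical XML content used only for this illustration
note_string <- "<note><to>Ana</to><body>Hello</body></note>"

# 1) literal XML passed as a character string
from_string <- read_xml(note_string)

# 2) the same content passed as a raw vector
from_raw <- read_xml(charToRaw(note_string))

# 3) the same content read from a (temporary) local file
tmp <- tempfile(fileext = ".xml")
writeLines(note_string, tmp)
from_file <- read_xml(tmp)
unlink(tmp)

# in all three cases the result inherits from "xml_document"
class(from_string)
inherits(from_raw, "xml_document")
inherits(from_file, "xml_document")
```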
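Let’s see an example. Consider one of the examples from the previous chapter, for instance the following movie element in XML:
<movie>
  <title>Good Will Hunting</title>
  <director>
    <first_name>Gus</first_name>
    <last_name>Van Sant</last_name>
  </director>
  <year>1998</year>
  <genre>drama</genre>
</movie>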
For illustration purposes, let’s take the XML content, treating it as a single character string, that we then pass to read_xml():
library(xml2)

# toy example with xml string
movie <- read_xml("<movie><title>Good Will Hunting</title><director><first_name>Gus</first_name><last_name>Van Sant</last_name></director><year>1998</year><genre>drama</genre></movie>")
movie
This type of object has an internal structure in order to maintain the hierarchical tree-structure of any XML content.
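One way to get a feel for that tree structure is xml_structure(), another function exported by "xml2" that prints only the skeleton of a document (element names, without the text). The sketch below rebuilds the movie object so that it can be run on its own.

```r
library(xml2)

# same toy movie string as in the parsing example
movie <- read_xml("<movie><title>Good Will Hunting</title><director><first_name>Gus</first_name><last_name>Van Sant</last_name></director><year>1998</year><genre>drama</genre></movie>")

# print the nested element structure of the parsed document
xml_structure(movie)
```

The indentation in the printed result mirrors the nesting of the nodes: first_name and last_name appear one level below director, which in turn sits one level below movie.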
5.3 Working with parsed documents
Having parsed an XML / HTML document, we can use 2 main functions to start working on the tree structure:
xml_root() gets access to the root node and its elements
xml_children() gets access to the children nodes of a given node
5.3.1 Example with a basic XML document
Here’s some content: a movie element in XML syntax.
[Figure: XML Movie]
The following figure identifies the main nodes:
[Figure: XML Movie nodes]
Below is an abstract representation of an XML file and its main nodes:
[Figure: Functions of xml2]
5.3.2 More Functions in "xml2"
In addition to xml_root() and xml_children(), there are other functions for accessing the various kinds of content within a given node.
Here’s a table with the main navigation functions. Keep in mind that the applicability of the functions depends on the class of objects we are working on.
| Function | Description |
|---|---|
| xml_root() | Returns the root node |
| xml_children() | Returns the children nodes |
| xml_child() | Returns a specific child, selected by position or name |
| xml_name() | Returns the name of a node |
| xml_contents() | Returns the contents of a node |
| xml_text() | Returns the text of a node |
| xml_length() | Returns the number of children nodes |
| xml_parents() | Returns the set of parent (ancestor) nodes |
| xml_siblings() | Returns the set of sibling nodes |
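To make the table more concrete, here is a minimal sketch that applies several of these functions to a tiny document. The object name `sample_movie` and its shortened content are made up for this illustration; the comments show the values we expect for this particular input.

```r
library(xml2)

# Hypothetical, shortened movie document with two child elements
sample_movie <- read_xml("<movie><title>Good Will Hunting</title><year>1998</year></movie>")

xml_name(sample_movie)                 # "movie"
xml_length(sample_movie)               # 2
xml_name(xml_children(sample_movie))   # "title" "year"

# pick the first child and inspect it
title <- xml_child(sample_movie, 1)
xml_name(title)                        # "title"
xml_text(title)                        # "Good Will Hunting"

# relatives of the title node
xml_name(xml_siblings(title))          # "year"
xml_name(xml_parents(title))           # "movie"
```

Note that most of these functions accept either a single node or a set of nodes, which is why xml_name() can be applied directly to the result of xml_children().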
5.3.3 Navigation of XML / HTML Tree
Let’s consider the following XML content:
<movies>
  <movie mins="126" lang="eng">
    <title>Good Will Hunting</title>
    <director>
      <first_name>Gus</first_name>
      <last_name>Van Sant</last_name>
    </director>
    <year>1998</year>
    <genre>drama</genre>
  </movie>
  <movie mins="106" lang="spa">
    <title>Y tu mama tambien</title>
    <director>
      <first_name>Alfonso</first_name>
      <last_name>Cuaron</last_name>
    </director>
    <year>2001</year>
    <genre>drama</genre>
  </movie>
</movies>
This content can be depicted in the following tree diagram:
[Figure: XML movies tree]
Let’s create a character vector to store the XML content:
# toy example with xml string
xml_string <- c(
  '<?xml version="1.0" encoding="UTF-8"?>',
  '<movies>',
  '<movie mins="126" lang="eng">',
  '<title>Good Will Hunting</title>',
  '<director>',
  '<first_name>Gus</first_name>',
  '<last_name>Van Sant</last_name>',
  '</director>',
  '<year>1998</year>',
  '<genre>drama</genre>',
  '</movie>',
  '<movie mins="106" lang="spa">',
  '<title>Y tu mama tambien</title>',
  '<director>',
  '<first_name>Alfonso</first_name>',
  '<last_name>Cuaron</last_name>',
  '</director>',
  '<year>2001</year>',
  '<genre>drama</genre>',
  '</movie>',
  '</movies>'
)
Let’s parse the content. To do this, we must first create a single contiguous xml string, which is done with paste() and its collapse = '' argument:
# parsing xml string
doc <- read_xml(paste(xml_string, collapse = ''))
doc
{xml_document}
<movies>
[1] <movie mins="126" lang="eng">\n <title>Good Will Hunting</title>\n <dir ...
[2] <movie mins="106" lang="spa">\n <title>Y tu mama tambien</title>\n <dir ...
And let’s navigate the tree structure. We begin with xml_root() to get access to the root node:
# root node
movies <- xml_root(doc)
movies
{xml_document}
<movies>
[1] <movie mins="126" lang="eng">\n <title>Good Will Hunting</title>\n <dir ...
[2] <movie mins="106" lang="spa">\n <title>Y tu mama tambien</title>\n <dir ...
It turns out that doc and movies are actually identical:
identical(doc, movies)
[1] TRUE
We use xml_length() to find out how many child nodes the root node contains:
# number of children of the root node
xml_length(doc)
[1] 2
This confirms what we know about the XML string: the root node contains two movie elements, one node for “Good Will Hunting” and another node for “Y tu mama tambien”.
The function xml_children() allows you to access the children nodes:
xml_children(doc)
{xml_nodeset (2)}
[1] <movie mins="126" lang="eng">\n <title>Good Will Hunting</title>\n <dir ...
[2] <movie mins="106" lang="spa">\n <title>Y tu mama tambien</title>\n <dir ...
Notice that the output is an object of class "xml_nodeset". To access a specific node, you use the function xml_child(). In this example, the node for movie “Good Will Hunting” corresponds to the first node, and we pass this value to the search argument: