41 Parsing XML and HTML

The goal of this chapter is to describe how to parse XML / HTML content with the R package "xml2".

You will need the following packages:

library(xml2)
library(stringr)

We’ll cover the following topics:

  • R package "xml2"
  • Navigating the XML tree structure
  • Main functions in package "xml2"
  • XPath

41.0.1 What is parsing?

Getting data from the web often involves reading and processing content from XML and HTML documents. This is known as parsing.

The dictionary defines “parse” as:

analyze (a sentence) into its parts and describe their syntactic roles.

In computing, “parse” is defined as:

analyze (a string or text) into logical syntactic components, typically in order to test conformability to a logical grammar.

an act of or the result obtained by parsing a string or a text.

According to Wikipedia:

A parser is a software component that takes input data (frequently text) and builds a data structure —often some kind of parse tree, abstract syntax tree or other hierarchical structure— giving a structural representation of the input, checking for correct syntax in the process

41.1 R package "xml2"

The package "xml2" is designed for one major purpose, namely, to parse XML and HTML content. Remember that HTML, like XML, is a tag-based markup language with a tree structure (strictly speaking, only XHTML is a true XML dialect).

As of this writing, "xml2" has minimal functionality for writing content in XML. Hadley Wickham has mentioned that he plans to add more functions for writing XML, so future versions of "xml2" may include more of this functionality. Having said that, we will focus exclusively on reading XML content.

We’ll cover four major types of tasks that we can perform with "xml2":

  • parsing (i.e. reading) XML / HTML content
  • obtaining descriptive information about parsed contents
  • navigating the tree structure (i.e. accessing its components)
  • querying and extracting data from parsed contents

41.1.1 Parsing Functions

There are two main parsing functions:

  • read_xml()

  • read_html()

For XML files in general, you should use read_xml(). For HTML files, it is better to use read_html() because it is more robust and can handle HTML files that are not well formed, which are common in practice.
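The difference in robustness can be seen with a small sketch. The HTML string below is made up for illustration: its paragraphs are never closed, so it is not well-formed XML. Assuming the default parsing options, read_html() repairs the markup while read_xml() stops with an error, which is caught here with tryCatch().

```r
library(xml2)

# hypothetical HTML fragment with unclosed <p> tags (not well formed)
messy <- "<html><body><p>First paragraph<p>Second paragraph</body></html>"

# read_html() is lenient: it closes the open tags while parsing
page <- read_html(messy)
paragraphs <- xml_find_all(page, "//p")
xml_text(paragraphs)

# read_xml() is strict: the same input raises a parsing error,
# which we turn into a message instead of stopping the script
xml_result <- tryCatch(
  read_xml(messy),
  error = function(e) "read_xml() failed"
)
xml_result
```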

The main input for these reading functions is either a string, an R connection, or a raw vector.

The string can be either a path, a URL or literal XML. URLs will be converted into connections either using base::url() or, if installed, curl::curl(). Local paths ending in .gz, .bz2, .xz, .zip will be automatically uncompressed.
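As a minimal sketch of these input types, the code below parses the same content three ways: from a literal string, from a raw vector, and from a local file path. The <note> content and the object names are invented for this illustration, and the file is a temporary one created on the fly.

```r
library(xml2)

# hypothetical XML content used only for this illustration
xml_string <- "<note><to>Ana</to><body>Hello</body></note>"

# 1) literal XML passed as a character string
doc_from_string <- read_xml(xml_string)

# 2) the same content as a raw vector
doc_from_raw <- read_xml(charToRaw(xml_string))

# 3) a path to a local file (a temporary file in this sketch)
tmp <- tempfile(fileext = ".xml")
writeLines(xml_string, tmp)
doc_from_file <- read_xml(tmp)

# all three inputs give a document with the same root node
xml_name(doc_from_string)
xml_name(doc_from_raw)
xml_name(doc_from_file)
```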

Both read_xml() and read_html() return an object of class "xml_document".

Let’s see an example. Consider the following XML content, taken from the previous chapter:

<movie mins="126" lang="en">
  <title>Good Will Hunting</title>
  <director>
    <first_name>Gus</first_name>
    <last_name>Van Sant</last_name>
  </director>
  <year>1998</year>
  <genre>drama</genre>
</movie>

For illustration purposes, let’s take the XML content (leaving out the attributes of the movie element for simplicity), treat it as a single character string, and pass it to read_xml():

# toy example with xml string
movie <- read_xml(
"<movie>
<title>Good Will Hunting</title>
<director>
<first_name>Gus</first_name>
<last_name>Van Sant</last_name>
</director>
<year>1998</year>
<genre>drama</genre>
</movie>")

movie
#> {xml_document}
#> <movie>
#> [1] <title>Good Will Hunting</title>
#> [2] <director>\n  <first_name>Gus</first_name>\n  <last_name>Van Sant</last_n ...
#> [3] <year>1998</year>
#> [4] <genre>drama</genre>

As mentioned above, movie is an object of class "xml_document":

class(movie)
#> [1] "xml_document" "xml_node"

This type of object has an internal structure that preserves the hierarchical tree structure of the XML content.
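One way to look at that hierarchy is with xml_structure(), a function from "xml2" that is not discussed in this section. The sketch below rebuilds the movie document so that it runs on its own, and prints the nesting of its nodes.

```r
library(xml2)

# same movie content as above, written on one line
movie <- read_xml(
  "<movie><title>Good Will Hunting</title><director><first_name>Gus</first_name><last_name>Van Sant</last_name></director><year>1998</year><genre>drama</genre></movie>"
)

# print the tree: each level of indentation is one level of nesting
xml_structure(movie)
```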

41.1.2 Working with parsed documents

Having parsed an XML / HTML document, we can use 2 main functions to start working on the tree structure:

  • xml_root() gets access to the root node and its elements

  • xml_children() gets access to the children nodes of a given node
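These two functions can be tried on the movie example. The sketch below rebuilds the document so that it is self-contained; the object names root and kids are just illustrative.

```r
library(xml2)

# same movie content as in the previous section, written on one line
movie <- read_xml(
  "<movie><title>Good Will Hunting</title><director><first_name>Gus</first_name><last_name>Van Sant</last_name></director><year>1998</year><genre>drama</genre></movie>"
)

# access the root node
root <- xml_root(movie)
xml_name(root)
#> [1] "movie"

# access the child nodes of the root
kids <- xml_children(root)
xml_name(kids)
#> [1] "title"    "director" "year"     "genre"

# number of child nodes of the root
xml_length(root)
#> [1] 4
```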

Here’s a table with the main navigation functions:

Function         Description
xml_root()       Returns the root node
xml_children()   Returns the child nodes (elements only)
xml_child()      Returns a specific child, selected by position or by name
xml_name()       Returns the name of a node
xml_contents()   Returns all the nodes inside a node, including text nodes
xml_text()       Returns the text of a node
xml_length()     Returns the number of child nodes
xml_parents()    Returns all the ancestor nodes (parent, grandparent, and so on)
xml_siblings()   Returns the sibling nodes
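The following sketch applies several of these functions to the movie example. It rebuilds the document so that it runs on its own; the object names (director, title, first_name) are chosen only for this illustration.

```r
library(xml2)

movie <- read_xml(
  "<movie><title>Good Will Hunting</title><director><first_name>Gus</first_name><last_name>Van Sant</last_name></director><year>1998</year><genre>drama</genre></movie>"
)

# select a child by position or by name
director <- xml_child(movie, 2)
title <- xml_child(movie, "title")
xml_name(director)
#> [1] "director"

# text stored in a node
xml_text(title)
#> [1] "Good Will Hunting"

# xml_children() only returns elements, xml_contents() also returns text nodes
length(xml_children(title))
#> [1] 0
length(xml_contents(title))
#> [1] 1

# siblings of the director node
xml_name(xml_siblings(director))
#> [1] "title" "year"  "genre"

# names of all the ancestors of the first_name node
first_name <- xml_child(director, "first_name")
ancestors <- xml_name(xml_parents(first_name))
ancestors
```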

Here’s some content: a movie element in XML syntax.

Figure 41.1: XML Movie

The following figure identifies the main nodes:

Figure 41.2: XML Movie nodes

Below is an abstract representation of an XML file and its main nodes, together with the "xml2" functions that access them:

Figure 41.3: Functions of xml2