40 Basics of XML and HTML

The goal of this chapter is to give you a crash introduction to XML and HTML, so you can get a good grasp of this format for the next chapter: Web and HTTP.

  • Large amounts of data and information are stored, shared and distributed using HTML and XML-dialects.

  • They are widely adopted and used in many applications.

  • Working with data from the Web means dealing with HTML, which is an XML dialect.

40.1 What is XML?

XML stands for eXtensible Markup Language

Let’s disect the meaning of this acronym. On one hand, XML is a markup language. which means, XML defines a set of rules for encoding information in a format that is both human-readable and machine-readable.

Compared to other types of markup languages (e.g LaTeX, Markdown), XML is used to describe data. To be more precise, XML is a standard for the semantic, hierarchical representation of data. This is an important aspect of XML and any of its dialects, because data is represented following a hierarchy.

For instance, one way to organize data is in a table. Conceptually, all elements are stored in cells of a grid structure of rows and columns. Another way to organize data is with hierarchies, that can be visually represented with tree like structures. This latter form of organizing data is what XML uses.

The second aspect, “extensible”, means that we can define any number of new formats to represent any kind of data. Therefore, it is extensible. This is a very interesting aspect of XML because it provides a flexible framework to create new formats for describing and representing data.

Comments

Before moving on, we want to clarify some key terms.

A markup is a sequence of characters or other symbols inserted at certain places in a document to indicate either:

  • how the content should be displayed when printed or in screen
  • describe the document’s structure

A Markup Language is a system for annotating (i.e. marking) a document in a way that the content is distinguished from its representation (e.g. LaTeX, PostScript, HTML, SVG)

40.1.1 Marks in XML

In XML (as well as in HTML) the marks (also known as tags) are defined using angle brackets: < >.

For example:

<mark>Text marked with special tag</mark>

The concept of extensibility means that we can define our own marks, the order in which they occur, and how they should be processed. For example we could define marks such as:

  • <my_mark>
  • <awesome>
  • <boring>
  • <pathetic>

Before moving on, we should mention that XML is NOT:

  • a programming language
  • a network transfer protocol
  • a database

Instead, XML is:

  • more than a markup language
  • a generic language that provides structure and syntax for representing any type of information
  • a meta-language: it allows us to create or define other languages

Here are some famous examples of XML dialects:

  • KML (Keyhole Markup Language) for describing geo-spatial information used in Google Earth, Google Maps, Google Sky

  • SVG (Scalable Vector Graphics) for visual graphical displays of two-dimensional graphics with support for interactivity and animation

  • PMML (Predictive Model Markup Language) for describing and exchanging models produced by data mining and machine learning algorithms

  • RSS (Rich Site Summary) feeds for publishing blog entries

  • SDMX (Statistical Data and Metadata Exchange) for organizing and exchanging statistical information

  • SBML (Systems Biology Markup Language) for describing biological systems

40.1.2 Minimalist Example

Let’s see an ultra simple XML example:

<movie>
  Good Will Hunting
</movie>
  • one single element movie
  • start-tag: <movie>
  • end-tag: </movie>
  • content: Good Will Hunting

XML elements can have attributes, for example:

<movie mins="126" lang="en">
  Good Will Hunting
</movie>
  • attributes: mins (minutes) and lang (language)
  • attributes are attached to the element’s start tag
  • attribute values must be quoted!

XML elements may contain other elements, for example:

<movie mins="126" lang="en">
  <title>Good Will Hunting</title>
  <director>Gus Van Sant</director>
  <year>1998</year>
  <genre>drama</genre>
</movie>
  • an xml element may contain other elements
  • movie contains several elements: title, director, year, genre

As you can tell, the xml element movie has a now a hierarchy. We can make it more interesting by including more elements inside director.

<movie mins="126" lang="en">
  <title>Good Will Hunting</title>
  <director>
    <first_name>Gus</first_name>
    <last_name>Van Sant</last_name>
  </director>
  <year>1998</year>
  <genre>drama</genre>
</movie>

Formally, we say that director has two child elements: first_name and last_name.

Let’s consider the following abstract XML example:

<Root>
  <child_1>...</child_1>
  <child_2>...</child_2>
    <subchild>...</subchild>
  <child_3>...</child_3>
</Root>
  • An XML document can be represented with a tree structure
  • An XML document must have one single Root element
  • The Root may contain child elements
  • A child element may contain subchild elements
XML tree structure

Figure 40.1: XML tree structure

40.1.3 Well Formedness

We say that an XML document is well-formed when it obeys the basic syntax rules of XML. Some of those rules are:

  • one root element containing the rest of elements
  • properly nested elements
  • self-closing tags
  • attributes appear in start-tags of elements
  • attribute values must be quoted
  • element names and attribute names are case sensitive

Does it matter if an XML document is not Well-formed? Not well-formed XML documents produce potentially fatal errors or warnings when parsed.

Keep in mind that documents may be well-formed but not valid. Well-formed just guarantees that the document meets the basic XML structure, not that the content is valid.

40.1.4 Additional XML Elements

Some Additional Elements

<?xml version="1.0"? encoding="UTF-8" ?>
<![CDATA[ a > 5 & b < 10 ]]>
<?GS print(format = TRUE)>
<!DOCTYPE Movie>
<!-- This is a commet -->
<movie mins="126" lang="en">
  <title>Good Will Hunting</title>
  <director>
    <first_name>Gus</first_name>
    <last_name>Van Sant</last_name>
  </director>
  <year>1998</year>
  <genre>drama</genre>
</movie>

The following table lists some of the common additional XML elements:

Markup Name Description
<?xml > XML Declaration Identifies content as an XML document
<?PI > Processing Instruction Processing instructions passed to application PI
<!DOCTYPE > Document-type Declaration Defines the structure of an XML document
<![CDATA[ ]]> CDATA Character Data Anything inside a CDATA is ignored by the parser
<!-- --> Comment For writing comments

40.1.5 Another Example

Let’s go back to the movie example, but now let’s see how the content of our hypothetical XML document should look like:

<?xml version="1.0"?>
<!DOCTYPE movies>
<movie mins="126" lang="en">
  <!-- this is a comment -->
  <title>Good Will Hunting</title>
  <director>
    <first_name>Gus</first_name>
    <last_name>Van Sant</last_name>
  </director>
  <year>1998</year>
  <genre>drama</genre>
</movie>

Each Node can have

  • a Name
  • any number of attributes
  • optional content
  • other nested elements

40.1.6 Wrapping-Up

About XML

  • designed to store and transfer data
  • designed to be self-descriptive
  • tags are not predefined and can be extended
  • a generic language that provides structure and syntax for many markup dialects
  • is a syntax or format for defining markup languages
  • a standard for the semantic, hierarchical representation of data
  • provides a general approach for representing all types of information dialects

40.2 A quick introduction to HTML

HTML is not a programming language; it is simply a markup language, which means it is a syntax for identifying and describing the elements of a document such as headings, paragraphs, lists, tables, images, hyperlinks, etc. Technically, HTML is an XML dialect.

Say we visit R’s official website (screencapture below).

R project's home page

Figure 40.2: R project’s home page

The visually rich and interactive pages we see on the Web are based on plain text files referred to as source files. To look at the actual HTML content behind R’s homepage, you need to get access to the source code option in your browser. If you are using Chrome, go to the View tab in the menu bar, then choose the Developer option, and finally click on View Source.

View source code of a webpage in Chrome

Figure 40.3: View source code of a webpage in Chrome

If we take a look at the source file behind R’s homepage, we’ll discover the actual HTML content, depicted in the image below.

HTML source code behind R project's home page

Figure 40.4: HTML source code behind R project’s home page

As you can tell, the webpage is cleverly rendered by your browser that knows exactly how to take care of the content in the source file. If you are not familiar with HTML, some (if not most) of the text will look like gibberish to you right now. But it all has a specific structure and meaning.

What you see on the browser is the result of the resources served by the server where R’s website is stored. Technically speaking, the resources should include an index.htmlfile, plus other files (stylesheet files, and image files)

Other resources being linked in the home page

Figure 40.5: Other resources being linked in the home page

In particular, the following resources (different types of files) can be identified:

  • index.html
  • favicon-33x32.png
  • favicon-16x16.png
  • bootstrap.min.css
  • R.css
  • Rlogo.png

40.2.1 HTML document structure

Let’s study the structure of a basic HTML document. Below is a diagram with a simplified content of R’s webpage.

HTML document structure

Figure 40.6: HTML document structure

The first line of text is the document type declaration, which identifies this document as an HTML5 document. Then we have the html element which is the root element of the document, and it contains all the other elements.

Within the html element, we find two elements: the head and the body. The head element contains descriptive information such as the title, style sheets, scripts, and other meta information. The mandatory element inside the head is the title.

The body element contains everything that is displayed in the browser.

40.2.2 HTML Syntax

You don’t need to memorize all possible HTML elements (or tags), but it’s important that you learn about their syntax and structure. So let’s describe the anatomy of html elements.

Here’s an example with a <p> element which is the paragraph element. An HTML tag has an opening tag consisting of the tag name surrounded by angle brackets, that is, the <p> characters.

Usually, you put tags around some content text. At the end of the tag there is the closing tag, in this case </p>. You know it’s a closing tag because it comens after the content, and it has a slash / before the p name. All closing tags have a slash in them.

Anatomy of html elements

Figure 40.7: Anatomy of html elements

Not all tags come in the form of a pair of matching tags (an opening and a closing tag). There are some tags that don’t have a closing tag. Perhaps the most common tag of this type is the <img> tag used for images. One example is the <img> tag for the R logo file in the homepage of R project:

<img src="/Rlogo.png"/>

As you can tell, the <img> tag does not have a closing tag; you can say that itself closes with a slash and the right angle bracket />.

Some elements have attributes which allows you to specify additional information about an element. Attributes are declared inside the opening tag using special keywords. We assign values to attributes with the equals sign, and we specify the values inside quotations.

Attributes and values in html tags

Figure 40.8: Attributes and values in html tags

In the example above, a paragraph tag contains an attribute lang for language with a value of es for español or spanish.

Notice also that the previous <img> element has an attribute src to indicate the source filename of the picture, in this case, "/Rlogo.png".

40.2.3 What the browser does

The browser (e.g. Chrome, Safari, Firefox) reads the HTML, interprets all the tags, and renders the content accordingly. Recall that tags tell browser about the structure and meaning of the text. The browser identifies what parts are headings (e.g. <h1>, <h2>), what parts are paragraphs (e.g. <p>), what parts are lists (e.g. <ol>, <ul>), what text needs to be emphasized, and so on.

The HTML syntax tells the browser about the structure of a document: where the headings are, where the paragraphs are, what text is part of a list, and so on. How do browsers know this? Well, they have built-in default rules for how to render HTML elements. In addition to the default settings, HTML elements can be formatted in endless ways using what is called Cascade Style Sheets or CSS for short, that determine font types, colors, sizes, and many other visual aspects of a page.

40.2.4 Web Scraping

Many websites are secured by an SSL/TSL certificate, which you can identify by looking at the URL containing https (Hyper Text Transfer Protocol Secure). SSL stands for Secure Sockets Layer. This is a technology that keeps an internet connection secure and safeguards sensitive data that is being sent between a client and a server (for example, when you use your browser to shop in amazon) or server to server (for example, an application with payroll information). The SSL technology is currently deprecated and has been replaced entirely by TLS which stands for Transport Layer Security. Simply put, TSL also ensures data privacy the same way that SSL does. Since SSL is actually no longer used, this is the correct term that people should start using.

HTTPS is a secure extension of HTTP. When a website uses HTTPS it means that the website is secured by an SSL/TLS certificate. Consequently, websites that install and configure an SSL/TLS certificate can use the HTTPS protocol to establish a secure connection with the server. Quote: “The details of the certificate, including the issuing authority and the corporate name of the website owner, can be viewed by clicking on the lock symbol on the browser bar.”

Wikipedia uses HTTPS. For instance, if we visit the entry for men’s long jump world record progression, the url is

https://en.wikipedia.org/wiki/Men%27s_long_jump_world_record_progression

If we try to use functions like readHTMLTable from "XML" package, it will fail

wiki <- 'https://en.wikipedia.org/wiki/Men%27s_long_jump_world_record_progression'

# this fails
tbls <- readHTMLTable(wiki)

One option to read the html tables and extract them as R data frames, is to first download the html file to your computer, and then use readHTMLTable() to scrape the tables:

# desired url
wiki <- 'https://en.wikipedia.org/wiki/Men%27s_long_jump_world_record_progression'

# destination file
jump_html <- 'men-long-jump-records.html'

# download file to your working directory
download.file(wiki, jump_html)

tbls <- readHTMLTable(jump_html)

We recommend using this option when:

  • the data fits in your computer, in this way you also have the raw data
  • you need to experiment and get to know the content, in order to decide which elements you will extract, which functions to use, what kind of processing operations or transformations you need to apply, etc.
  • also, downloding an HTML document save you from making innecessary requests that could get in trouble, and potentially be blocked by a server because you are overloading them with multiple requests.