40 Basics of XML and HTML
The goal of this chapter is to give you a crash introduction to XML and HTML, so you can get a good grasp of this format for the next chapter: Web and HTTP.
Large amounts of data and information are stored, shared and distributed using HTML and XML-dialects.
They are widely adopted and used in many applications.
Working with data from the Web means dealing with HTML, which is an XML dialect.
40.1 What is XML?
XML stands for eXtensible Markup Language
Let’s disect the meaning of this acronym. On one hand, XML is a markup language. which means, XML defines a set of rules for encoding information in a format that is both human-readable and machine-readable.
Compared to other types of markup languages (e.g LaTeX, Markdown), XML is used to describe data. To be more precise, XML is a standard for the semantic, hierarchical representation of data. This is an important aspect of XML and any of its dialects, because data is represented following a hierarchy.
For instance, one way to organize data is in a table. Conceptually, all elements are stored in cells of a grid structure of rows and columns. Another way to organize data is with hierarchies, that can be visually represented with tree like structures. This latter form of organizing data is what XML uses.
The second aspect, “extensible”, means that we can define any number of new formats to represent any kind of data. Therefore, it is extensible. This is a very interesting aspect of XML because it provides a flexible framework to create new formats for describing and representing data.
40.1.1 Marks in XML
In XML (as well as in HTML) the marks (also known as tags) are defined using
angle brackets: < >
.
For example:
<mark>Text marked with special tag</mark>
The concept of extensibility means that we can define our own marks, the order in which they occur, and how they should be processed. For example we could define marks such as:
<my_mark>
<awesome>
<boring>
<pathetic>
Before moving on, we should mention that XML is NOT:
- a programming language
- a network transfer protocol
- a database
Instead, XML is:
- more than a markup language
- a generic language that provides structure and syntax for representing any type of information
- a meta-language: it allows us to create or define other languages
Here are some famous examples of XML dialects:
KML (Keyhole Markup Language) for describing geo-spatial information used in Google Earth, Google Maps, Google Sky
SVG (Scalable Vector Graphics) for visual graphical displays of two-dimensional graphics with support for interactivity and animation
PMML (Predictive Model Markup Language) for describing and exchanging models produced by data mining and machine learning algorithms
RSS (Rich Site Summary) feeds for publishing blog entries
SDMX (Statistical Data and Metadata Exchange) for organizing and exchanging statistical information
SBML (Systems Biology Markup Language) for describing biological systems
40.1.2 Minimalist Example
Let’s see an ultra simple XML example:
<movie>
Good Will Hunting
</movie>
- one single element movie
- start-tag:
<movie>
- end-tag:
</movie>
- content:
Good Will Hunting
XML elements can have attributes, for example:
<movie mins="126" lang="en">
Good Will Hunting
</movie>
- attributes:
mins
(minutes) andlang
(language) - attributes are attached to the element’s start tag
- attribute values must be quoted!
XML elements may contain other elements, for example:
<movie mins="126" lang="en">
<title>Good Will Hunting</title>
<director>Gus Van Sant</director>
<year>1998</year>
<genre>drama</genre>
</movie>
- an xml element may contain other elements
- movie contains several elements: title, director, year, genre
As you can tell, the xml element movie has a now a hierarchy. We can make it more interesting by including more elements inside director.
<movie mins="126" lang="en">
<title>Good Will Hunting</title>
<director>
<first_name>Gus</first_name>
<last_name>Van Sant</last_name>
</director>
<year>1998</year>
<genre>drama</genre>
</movie>
Formally, we say that director has two child elements: first_name
and
last_name
.
Let’s consider the following abstract XML example:
<Root>
<child_1>...</child_1>
<child_2>...</child_2>
<subchild>...</subchild>
<child_3>...</child_3>
</Root>
- An XML document can be represented with a tree structure
- An XML document must have one single Root element
- The
Root
may containchild
elements - A
child
element may containsubchild
elements
40.1.3 Well Formedness
We say that an XML document is well-formed when it obeys the basic syntax rules of XML. Some of those rules are:
- one root element containing the rest of elements
- properly nested elements
- self-closing tags
- attributes appear in start-tags of elements
- attribute values must be quoted
- element names and attribute names are case sensitive
Does it matter if an XML document is not Well-formed? Not well-formed XML documents produce potentially fatal errors or warnings when parsed.
Keep in mind that documents may be well-formed but not valid. Well-formed just guarantees that the document meets the basic XML structure, not that the content is valid.
40.1.4 Additional XML Elements
Some Additional Elements
<?xml version="1.0"? encoding="UTF-8" ?>
<![CDATA[ a > 5 & b < 10 ]]>
<?GS print(format = TRUE)>
<!DOCTYPE Movie>
<!-- This is a commet -->
<movie mins="126" lang="en">
<title>Good Will Hunting</title>
<director>
<first_name>Gus</first_name>
<last_name>Van Sant</last_name>
</director>
<year>1998</year>
<genre>drama</genre>
</movie>
The following table lists some of the common additional XML elements:
Markup | Name | Description |
---|---|---|
<?xml > |
XML Declaration | Identifies content as an XML document |
<?PI > |
Processing Instruction | Processing instructions passed to application PI |
<!DOCTYPE > |
Document-type Declaration | Defines the structure of an XML document |
<![CDATA[ ]]> |
CDATA Character Data | Anything inside a CDATA is ignored by the parser |
<!-- --> |
Comment | For writing comments |
40.1.5 Another Example
Let’s go back to the movie example, but now let’s see how the content of our hypothetical XML document should look like:
<?xml version="1.0"?>
<!DOCTYPE movies>
<movie mins="126" lang="en">
<!-- this is a comment -->
<title>Good Will Hunting</title>
<director>
<first_name>Gus</first_name>
<last_name>Van Sant</last_name>
</director>
<year>1998</year>
<genre>drama</genre>
</movie>
Each Node can have
- a Name
- any number of attributes
- optional content
- other nested elements
40.1.6 Wrapping-Up
About XML
- designed to store and transfer data
- designed to be self-descriptive
- tags are not predefined and can be extended
- a generic language that provides structure and syntax for many markup dialects
- is a syntax or format for defining markup languages
- a standard for the semantic, hierarchical representation of data
- provides a general approach for representing all types of information dialects
40.2 A quick introduction to HTML
HTML is not a programming language; it is simply a markup language, which means it is a syntax for identifying and describing the elements of a document such as headings, paragraphs, lists, tables, images, hyperlinks, etc. Technically, HTML is an XML dialect.
Say we visit R’s official website (screencapture below).
The visually rich and interactive pages we see on the Web are based on plain text files referred to as source files. To look at the actual HTML content behind R’s homepage, you need to get access to the source code option in your browser. If you are using Chrome, go to the View tab in the menu bar, then choose the Developer option, and finally click on View Source.
If we take a look at the source file behind R’s homepage, we’ll discover the actual HTML content, depicted in the image below.
As you can tell, the webpage is cleverly rendered by your browser that knows exactly how to take care of the content in the source file. If you are not familiar with HTML, some (if not most) of the text will look like gibberish to you right now. But it all has a specific structure and meaning.
What you see on the browser is the result of the resources served by the server
where R’s website is stored. Technically speaking, the resources should include
an index.html
file, plus other files (stylesheet files, and image files)
In particular, the following resources (different types of files) can be identified:
index.html
favicon-33x32.png
favicon-16x16.png
bootstrap.min.css
R.css
Rlogo.png
40.2.1 HTML document structure
Let’s study the structure of a basic HTML document. Below is a diagram with a simplified content of R’s webpage.
The first line of text is the document type declaration, which identifies this document as an HTML5 document. Then we have the html element which is the root element of the document, and it contains all the other elements.
Within the html element, we find two elements: the head and the body. The head element contains descriptive information such as the title, style sheets, scripts, and other meta information. The mandatory element inside the head is the title.
The body element contains everything that is displayed in the browser.
40.2.2 HTML Syntax
You don’t need to memorize all possible HTML elements (or tags), but it’s important that you learn about their syntax and structure. So let’s describe the anatomy of html elements.
Here’s an example with a <p>
element which is the paragraph element.
An HTML tag has an opening tag consisting of the tag name surrounded by angle
brackets, that is, the <p>
characters.
Usually, you put tags around some content text. At the end of the tag there
is the closing tag, in this case </p>
. You know it’s a closing tag because
it comens after the content, and it has a slash /
before the p
name. All
closing tags have a slash in them.
Not all tags come in the form of a pair of matching tags (an opening and a
closing tag). There are some tags that don’t have a closing tag. Perhaps the
most common tag of this type is the <img>
tag used for images. One example
is the <img>
tag for the R logo file in the homepage of R project:
<img src="/Rlogo.png"/>
As you can tell, the <img>
tag does not have a closing tag; you can say
that itself closes with a slash and the right angle bracket />
.
Some elements have attributes which allows you to specify additional information about an element. Attributes are declared inside the opening tag using special keywords. We assign values to attributes with the equals sign, and we specify the values inside quotations.
In the example above, a paragraph tag contains an attribute lang
for
language with a value of es
for español or spanish.
Notice also that the previous <img>
element has an attribute src
to indicate the source filename of the picture, in this case, "/Rlogo.png"
.
40.2.3 What the browser does
The browser (e.g. Chrome, Safari, Firefox) reads the HTML, interprets all the
tags, and renders the content accordingly. Recall that tags tell browser about
the structure and meaning of the text. The browser identifies what parts are
headings (e.g. <h1>
, <h2>
), what parts are paragraphs (e.g. <p>
),
what parts are lists (e.g. <ol>
, <ul>
), what text needs to be emphasized,
and so on.
The HTML syntax tells the browser about the structure of a document: where the headings are, where the paragraphs are, what text is part of a list, and so on. How do browsers know this? Well, they have built-in default rules for how to render HTML elements. In addition to the default settings, HTML elements can be formatted in endless ways using what is called Cascade Style Sheets or CSS for short, that determine font types, colors, sizes, and many other visual aspects of a page.
40.2.4 Web Scraping
Many websites are secured by an SSL/TSL certificate, which you can identify by
looking at the URL containing https
(Hyper Text Transfer Protocol Secure).
SSL stands for Secure Sockets Layer. This is a technology that
keeps an internet connection secure and safeguards sensitive data that is being
sent between a client and a server (for example, when you use your browser
to shop in amazon) or server to server (for example, an application with
payroll information). The SSL technology is currently deprecated and has been
replaced entirely by TLS which stands for Transport Layer Security. Simply
put, TSL also ensures data privacy the same way that SSL does. Since SSL is
actually no longer used, this is the correct term that people should start using.
HTTPS is a secure extension of HTTP. When a website uses HTTPS it means that the website is secured by an SSL/TLS certificate. Consequently, websites that install and configure an SSL/TLS certificate can use the HTTPS protocol to establish a secure connection with the server. Quote: “The details of the certificate, including the issuing authority and the corporate name of the website owner, can be viewed by clicking on the lock symbol on the browser bar.”
Wikipedia uses HTTPS. For instance, if we visit the entry for men’s long jump world record progression, the url is
https://en.wikipedia.org/wiki/Men%27s_long_jump_world_record_progression
If we try to use functions like readHTMLTable
from "XML"
package, it will
fail
wiki <- 'https://en.wikipedia.org/wiki/Men%27s_long_jump_world_record_progression'
# this fails
tbls <- readHTMLTable(wiki)
One option to read the html tables and extract them as R data frames, is to
first download the html file to your computer, and then use readHTMLTable()
to scrape the tables:
# desired url
wiki <- 'https://en.wikipedia.org/wiki/Men%27s_long_jump_world_record_progression'
# destination file
jump_html <- 'men-long-jump-records.html'
# download file to your working directory
download.file(wiki, jump_html)
tbls <- readHTMLTable(jump_html)
We recommend using this option when:
- the data fits in your computer, in this way you also have the raw data
- you need to experiment and get to know the content, in order to decide which elements you will extract, which functions to use, what kind of processing operations or transformations you need to apply, etc.
- also, downloding an HTML document save you from making innecessary requests that could get in trouble, and potentially be blocked by a server because you are overloading them with multiple requests.
Comments
Before moving on, we want to clarify some key terms.
A markup is a sequence of characters or other symbols inserted at certain places in a document to indicate either:
A Markup Language is a system for annotating (i.e. marking) a document in a way that the content is distinguished from its representation (e.g. LaTeX, PostScript, HTML, SVG)