4 Basics of XML

The goal of this chapter is to give you a crash course in XML so that you have a good grasp of this format for the rest of the book. XML is worth learning for several reasons:

  • Large amounts of data and information are stored, shared and distributed using XML dialects.

  • XML dialects are widely adopted and used in many applications.

  • Working with data from the Web often means dealing with some kind of XML dialect.

4.1 What is XML?

XML stands for eXtensible Markup Language.

Let’s dissect the meaning of this acronym. First, XML is a markup language, which means that it defines a set of rules for encoding information in a format that is both human-readable and machine-readable.

Unlike markup languages such as LaTeX or Markdown, which mainly describe how a document is formatted, XML is used to describe data. To be more precise, XML is a standard for the semantic, hierarchical representation of data. This is an important aspect of XML and any of its dialects: the data is always organized as a hierarchy.

For instance, one way to organize data is in a table. Conceptually, all elements are stored in the cells of a grid of rows and columns. Another way to organize data is with hierarchies, which can be visually represented with tree-like structures. This latter form of organizing data is what XML uses.

The second aspect, “extensible”, means that we can define any number of new formats to represent any kind of data. This is a very interesting aspect of XML because it provides a flexible framework to create new formats for describing and representing data.

Comments

Before moving on, we want to clarify some key terms.

Markup is a sequence of characters or other symbols inserted at certain places in a document to indicate either:

  • how the content should be displayed when printed or on screen
  • how the document is structured

A Markup Language is a system for annotating (i.e. marking) a document in a way that the content is distinguished from its representation (e.g. LaTeX, PostScript, HTML, SVG).

4.1.1 Marks in XML

In XML (as well as in HTML) the marks (also known as tags) are defined using angle brackets: < >.

For example:

<mark>Text marked with special tag</mark>

The concept of extensibility means that we can define our own marks, the order in which they occur, and how they should be processed. For example, we could define marks such as:

  • <my_mark>
  • <awesome>
  • <boring>
  • <pathetic>
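To see that made-up marks like these are perfectly acceptable, here is a minimal sketch in Python. It assumes Python's standard library module xml.etree.ElementTree as the parser (our choice for illustration, not something XML itself prescribes), and the small document is a hypothetical one built from the marks listed above.

```python
import xml.etree.ElementTree as ET

# A hypothetical document that uses only marks we invented ourselves.
xml_text = (
    "<my_mark>"
    "<awesome>XML</awesome>"
    "<boring>spreadsheets</boring>"
    "<pathetic>typos</pathetic>"
    "</my_mark>"
)

# The parser has no list of predefined tags: any well-formed name is accepted.
root = ET.fromstring(xml_text)

# iter() walks the root and all of its descendants in document order.
tags = [element.tag for element in root.iter()]
print(tags)
```

The parser only checks the syntax. What the marks mean, and how they should be processed, is up to whoever defines the dialect.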

Before moving on, we should mention that XML is NOT:

  • a programming language
  • a network transfer protocol
  • a database

Instead, XML is:

  • more than a markup language
  • a generic language that provides structure and syntax for representing any type of information
  • a meta-language: it allows us to create or define other languages

Here are some famous examples of XML dialects:

  • KML (Keyhole Markup Language) for describing geo-spatial information used in Google Earth, Google Maps, Google Sky

  • SVG (Scalable Vector Graphics) for visual graphical displays of two-dimensional graphics with support for interactivity and animation

  • PMML (Predictive Model Markup Language) for describing and exchanging models produced by data mining and machine learning algorithms

  • RSS (Rich Site Summary) feeds for publishing blog entries

  • SDMX (Statistical Data and Metadata Exchange) for organizing and exchanging statistical information

  • SBML (Systems Biology Markup Language) for describing biological systems

4.1.2 Minimalist Example

Let’s consider a handful of XML examples using one of my favorite movies: Good Will Hunting, a 1997 American psychological drama film directed by Gus Van Sant, and written by Ben Affleck and Matt Damon.

Figure 4.1: Good Will Hunting (Directed by Gus Van Sant, 1997)

Ultra Simple example

Let’s see an ultra simple XML example:

<movie>
  Good Will Hunting
</movie>

  • one single element: movie
  • start-tag: <movie>
  • end-tag: </movie>
  • content: Good Will Hunting
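As a quick check of these parts, the following sketch parses the same one-element document. It assumes Python's standard library module xml.etree.ElementTree, which is our choice for illustration; the variable names are ours as well.

```python
import xml.etree.ElementTree as ET

# The ultra simple document, held in a Python string.
xml_text = """<movie>
  Good Will Hunting
</movie>"""

# fromstring() parses the text and returns the root element.
movie = ET.fromstring(xml_text)

name = movie.tag              # the element name, taken from the start-tag
content = movie.text.strip()  # the content, without the surrounding whitespace

print(name)
print(content)
```

Note that the line breaks and indentation between the tags are part of the content, which is why we call strip() before printing it.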

Elements with attributes

XML elements can have attributes, for example:

<movie mins="126" lang="en">
  Good Will Hunting
</movie>

  • attributes: mins (minutes) and lang (language)
  • attributes are attached to the element’s start tag
  • attribute values must be quoted!
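Here is a small sketch of how attributes look from the point of view of a parser. It assumes Python's standard library module xml.etree.ElementTree (our choice for illustration), and it also tries an unquoted attribute value to show that the parser rejects it.

```python
import xml.etree.ElementTree as ET

xml_text = '<movie mins="126" lang="en">Good Will Hunting</movie>'
movie = ET.fromstring(xml_text)

# Attributes are exposed as a dictionary of strings.
attributes = movie.attrib
language = movie.get("lang")

# Attribute values are always text, so numbers must be converted explicitly.
minutes = int(movie.get("mins"))

# An unquoted attribute value breaks the syntax rules and cannot be parsed.
try:
    ET.fromstring("<movie mins=126>Good Will Hunting</movie>")
    unquoted_accepted = True
except ET.ParseError:
    unquoted_accepted = False

print(attributes, language, minutes, unquoted_accepted)
```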

Elements within other elements

XML elements may contain other elements, for example:

<movie mins="126" lang="en">
  <title>Good Will Hunting</title>
  <director>Gus Van Sant</director>
  <year>1997</year>
  <genre>drama</genre>
</movie>

  • an XML element may contain other elements

  • movie contains several elements: title, director, year, genre
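The containment can be inspected programmatically. The sketch below assumes Python's standard library module xml.etree.ElementTree (our choice for illustration) and loops over the elements that sit directly inside movie.

```python
import xml.etree.ElementTree as ET

xml_text = """<movie mins="126" lang="en">
  <title>Good Will Hunting</title>
  <director>Gus Van Sant</director>
  <year>1997</year>
  <genre>drama</genre>
</movie>"""

movie = ET.fromstring(xml_text)

# Iterating over an element yields the elements directly nested inside it.
contents = {child.tag: child.text for child in movie}

for name, text in contents.items():
    print(name, "->", text)
```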

More Embedded elements

As you can tell, the XML element movie now has a hierarchy. We can make it more interesting by including more elements inside director.

<movie mins="126" lang="en">
  <title>Good Will Hunting</title>
  <director>
    <first_name>Gus</first_name>
    <last_name>Van Sant</last_name>
  </director>
  <year>1997</year>
  <genre>drama</genre>
</movie>

Formally, we say that director has two child elements: first_name and last_name.
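The parent and child relationship is what lets us address a piece of data by its position in the hierarchy. As a sketch, assuming Python's standard library module xml.etree.ElementTree (our choice for illustration), we can reach the director's names by following the path from movie downwards.

```python
import xml.etree.ElementTree as ET

xml_text = """<movie mins="126" lang="en">
  <title>Good Will Hunting</title>
  <director>
    <first_name>Gus</first_name>
    <last_name>Van Sant</last_name>
  </director>
</movie>"""

movie = ET.fromstring(xml_text)

# find() returns the first child element with the given name.
director = movie.find("director")
child_names = [child.tag for child in director]

# A path such as "director/first_name" descends one level per step.
first_name = movie.findtext("director/first_name")
last_name = movie.findtext("director/last_name")

print(child_names, first_name, last_name)
```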

Tree Structure in XML

We can graphically display the structure of an XML document with a tree diagram, like the following one:

Figure 4.2: XML tree structure

  • An XML document can be represented with a tree structure

  • An XML document must have one single Root element

  • The Root may contain child elements

  • A child element may in turn contain its own child elements
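Because every element can only contain text and other elements, a short recursive function is enough to walk the whole tree. The sketch below assumes Python's standard library module xml.etree.ElementTree; the helper tree_lines is a name of our own, not part of any library.

```python
import xml.etree.ElementTree as ET

xml_text = """<movie mins="126" lang="en">
  <title>Good Will Hunting</title>
  <director>
    <first_name>Gus</first_name>
    <last_name>Van Sant</last_name>
  </director>
  <year>1997</year>
  <genre>drama</genre>
</movie>"""


def tree_lines(element, depth=0):
    """Return one line per element, indented by its depth in the tree."""
    lines = ["  " * depth + element.tag]
    for child in element:
        lines.extend(tree_lines(child, depth + 1))
    return lines


root = ET.fromstring(xml_text)
lines = tree_lines(root)
print("\n".join(lines))
```

The printed outline has movie at the top, its four children indented one level, and first_name and last_name indented two levels under director.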

4.1.3 Well Formedness

We say that an XML document is well-formed when it obeys the basic syntax rules of XML. Some of those rules are:

  • one root element containing all the other elements
  • properly nested elements
  • every start-tag has a matching end-tag (an empty element may instead use a single self-closing tag, e.g. <tag/>)
  • attributes appear in start-tags of elements
  • attribute values must be quoted
  • element names and attribute names are case sensitive

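Does it matter if an XML document is not well-formed? Yes. The XML specification treats violations of well-formedness as fatal errors, so a conforming parser reports the error and stops normal processing instead of guessing what was meant.

Keep in mind that a document may be well-formed but not valid. Well-formedness only guarantees that the document follows the basic XML syntax. Validity additionally requires that the document conforms to the rules of a particular dialect, typically declared in a DTD or a schema.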
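Any XML parser can be used to test well-formedness. Here is a minimal sketch that assumes Python's standard library module xml.etree.ElementTree; the helper is_well_formed and the sample strings are ours. Keep in mind that this parser checks syntax only and does not validate a document against a DTD or schema.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the text parses as XML, False if the parser rejects it."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


good = "<movie><title>Good Will Hunting</title></movie>"
bad_nesting = "<movie><title>Good Will Hunting</movie></title>"
two_roots = "<movie>Good Will Hunting</movie><movie>Milk</movie>"
case_mismatch = "<movie>Good Will Hunting</Movie>"

results = {
    "good": is_well_formed(good),
    "bad_nesting": is_well_formed(bad_nesting),
    "two_roots": is_well_formed(two_roots),
    "case_mismatch": is_well_formed(case_mismatch),
}
print(results)
```

Only the first string is accepted. The other three break the rules about nesting, the single root element, and case sensitivity of names, respectively.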

4.1.4 Additional XML Elements

Some Additional Elements

<?xml version="1.0" encoding="UTF-8"?>
<?GS print(format = TRUE)?>
<!DOCTYPE movie>
<!-- This is a comment -->
<movie mins="126" lang="en">
  <![CDATA[ a > 5 & b < 10 ]]>
  <title>Good Will Hunting</title>
  <director>
    <first_name>Gus</first_name>
    <last_name>Van Sant</last_name>
  </director>
  <year>1997</year>
  <genre>drama</genre>
</movie>

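The following table lists some of these common additional pieces of markup (strictly speaking they are not elements but declarations, processing instructions, CDATA sections, and comments):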

Markup           Name                        Description
<?xml ?>         XML Declaration             Identifies the content as an XML document
<?PI ?>          Processing Instruction      Instructions passed to the application identified by the target PI
<!DOCTYPE >      Document-type Declaration   Declares the document type; may reference a DTD that defines the structure of the document
<![CDATA[ ]]>    CDATA (Character Data)      Anything inside is treated as literal text, not parsed as markup
<!-- -->         Comment                     For writing comments
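To make the role of a CDATA section concrete, the sketch below parses the same text twice: once wrapped in CDATA and once with the special characters escaped by hand. It assumes Python's standard library module xml.etree.ElementTree, and the element name condition is a hypothetical one chosen for this example.

```python
import xml.etree.ElementTree as ET

# Bytes input, so the parser reads the encoding from the XML declaration.
with_cdata = b"""<?xml version="1.0" encoding="UTF-8"?>
<?GS print(format = TRUE)?>
<!-- This is a comment -->
<condition><![CDATA[ a > 5 & b < 10 ]]></condition>"""

# The same content without CDATA needs > & < written as entities.
with_entities = b"<condition> a &gt; 5 &amp; b &lt; 10 </condition>"

cdata_root = ET.fromstring(with_cdata)
entity_root = ET.fromstring(with_entities)

# Both forms give the application exactly the same character data.
print(repr(cdata_root.text))
print(repr(entity_root.text))
```

The declaration, the processing instruction, and the comment do not show up in the resulting tree: the default ElementTree parser skips them and only keeps elements, attributes, and text.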

4.1.5 Another Example

Let’s go back to the movie example, and see what the content of our hypothetical XML document should look like:

<?xml version="1.0"?>
<!DOCTYPE movie>
<movie mins="126" lang="en">
  <!-- this is a comment -->
  <title>Good Will Hunting</title>
  <director>
    <first_name>Gus</first_name>
    <last_name>Van Sant</last_name>
  </director>
  <year>1997</year>
  <genre>drama</genre>
</movie>

Each node (element) can have:

  • a name
  • any number of attributes
  • optional content
  • other nested elements
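These four ingredients are all we need to build a document from scratch. As a sketch, assuming Python's standard library module xml.etree.ElementTree (our choice for illustration), we can assemble part of the movie node piece by piece and then write it out as text.

```python
import xml.etree.ElementTree as ET

# A name and some attributes.
movie = ET.Element("movie", attrib={"mins": "126", "lang": "en"})

# A nested element with text content.
title = ET.SubElement(movie, "title")
title.text = "Good Will Hunting"

# A nested element that itself contains nested elements.
director = ET.SubElement(movie, "director")
ET.SubElement(director, "first_name").text = "Gus"
ET.SubElement(director, "last_name").text = "Van Sant"

# Serialize the tree to a string of XML markup.
xml_string = ET.tostring(movie, encoding="unicode")
print(xml_string)
```

The output is written on a single line because we did not add any whitespace between the elements. It is still well-formed, since indentation is only there for human readers.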

4.1.6 Wrapping-Up

About XML

  • designed to store and transfer data
  • designed to be self-descriptive
  • tags are not predefined and can be extended
  • a generic language that provides structure and syntax for many markup dialects
  • a syntax or format for defining markup languages
  • a standard for the semantic, hierarchical representation of data
  • a general approach for representing all types of information