4 Basics of XML
The goal of this chapter is to give you a crash course in XML so that you have a good grasp of this format for the rest of the book.
Large amounts of data and information are stored, shared, and distributed using XML dialects.
They are widely adopted and used in many applications.
Working with data from the Web often means dealing with some kind of XML dialect.
4.1 What is XML?
XML stands for eXtensible Markup Language.
Let’s dissect the meaning of this acronym. On one hand, XML is a markup language, which means that XML defines a set of rules for encoding information in a format that is both human-readable and machine-readable.
Compared to other types of markup languages (e.g. LaTeX, Markdown), XML is used to describe data. To be more precise, XML is a standard for the semantic, hierarchical representation of data. This is an important aspect of XML and any of its dialects: data is always represented following a hierarchy.
For instance, one way to organize data is in a table. Conceptually, all elements are stored in cells of a grid structure of rows and columns. Another way to organize data is with hierarchies, which can be visually represented with tree-like structures. This latter form of organizing data is what XML uses.
The second aspect, “extensible”, means that we can define any number of new formats to represent any kind of data. This is a very interesting aspect of XML because it provides a flexible framework to create new formats for describing and representing data.
4.1.1 Marks in XML
In XML (as well as in HTML) the marks (also known as tags) are defined using angle brackets: < >.
For example:
<mark>Text marked with special tag</mark>
The concept of extensibility means that we can define our own marks, the order in which they occur, and how they should be processed. For example, we could define marks such as:
<my_mark>
<awesome>
<boring>
<pathetic>
Before moving on, we should mention that XML is NOT:
- a programming language
- a network transfer protocol
- a database
Instead, XML is:
- more than a markup language
- a generic language that provides structure and syntax for representing any type of information
- a meta-language: it allows us to create or define other languages
Here are some famous examples of XML dialects:
- KML (Keyhole Markup Language) for describing geo-spatial information used in Google Earth, Google Maps, Google Sky
- SVG (Scalable Vector Graphics) for visual graphical displays of two-dimensional graphics with support for interactivity and animation
- PMML (Predictive Model Markup Language) for describing and exchanging models produced by data mining and machine learning algorithms
- RSS (Rich Site Summary) feeds for publishing blog entries
- SDMX (Statistical Data and Metadata Exchange) for organizing and exchanging statistical information
- SBML (Systems Biology Markup Language) for describing biological systems
4.1.2 Minimalist Example
Let’s consider a handful of XML examples using one of my favorite movies: Good Will Hunting, a 1997 American psychological drama film directed by Gus Van Sant, and written by Ben Affleck and Matt Damon.

Figure 4.1: Good Will Hunting (Directed by Gus Van Sant, 1997)
Ultra Simple example
Let’s see an ultra simple XML example:
<movie>
Good Will Hunting
</movie>
- one single element: movie
- start-tag: <movie>
- end-tag: </movie>
- content: Good Will Hunting
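To see how a parser views these pieces, here is a minimal sketch in Python using the standard library module xml.etree.ElementTree. Python and the variable names (xml_text, movie) are our own choices for illustration; any XML parser would expose the same information.

```python
import xml.etree.ElementTree as ET

# The ultra simple document, stored as a string (hypothetical variable name)
xml_text = "<movie>\nGood Will Hunting\n</movie>"

# fromstring() parses the text and returns the root element
movie = ET.fromstring(xml_text)

print(movie.tag)            # the element name taken from the start-tag
print(movie.text.strip())   # the content, without the surrounding newlines
```

Note that the parser keeps the line breaks around the content, which is why the sketch calls strip() before printing.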
Elements with attributes
XML elements can have attributes, for example:
<movie mins="126" lang="en">
Good Will Hunting
</movie>
- attributes: mins (minutes) and lang (language)
- attributes are attached to the element’s start tag
- attribute values must be quoted!
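As a small sketch of how attributes look once the document is parsed, the following Python snippet (again using xml.etree.ElementTree, with variable names of our own choosing) reads both attributes. One consequence of the quoting rule is that attribute values always arrive as strings, so numbers have to be converted explicitly.

```python
import xml.etree.ElementTree as ET

xml_text = '<movie mins="126" lang="en">Good Will Hunting</movie>'
movie = ET.fromstring(xml_text)

# All attributes of the start-tag, as a dictionary of strings
print(movie.attrib)

# get() returns a single attribute value (a string), or None if it is absent
language = movie.get("lang")
running_time = int(movie.get("mins"))  # convert the string "126" to a number
print(language, running_time)
```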
Elements within other elements
XML elements may contain other elements, for example:
<movie mins="126" lang="en">
<title>Good Will Hunting</title>
<director>Gus Van Sant</director>
<year>1997</year>
<genre>drama</genre>
</movie>
- an XML element may contain other elements
- movie contains several elements: title, director, year, genre
More Embedded elements
As you can tell, the XML element movie now has a hierarchy. We can make it more interesting by including more elements inside director.
<movie mins="126" lang="en">
<title>Good Will Hunting</title>
<director>
<first_name>Gus</first_name>
<last_name>Van Sant</last_name>
</director>
<year>1997</year>
<genre>drama</genre>
</movie>
Formally, we say that director has two child elements: first_name and last_name.
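The parent and child relationship can be explored programmatically. The sketch below uses Python's xml.etree.ElementTree on a shortened version of the movie document; the variable names are hypothetical, and find() and findtext() are just one of several ways to reach a child element.

```python
import xml.etree.ElementTree as ET

# Shortened version of the movie document (illustration only)
xml_text = """
<movie mins="126" lang="en">
  <title>Good Will Hunting</title>
  <director>
    <first_name>Gus</first_name>
    <last_name>Van Sant</last_name>
  </director>
  <genre>drama</genre>
</movie>
"""
movie = ET.fromstring(xml_text.strip())

# find() returns the first direct child with the given name
director = movie.find("director")

# Iterating over an element yields its child elements in document order
child_names = [child.tag for child in director]
print(child_names)

# findtext() returns the text content of a direct child
full_name = director.findtext("first_name") + " " + director.findtext("last_name")
print(full_name)
```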
Tree Structure in XML
We can graphically display the structure of an XML document with a tree diagram, like the following one:

Figure 4.2: XML tree structure
- An XML document can be represented with a tree structure
- An XML document must have one single Root element
- The Root may contain child elements
- A child element may contain subchild elements
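One way to make the tree tangible is to walk it recursively, starting at the root and visiting each child in turn. The following Python sketch does this with xml.etree.ElementTree; the helper tree_lines() is a hypothetical function written for this illustration, not part of any library.

```python
import xml.etree.ElementTree as ET

def tree_lines(element, depth=0):
    """Return one indented line per element, parents before their children."""
    lines = ["  " * depth + element.tag]
    for child in element:
        lines.extend(tree_lines(child, depth + 1))
    return lines

xml_text = (
    "<movie>"
    "<title>Good Will Hunting</title>"
    "<director><first_name>Gus</first_name><last_name>Van Sant</last_name></director>"
    "<genre>drama</genre>"
    "</movie>"
)
root = ET.fromstring(xml_text)

# Each level of indentation corresponds to one level of the tree
print("\n".join(tree_lines(root)))
```

The indentation of each printed name mirrors its depth in the tree: movie is the root, director is a child, and first_name and last_name are subchildren.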
4.1.3 Well Formedness
We say that an XML document is well-formed when it obeys the basic syntax rules of XML. Some of those rules are:
- one root element containing the rest of elements
- properly nested elements
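- every start-tag has a matching end-tag, or the element is written as a self-closing (empty-element) tag such as <movie/>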
- attributes appear in start-tags of elements
- attribute values must be quoted
- element names and attribute names are case sensitive
Does it matter if an XML document is not well-formed? Yes. A conforming XML parser treats a violation of these rules as a fatal error, so a document that is not well-formed typically cannot be processed at all.
Keep in mind that documents may be well-formed but not valid. Well-formed just guarantees that the document meets the basic XML syntax, not that its content follows the structure prescribed by a particular dialect (for example, in a DTD or schema).
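A quick way to check well-formedness is to hand the text to a parser and see whether it complains. Here is a minimal sketch in Python with xml.etree.ElementTree; the function is_well_formed() and the example strings are our own, written only to illustrate some of the rules listed above.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if the parser accepts xml_text, False if it raises ParseError."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

# Hypothetical snippets, one per rule
examples = {
    "well-formed": "<movie><title>Good Will Hunting</title></movie>",
    "two root elements": "<movie/><movie/>",
    "badly nested": "<movie><title>Good Will Hunting</movie></title>",
    "missing end-tag": "<movie><title>Good Will Hunting</movie>",
    "unquoted attribute": "<movie mins=126/>",
    "case mismatch": "<Movie>Good Will Hunting</movie>",
}

for label, text in examples.items():
    print(label, "->", is_well_formed(text))
```

Only the first snippet is accepted. Note that this check says nothing about validity: ElementTree does not validate a document against a DTD or schema.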
4.1.4 Additional XML Elements
Some Additional Elements
<?xml version="1.0" encoding="UTF-8"?>
<?GS print(format = TRUE)?>
<!DOCTYPE movie>
<!-- This is a comment -->
<movie mins="126" lang="en">
<![CDATA[ a > 5 & b < 10 ]]>
<title>Good Will Hunting</title>
<director>
<first_name>Gus</first_name>
<last_name>Van Sant</last_name>
</director>
<year>1997</year>
<genre>drama</genre>
</movie>
The following table lists some of the common additional XML elements:
Markup | Name | Description |
---|---|---|
<?xml ?> | XML Declaration | Identifies the content as an XML document and states the XML version and encoding |
<?PI ?> | Processing Instruction | Instructions passed to the application identified by the target PI |
<!DOCTYPE > | Document-type Declaration | Names the document type and references or contains the rules (DTD) that define the document’s structure |
<![CDATA[ ]]> | CDATA (Character Data) | Content is read as literal text; markup characters such as < and & inside it are not interpreted |
<!-- --> | Comment | For writing comments |
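To get a feeling for how a parser treats some of these constructs, consider the following Python sketch with xml.etree.ElementTree. The note element is a hypothetical addition used only to hold the CDATA section, and the behavior shown (comments dropped, CDATA returned as plain text) is that of ElementTree's default parser; other parsers may expose comments and processing instructions differently.

```python
import xml.etree.ElementTree as ET

# 'note' is a hypothetical element added for this illustration
xml_text = """<?xml version="1.0"?>
<movie>
<!-- This is a comment -->
<title>Good Will Hunting</title>
<note><![CDATA[ a > 5 & b < 10 ]]></note>
</movie>"""

root = ET.fromstring(xml_text)

# The default parser does not keep the comment in the tree
child_tags = [child.tag for child in root]
print(child_tags)

# The CDATA content comes back as ordinary text, with > & < left untouched
note_text = root.findtext("note")
print(repr(note_text))

# Without CDATA, the same text has to be written with escaped characters
escaped = ET.fromstring("<note> a &gt; 5 &amp; b &lt; 10 </note>")
print(escaped.text == note_text)
```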
4.1.5 Another Example
Let’s go back to the movie example, but now let’s see what the content of our hypothetical XML document should look like:
<?xml version="1.0"?>
<!DOCTYPE movie>
<movie mins="126" lang="en">
<!-- this is a comment -->
<title>Good Will Hunting</title>
<director>
<first_name>Gus</first_name>
<last_name>Van Sant</last_name>
</director>
<year>1997</year>
<genre>drama</genre>
</movie>
Each node can have:
- a Name
- any number of attributes
- optional content
- other nested elements
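These four ingredients are also what we need when creating a node from scratch. As a rough sketch, assuming Python's xml.etree.ElementTree and variable names of our own choosing, the movie element can be built piece by piece and then written out as XML text:

```python
import xml.etree.ElementTree as ET

# A name and some attributes
movie = ET.Element("movie", attrib={"mins": "126", "lang": "en"})

# Nested elements, each with its own name and content
title = ET.SubElement(movie, "title")
title.text = "Good Will Hunting"

director = ET.SubElement(movie, "director")
ET.SubElement(director, "first_name").text = "Gus"
ET.SubElement(director, "last_name").text = "Van Sant"

ET.SubElement(movie, "genre").text = "drama"

# Serialize the tree to a string of XML
xml_string = ET.tostring(movie, encoding="unicode")
print(xml_string)

# Parsing the string again gives back an equivalent tree
reparsed = ET.fromstring(xml_string)
```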
4.1.6 Wrapping-Up
About XML
- designed to store and transfer data
- designed to be self-descriptive
- tags are not predefined and can be extended
- a generic language that provides structure and syntax for many markup dialects
- a syntax or format for defining markup languages
- a standard for the semantic, hierarchical representation of data
- provides a general approach for representing all types of information
Comments
Before moving on, we want to clarify some key terms.
A markup is a sequence of characters or other symbols inserted at certain places in a document to indicate either:
A Markup Language is a system for annotating (i.e. marking) a document in such a way that the content is distinguished from its representation (e.g. LaTeX, PostScript, HTML, SVG).