19 Storage of Data Tables

To illustrate some of the ideas presented in this chapter we’ll keep using the toy example with data from Leia Organa, Luke Skywalker, and Han Solo.

Three individuals

Figure 19.1: Three individuals

If we ask you to think of a table containing the information of Leia, Luke, and Han, you would probably picture something like this:

Conceptual table in the analyst's mind

Figure 19.2: Conceptual table in the analyst’s mind

So the question is: How do we store data in tabular format? What type of file? How do we store “cells” in a file? This issue has to do with the mapping between the notion of data as in the data scientist’s mind, and the notion of data as in the way it’s stored digitally (file format, structure, location, etc).

A table in the analyst's mind, and its mapping to be stored in one or more files

Figure 19.3: A table in the analyst’s mind, and its mapping to be stored in one or more files

19.1 Plain Text Formats

A common way to store data in tabular form is via text files. Confusingly, people may refer to text files in various ways such as:

  • Plain text files

  • Formatted text files

  • Enriched text files

We prefer Norman Matloff’s way of thinking about text files:

“Let’s take the term text file to mean a file that consists mainly of ASCII characters … and that uses newline characters to give humans the perception of lines”

Norman Matloff (The Art of R Programming, 2011)

By text files we mean plain text files, that is, any file that is in a human-readable form. Common file extensions are .txt, .csv, .xml, .html, .md, etc. Moreover, text files stored as a sequence of characters, in which each character is stored as a single byte of data. In addition, text files can be read and edited with a text editor (not to confuse with a word processor).

19.1.1 Plain Text Format

There are 2 main subtypes of plain text formats, depending on how the separated values are identified in a row:

  1. Delimited formats
  2. Fixed-width formats

The key aspect is: How do we convey the notion of rows and columns in a text file?

Regardless of what file option we choose, we need a way to separate data values. In both delimited and fixed-width formats, each line represents a “row”. The differerence is in the way we separate data values. In delimited formats the idea of columns is conveyed with delimiters. In fixed-width formats we use a fixed number of characters for each field or column.

Some of the common delimiters are:

  • " " (white space)
  • "," (comma)
  • "\t" (tab)
  • ";" (semicolon)

Let’s bring back our toy table

name gender height
Leia Organa female 1.50
Luke Skywalker male 1.72
Han Solo male 1.80


Comma demilited files

We can store the data in a text delimiter file using commas as delimiters:

name,gender,height
Leia Organa,female,1.50
Luke Skywalker,male,1.72
Han Solo,male,1.80

What if a comma is part of a field? For example, say column name has the format of last name, comma, and then first name:

name gender height
Organa, Leia female 1.50
Skywalker, Luke male 1.72
Solo, Han male 1.80

In this case, we need to wrap the name field within quotes so that the commas separating the first name from the last name are not treated as delimiters:

name,gender,height
"Organa, Leia",female,1.50
"Skywalker, Luke",male,1.72
"Solo, Han",male,1.80

Semicolon delimited files

We can also store the data in a text delimiter file using semicolons as delimiters, which is a common format used in Europe:

name;gender;height
Leia Organa;female;1.50
Luke Skywalker;male;1.72
Han Solo;male;1.80

Space delimited files

Another common type of delimiter is whitespace. Notice that in our exmple, the name of each invidual contains a whitespace to distinguish the first name from the last name. Because of this, we need to wrap the name within quotes. In that way, any whitesapces inside a string surrounded by quotes won’t be treated as delimiters.

name gender height
"Leia Organa" female 1.50
"Luke Skywalker" male 1.72
"Han Solo" male 1.80

Tab delimited files

Tab delimited files are another typical option to store tables. The tab character is formed by a backslash and the letter t: \t. Keep in mind that most editors don’t display tab as \t, but rather as a blank space. Technically, if our toy table is stored as a TSV file, its content would be:

name\tgender\theight
"Leia Organa"\tfemale\t1.50
"Luke Skywalker"\tmale\t1.72
"Han Solo"\tmale\t1.80

General delimited files

Because delimited files are not formal standards but just conventions used for storing data, nothing stops you from using a different character as delimiter. For example, we could use a vertical bar | as a field separator:

name|gender|height
"Leia Organa"|female|1.50
"Luke Skywalker"|male|1.72
"Han Solo"|male|1.80

Or double colons ::, for example

name::gender::height
"Leia Organa"::female::1.50
"Luke Skywalker"::male::1.72
"Han Solo"::male::1.80

19.1.2 Fixed-Width Format

As the name indicates, in fixed-width formats we convey the idea of columns in a file by using a fixed number of characters for each field.

name           gender height
Leia Organa    female 1.50
Luke Skywalker male   1.72
Han Solo       male   1.80

The first field (name) occupies 15 characters, and then there is a whitespace before the values of the next field. The second field gender takes a width of 6 characters, followed by another whitespace before the next field. The last field height occupies 6 characters (the number of characters in the word height).

19.2 Pros and Cons of Tables

Some advantages:

  • Simplicity
  • Common formats (csv, tsv, txt, dat, etc)
  • Can be opened and modifed with a text editor
  • Can also be opened in spreadsheet software
  • Easy to understand for most users
  • Can be read in data analysis software

Some disadvantages:

  • Not good for hierarchical or nested structures
  • Values may not be well self-described
  • Difficult to include metadata
  • Long column names are cumbersome

19.2.1 Summary Slides

19.3 Exercises

Take one minute to decide which data type is most appropriate for each of the following variables collected in a medical experiment: Subject id, name, treatment, sex, address, race, eye colour, birth city, birth state.