19 Storage of Data Tables
To illustrate some of the ideas presented in this chapter we’ll keep using the toy example with data from Leia Organa, Luke Skywalker, and Han Solo.
If we ask you to think of a table containing the information of Leia, Luke, and Han, you would probably picture something like this:
So the question is: How do we store data in tabular format? What type of file? How do we store “cells” in a file? This issue has to do with the mapping between the notion of data as in the data scientist’s mind, and the notion of data as in the way it’s stored digitally (file format, structure, location, etc).
19.1 Plain Text Formats
A common way to store data in tabular form is via text files. Confusingly, people may refer to text files in various ways such as:
Plain text files
Formatted text files
Enriched text files
We prefer Norman Matloff’s way of thinking about text files:
“Let’s take the term text file to mean a file that consists mainly of ASCII characters … and that uses newline characters to give humans the perception of lines”
Norman Matloff (The Art of R Programming, 2011)
By text files we mean plain text files, that is, any file that is in a human-readable form. Common file extensions are .txt
, .csv
, .xml
, .html
,
.md
, etc. Moreover, text files stored as a sequence of characters, in which
each character is stored as a single byte of data. In addition, text files can
be read and edited with a text editor (not to confuse with a word processor).
19.1.1 Plain Text Format
There are 2 main subtypes of plain text formats, depending on how the separated values are identified in a row:
- Delimited formats
- Fixed-width formats
The key aspect is: How do we convey the notion of rows and columns in a text file?
Regardless of what file option we choose, we need a way to separate data values. In both delimited and fixed-width formats, each line represents a “row”. The differerence is in the way we separate data values. In delimited formats the idea of columns is conveyed with delimiters. In fixed-width formats we use a fixed number of characters for each field or column.
Some of the common delimiters are:
" "
(white space)","
(comma)"\t"
(tab)";"
(semicolon)
Let’s bring back our toy table
name | gender | height |
---|---|---|
Leia Organa | female | 1.50 |
Luke Skywalker | male | 1.72 |
Han Solo | male | 1.80 |
Comma demilited files
We can store the data in a text delimiter file using commas as delimiters:
name,gender,height
Leia Organa,female,1.50
Luke Skywalker,male,1.72
Han Solo,male,1.80
What if a comma is part of a field? For example, say column name has the format of last name, comma, and then first name:
name | gender | height |
---|---|---|
Organa, Leia | female | 1.50 |
Skywalker, Luke | male | 1.72 |
Solo, Han | male | 1.80 |
In this case, we need to wrap the name field within quotes so that the commas separating the first name from the last name are not treated as delimiters:
name,gender,height
"Organa, Leia",female,1.50
"Skywalker, Luke",male,1.72
"Solo, Han",male,1.80
Semicolon delimited files
We can also store the data in a text delimiter file using semicolons as delimiters, which is a common format used in Europe:
name;gender;height
Leia Organa;female;1.50
Luke Skywalker;male;1.72
Han Solo;male;1.80
Space delimited files
Another common type of delimiter is whitespace. Notice that in our exmple, the name of each invidual contains a whitespace to distinguish the first name from the last name. Because of this, we need to wrap the name within quotes. In that way, any whitesapces inside a string surrounded by quotes won’t be treated as delimiters.
name gender height
"Leia Organa" female 1.50
"Luke Skywalker" male 1.72
"Han Solo" male 1.80
Tab delimited files
Tab delimited files are another typical option to store tables. The tab
character is formed by a backslash and the letter t: \t
. Keep in mind that
most editors don’t display tab as \t
, but rather as a blank space. Technically,
if our toy table is stored as a TSV file, its content would be:
name\tgender\theight
"Leia Organa"\tfemale\t1.50
"Luke Skywalker"\tmale\t1.72
"Han Solo"\tmale\t1.80
General delimited files
Because delimited files are not formal standards but just conventions used
for storing data, nothing stops you from using a different character as
delimiter. For example, we could use a vertical bar |
as a field separator:
name|gender|height
"Leia Organa"|female|1.50
"Luke Skywalker"|male|1.72
"Han Solo"|male|1.80
Or double colons ::
, for example
name::gender::height
"Leia Organa"::female::1.50
"Luke Skywalker"::male::1.72
"Han Solo"::male::1.80
19.1.2 Fixed-Width Format
As the name indicates, in fixed-width formats we convey the idea of columns in a file by using a fixed number of characters for each field.
name gender height
Leia Organa female 1.50
Luke Skywalker male 1.72
Han Solo male 1.80
The first field (name) occupies 15 characters, and then there is a whitespace before the values of the next field. The second field gender takes a width of 6 characters, followed by another whitespace before the next field. The last field height occupies 6 characters (the number of characters in the word height).
19.2 Pros and Cons of Tables
Some advantages:
- Simplicity
- Common formats (csv, tsv, txt, dat, etc)
- Can be opened and modifed with a text editor
- Can also be opened in spreadsheet software
- Easy to understand for most users
- Can be read in data analysis software
Some disadvantages:
- Not good for hierarchical or nested structures
- Values may not be well self-described
- Difficult to include metadata
- Long column names are cumbersome
19.2.1 Summary Slides
19.3 Exercises
Take one minute to decide which data type is most appropriate for each of the following variables collected in a medical experiment: Subject id, name, treatment, sex, address, race, eye colour, birth city, birth state.