5 What Do We Mean by Data?

The last section of the introductory part of this book deals with the notion of data. When people talk about “data”, what exactly do they mean?

The term data has become so broad that it is one of those things that means everything and nothing at the same time.

5.1 Ways to Think About Data

We like to consider three different ways to talk about data:

  • How analysts, scientists, and practitioners tend to think about data
  • How data is stored, under what format, with what structure
  • How programs and languages handle data (what types and objects they use)

Here’s a conceptual diagram depicting this idea:

Figure 5.1: Three Views of Data

You will see that each perspective is unique, with its own challenges and its own idiosyncrasies. An important part of your role as a data scientist is to develop a good mapping between these perspectives.

5.1.1 How do analysts think about data?

This is how we think, the abstract or conceptual view of data.

Figure 5.2: How data scientists tend to think about data

The data scientist very likely pictures a table in her mind with data values for years and meters; this information is what she needs to produce a timeline chart that allows her to visualize the progression of world records.

What data scientists typically have in mind are mostly mathematical and statistical abstractions such as variables, features, covariates, etc. This may involve thinking about scales of measurement (e.g. binary, nominal, ordinal, quantitative); it may also involve expressing things in terms of relationships or associations, perhaps through theoretical mathematical models that take various forms (e.g. linear, quadratic, non-parametric, etc.).

  • Quantitative -vs- Qualitative
  • Continuous -vs- Discrete
  • Numerical -vs- Categorical
  • Scales: Ratio, Interval, Ordinal, Nominal
  • Dependent -vs- Independent
  • Descriptors (predictors) -vs- Response
  • Input -vs- Output
  • Missing values, Censored
  • Correlations
  • Theoretical model
  • Specific type of theoretical model (e.g. linear model)
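
To make the mental table concrete, here is a minimal sketch in R of how a few of these abstractions might be written down. The three rows are well-known long jump marks typed in by hand for illustration (they are not scraped from the Wikipedia table), and the names `records`, `year`, `meters`, and `era`, as well as the era cut points, are our own assumptions.

```r
# A tiny hand-typed table (illustrative values, not the full data set).
records <- data.frame(
  year   = c(1935L, 1968L, 1991L),   # discrete, interval scale
  meters = c(8.13, 8.90, 8.95)       # continuous, ratio scale
)

# An ordinal variable derived from year (cut points and labels are assumptions).
records$era <- cut(
  records$year,
  breaks = c(1900, 1950, 1980, 2000),
  labels = c("early", "middle", "recent"),
  ordered_result = TRUE
)

# Input (predictor) -vs- output (response): a linear model is one
# specific type of theoretical model.
fit <- lm(meters ~ year, data = records)

# Association between the two quantitative variables.
r <- cor(records$year, records$meters)
```

With only three rows the fitted model is not meaningful in itself; the point is only that "variable", "scale", "response", and "model" each have a direct counterpart in code.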

5.1.2 What about the storage, organization, format of the data?

Another data perspective has to do with the way in which data sets are stored, which includes the file format, the structure, and sometimes the location of such files. In the long jump world records example, we can think of two types of files. On the one hand, we have the raw HTML file of the Wikipedia page that contains the HTML table with all world records. On the other hand, we could also have the clean data, stored as a CSV file after scraping the HTML file.

Because these two files are fairly small, they can easily be stored on your computer or on a flash drive. But sometimes data sets are so big that they won’t fit on your computer. They have to be stored on remote computers, known as file servers, and you need a way to communicate with the server in order to access the required data.
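
The clean-CSV side of this storage view can be sketched in R as follows. This is only an illustration under stated assumptions: the table holds three hand-typed marks rather than the scraped data, the file is written to a temporary location, and the names `records`, `csv_path`, `raw_lines`, and `records_back` are hypothetical.

```r
# Hypothetical clean table (hand-typed values, for illustration only).
records <- data.frame(
  year   = c(1935L, 1968L, 1991L),
  meters = c(8.13, 8.90, 8.95)
)

# Store the table as a CSV file in a temporary location.
csv_path <- tempfile(fileext = ".csv")
write.csv(records, file = csv_path, row.names = FALSE)

# On disk, the file is nothing more than lines of text:
# one header line followed by one line per record.
raw_lines <- readLines(csv_path)

# Reading the file back rebuilds a table that R can compute on.
records_back <- read.csv(csv_path)
```

The same table therefore exists twice: once as text in a file (`raw_lines`) and once as an object in the R session (`records_back`). For data kept on a remote server, `read.csv()` can also be given a URL instead of a local path, although larger data sets usually call for other tools.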

Figure 5.3: Format and Storage

In this book we will make the following assumptions:

  • Data is already in digital form
  • It has already been collected
  • It is already stored in some files/directories
  • No worries about transcribing data, or setting up a database

5.1.3 How do programming languages handle data?

The third data perspective has to do with the way programming languages and software handle data.

At the end of the day, a program needs to provide some mechanism not only to import a data set, but also to organize the content of such data in a way that we can do computations on it.

Figure 5.4: Data Objects

In general, programming languages offer two types or levels for handling data:

  • Data Types
  • Data Structures

Data types are the simplest building blocks (integer, real, logical, character). Think of these as the atoms or elementary molecules.

Data structures, also known as data objects, are the containers that hold values of one or more data types. If we think of data types as atoms, then data objects would be the complex molecules.
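
As a rough illustration of the two levels in R, the sketch below asks R for the type of four single values and then groups several values into one container. The variable names are made up for the example.

```r
# Data types: the elementary building blocks.
an_integer  <- 1991L
a_real      <- 8.95
a_logical   <- TRUE
a_character <- "long jump"

types <- c(
  typeof(an_integer),   # "integer"
  typeof(a_real),       # "double" (R's name for real numbers)
  typeof(a_logical),    # "logical"
  typeof(a_character)   # "character"
)

# A data structure: one container holding several values of the same type.
marks <- c(8.13, 8.90, 8.95)
n_marks <- length(marks)
```

One R-specific caveat: R has no standalone scalars, so even a single value such as `a_real` is stored as a vector of length one.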

Programs use a variety of objects for storing data. Among the common names you will find out there are:

  • lists
  • arrays
  • sets
  • tables
  • dictionaries

Depending on which program you are using, you will find some type of data container. One generic way to think of data containers is in terms of their dimensions, or some other properties.

  • One dimensional objects
  • Two dimensional objects
  • Multidimensional objects

In this book we’ll focus on those objects available in R: vectors, factors, arrays (which include matrices), lists, and data frames.
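
A minimal sketch of these R containers, organized by dimension, might look like the following. All object names and values are placeholders chosen for the example.

```r
# One-dimensional objects
v <- c(8.13, 8.90, 8.95)                     # vector: values of a single type
f <- factor(c("low", "high", "low"))         # factor: categorical values
l <- list(event = "long jump", marks = v)    # list: elements of mixed types

# Two-dimensional objects
m  <- matrix(1:6, nrow = 2, ncol = 3)        # matrix: a 2-D array, single type
df <- data.frame(                            # data frame: columns may differ in type
  year   = c(1935L, 1968L, 1991L),
  meters = v
)

# Multidimensional object
a <- array(1:24, dim = c(2, 3, 4))           # array with three dimensions

# dim() reports the dimensions; plain vectors have none.
dim_v  <- dim(v)    # NULL
dim_m  <- dim(m)    # 2 3
dim_df <- dim(df)   # 3 2
dim_a  <- dim(a)    # 2 3 4
```

Vectors, matrices, and arrays hold values of a single data type, whereas lists and data frames can mix types, which is why the data frame is the usual counterpart of the table that the analyst has in mind.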

5.1.4 Diagram: 3 views of data

As you can tell, simply talking about “data” without being more specific is too loose. To summarize, data takes three states: 1) in the mind of the data scientist, 2) in the files in which it is stored, and 3) in the way programs and languages allow you to interact with it via data types and data objects.

Figure 5.5: Three Views of Data