24 Intro to Functions

R comes with many functions (and packages) that let us perform a wide variety of tasks.
Most of the things we do in R is via calling some function.
Sometimes, however, there’s no function to do what we want to achieve.
When that’s the case, you will want to write your own functions.

So far you’ve been using a number of functions in R. Now it’s time to see how you can create and use your own functions.

24.1 Motivation

Consider the data set starwars that comes in the package "dplyr"

starwars
#> # A tibble: 87 x 13
#>    name  height  mass hair_color skin_color eye_color birth_year gender
#>    <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> 
#>  1 Luke…    172    77 blond      fair       blue            19   male  
#>  2 C-3PO    167    75 <NA>       gold       yellow         112   <NA>  
#>  3 R2-D2     96    32 <NA>       white, bl… red             33   <NA>  
#>  4 Dart…    202   136 none       white      yellow          41.9 male  
#>  5 Leia…    150    49 brown      light      brown           19   female
#>  6 Owen…    178   120 brown, gr… light      blue            52   male  
#>  7 Beru…    165    75 brown      light      blue            47   female
#>  8 R5-D4     97    32 <NA>       white, red red             NA   <NA>  
#>  9 Bigg…    183    84 black      light      brown           24   male  
#> 10 Obi-…    182    77 auburn, w… fair       blue-gray       57   male  
#> # … with 77 more rows, and 5 more variables: homeworld <chr>, species <chr>,
#> #   films <list>, vehicles <list>, starships <list>

Let’s focus on the variable height, more specifically on the first 10 values:

ht10 <- starwars$height[1:10]
ht10
#>  [1] 172 167  96 202 150 178 165  97 183 182

The values of height (and ht10) are expressed in centimeters, but what if we wanted to obtain values in inches? The conversion formula is 1 cm = 0.3937 in.

# height in inches
ht10 * 0.3937
#>  [1] 67.7 65.7 37.8 79.5 59.1 70.1 65.0 38.2 72.0 71.7

This works. But what if you had more data sets, all of them containing height values in cms, and you needed to convert those cms into inches? Wouldn’t be nice to have a dedicated function cm2in()? I know this is a toy example but I would find convenient to have cm2in() because I can never remember the coversion value: 1 cm = 0.3937 in.

cm2in(ht10)

R does not have a built-in function cm2in() but we can create one. Let’s see how to do it “logically” step by step.

24.1.1 Writing a simple function

So, how do you create a function? The first step is to write code and make sure that it works. We recommend that you take a small and concrete example:

# 1) concrete example
ht10 * 0.3937
#>  [1] 67.7 65.7 37.8 79.5 59.1 70.1 65.0 38.2 72.0 71.7

The next step is to make the code more general. Instead of working with the object ht10 that refers to just the first 10 hieght values, we can give it a more algebraic name as x:

# 2) make it more general
x <- ht10
y <- x * 0.3937

This also allows us to identify what is the input or inputs (x in this case), what are the required computations, and what is the output (y in this case).

Then, you encapsulate the code that will make the body of the function within curly braces to form a compound R expression. This is definitely more of an abstract step, but here’s the code:

# 3) encapsulate code with "an R expression"
# i.e. wrapping code around curly braces
{
  y <- x * 0.3937
}

Finally, declare the function with function() and assign it a name. This involves choosing a name for the function, name(s) for the argument(s), and determine the output with a return() statement:

# 4) create function
cm2in <- function(x) {
  y <- x * 0.3937
  return(y)
}

Here are all the steps previously described, plus some more steps after you create cm2in()

# 1) concrete example
ht10 * 0.3937

# 2) make it more general
x <- ht10
y <- x * 0.3937

# 3) encapsulate code with "an R expression"
# i.e. wrapping code around curly braces
{
  y <- x * 0.3937
}

# 4) create function
cm2in <- function(x) {
  y <- x * 0.3937
  return(y)
}

# 5) test it
cm2in(ht10)

# 6) keep testing
cm2in(starwars$height)

If you want to get the conversion of 100 cm to inches, you just simply execute it again by changing its argument:

cm2in(100)
#> [1] 39.4

Notice that the function is vectorized, this is because we are using arithmetic operators (i.e. multiplication, subtraction, division).

Sometimes it is recommended to add a default value to one (or more) of the arguments. In this case, we can give a default value of x = 1. When the user executes the function without any input, cm2in() returns the value of 1 cm to inches:

cm2in <- function(x = 1) {
  y <- x * 0.3937
  return(y)
}

# default execution
cm2in()
#> [1] 0.394

In Summary

To define a new function in R you use the function function().
You need to specify a name for the function, and then assign function() to the chosen name.
You also need to define optional arguments (i.e. inputs of the function).
And of course, you must write the code (i.e. the body) so the function does something when you use it.

24.2 Anatomy of a function

To define a new function in R you use the function function(). You need to specify a name for the function, and then assign function() to the chosen name. You also need to define optional arguments (i.e. inputs). And of course, you must write the code (i.e. the body) so the function does something when you use it:

# anatomy of a function
some_name <- function(arguments) {
  # body of the function
}

Usually, you give a name to a function (although there are also anonymous functions).
A function takes one or more inputs (or none), known as arguments.
The expressions forming the operations comprise the body of the function.
You wrap the body of a function with curly braces.
A function returns a single value (i.e. a single object).

A bit less abstract function could have the following structure:

some_name <- function(arg1, arg2, arg3) {
  expression_1
  expression_2
  ...
  expression_n
}

the name of this hypothetical function is some_name
it uses several arguments: arg1, arg2, and arg3
the body is wrapped within braces {...}
in general, the last expression expression_n would be the returned output

24.2.1 Scale Transformations

Let’s see another example. Often, we need to transform the scale of one or more variables. Perhaps the most common type of transformation is when we standardize a variable, that is: subtract its mean, and divide by its standard deviation:

\[ z = \frac{x - \mu}{\sigma} \]

R has the function scale() that can be used to perform this operation, but let’s pretend for a minute that there’s no function in R to calculate standard scores. Here are the primary steps to compute such score:

compute the mean $\mu$
compute the standard deviation $\sigma$
calculate deviations from the mean $x - \mu$
divide deviations-from-mean by standard deviation $(x - \mu) / \sigma$

x <- ht10                   
x_mean <- mean(x)      # compute the mean
x_sd <- sd(x)          # compute std dev
x_devs <- x - x_mean   # deviations from the mean
z <- x_devs / x_sd     # normalize by std dev
z
#>  [1]  0.358  0.218 -1.770  1.199 -0.258  0.526  0.162 -1.742  0.666  0.638

Having the code of the function’s body, we can encapsulate it with a function assignment:

# first round
standardize <- function(x) {
  x_mean <- mean(x)
  x_sd <- sd(x)
  x_devs <- x - x_mean
  z <- x_devs / x_sd
  return(z)
}

And now we can test it:

standardize(ht10)
#>  [1]  0.358  0.218 -1.770  1.199 -0.258  0.526  0.162 -1.742  0.666  0.638

24.2.2 The `return()` command

As you can tell, the last line in the body of standardize() uses the return() function. More often than not, the return() command is included to explicitly indicate the output of a function:

# simple example
add <- function(x, y) {
  z <- x + y
  return(z)
}

add(2, 3)
#> [1] 5

I’ve seen that many users with previous programming experience in other languages prefer to use print(). The main reason is that other programming languages tend to use some sort of print statement to indicate the output of a function. However, the dedicated function in R to specify the output of a function is return(). You could use print() but I strongly suggest that you use return() instead. The reason is because print() is a generic method in R, which means that print() is not a single function but a family of functions that have different behaviors depending on the class of the object they are printing. So to play safe, stick with return().

24.2.3 More Testing

What about applying standardize() on the entire column height:

standardize(starwars$height)
#>  [1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
#> [26] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
#> [51] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
#> [76] NA NA NA NA NA NA NA NA NA NA NA NA

Ooops! Because starwars$height contains missing values, our standardize() function does not know how to deal with them.

24.2.4 Dealing with missing values

How to deal with NA’s? Many functions in R like sum(), mean(), and median() have the so-called na.rm argument to specify whether missing values should be removed before carrying certain computations. If a function has this argument, you just take advantage of it by using na.rm = TRUE:

# second round
standardize <- function(x) {
  x_mean <- mean(x, na.rm = TRUE)
  x_sd <- sd(x, na.rm = TRUE)
  x_devs <- x - x_mean
  z <- x_devs / x_sd
  return(z)
}

standardize(ht10)
#>  [1]  0.358  0.218 -1.770  1.199 -0.258  0.526  0.162 -1.742  0.666  0.638

standardize(starwars$height)
#>  [1] -0.0678 -0.2116 -2.2536  0.7950 -0.7005  0.1047 -0.2691 -2.2248  0.2485
#> [10]  0.2198  0.3923  0.1623  1.5427  0.1623 -0.0391  0.0185 -0.1253  0.1623
#> [19] -3.1164 -0.1253  0.2485  0.7375  0.4499  0.0760  0.0185  0.1623 -0.7005
#> [28]      NA -2.4837 -0.4129  0.5361  0.4786 -0.1253  0.6224  1.4277  0.9100
#> [37]  0.2485 -1.0744 -1.7934  0.2485 -0.3267  0.0185  0.1623  0.1047 -2.3111
#> [46] -1.5058 -0.3267  0.3923  0.6799  0.6224 -0.0966  0.2773  0.3923  2.5781
#> [55]  0.3923  0.6224  0.3061 -0.4992  0.2485  0.2485 -0.1253 -0.2404 -0.2691
#> [64]  0.5361  0.4786  0.2485 -0.1829  0.6799  1.5715  1.1113 -0.2116 -2.7425
#> [73] -2.2536  0.5361  0.4786  0.1047  1.1976  1.7153  0.3923  0.1047  0.9100
#> [82]      NA      NA      NA      NA      NA -0.2691

Now standardize() is able to return a more useful output by removing missing values. However, we should let the user decide if NA’s must be removed. We can include an argument na.rm in standardize() to indicate whether missing values are to be removed:

# third round
standardize <- function(x, na.rm = FALSE) {
  x_mean <- mean(x, na.rm = na.rm)
  x_sd <- sd(x, na.rm = na.rm)
  x_devs <- x - x_mean
  z <- x_devs / x_sd
  return(z)
}

Notice that standardize() uses an argument na.rm that it’s set to FALSE by default. Likewise, we use such argument na.rm to pass it to the homonym arguments of mean() and sd().

# default call
standardize(starwars$height)
#>  [1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
#> [26] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
#> [51] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
#> [76] NA NA NA NA NA NA NA NA NA NA NA NA

# removing NAs
standardize(starwars$height, na.rm = TRUE)
#>  [1] -0.0678 -0.2116 -2.2536  0.7950 -0.7005  0.1047 -0.2691 -2.2248  0.2485
#> [10]  0.2198  0.3923  0.1623  1.5427  0.1623 -0.0391  0.0185 -0.1253  0.1623
#> [19] -3.1164 -0.1253  0.2485  0.7375  0.4499  0.0760  0.0185  0.1623 -0.7005
#> [28]      NA -2.4837 -0.4129  0.5361  0.4786 -0.1253  0.6224  1.4277  0.9100
#> [37]  0.2485 -1.0744 -1.7934  0.2485 -0.3267  0.0185  0.1623  0.1047 -2.3111
#> [46] -1.5058 -0.3267  0.3923  0.6799  0.6224 -0.0966  0.2773  0.3923  2.5781
#> [55]  0.3923  0.6224  0.3061 -0.4992  0.2485  0.2485 -0.1253 -0.2404 -0.2691
#> [64]  0.5361  0.4786  0.2485 -0.1829  0.6799  1.5715  1.1113 -0.2116 -2.7425
#> [73] -2.2536  0.5361  0.4786  0.1047  1.1976  1.7153  0.3923  0.1047  0.9100
#> [82]      NA      NA      NA      NA      NA -0.2691

24.2.5 Simplifying the body

So far we have a working function standardize() that does the job and takes care of potential missing values. We can take a further step and review the code of the body. Let’s go back to the initial code:

x <- ht10
x_mean <- mean(x)
x_sd <- sd(x)
x_devs <- x - x_mean
z <- x_devs / x_sd

The code above works, but it is somewhat “verbose”. We can take advantage of R’s functional behavior to shorten the computation of the standard scores in one line:

x <- ht10
z <- (x - mean(x)) / sd(x)
z
#>  [1]  0.358  0.218 -1.770  1.199 -0.258  0.526  0.162 -1.742  0.666  0.638

Having simplified the code, we can simplify our standardize() function:

# fifth round
standardize <- function(x, na.rm = FALSE) {
  z <- (x - mean(x, na.rm = na.rm)) / sd(x, na.rm = na.rm)
  return(z)
}

standardize(tail(starwars$height, n = 10), na.rm = TRUE)
#>  [1]  1.484 -0.231 -0.604  0.440     NA     NA     NA     NA     NA -1.089