24 Intro to Functions
- R comes with many functions (and packages) that let us perform a wide variety of tasks.
- Most of what we do in R is done by calling some function.
- Sometimes, however, there’s no function to do what we want to achieve.
- When that’s the case, you will want to write your own functions.
So far you’ve been using a number of functions in R. Now it’s time to see how you can create and use your own functions.
24.1 Motivation
Consider the data set starwars that comes in the package "dplyr":
starwars
#> # A tibble: 87 x 13
#> name height mass hair_color skin_color eye_color birth_year gender
#> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
#> 1 Luke… 172 77 blond fair blue 19 male
#> 2 C-3PO 167 75 <NA> gold yellow 112 <NA>
#> 3 R2-D2 96 32 <NA> white, bl… red 33 <NA>
#> 4 Dart… 202 136 none white yellow 41.9 male
#> 5 Leia… 150 49 brown light brown 19 female
#> 6 Owen… 178 120 brown, gr… light blue 52 male
#> 7 Beru… 165 75 brown light blue 47 female
#> 8 R5-D4 97 32 <NA> white, red red NA <NA>
#> 9 Bigg… 183 84 black light brown 24 male
#> 10 Obi-… 182 77 auburn, w… fair blue-gray 57 male
#> # … with 77 more rows, and 5 more variables: homeworld <chr>, species <chr>,
#> # films <list>, vehicles <list>, starships <list>
Let’s focus on the variable height, more specifically on the first 10 values:
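As a minimal sketch, the selection could look like the code below. To keep the snippet runnable without loading "dplyr", the ten values are copied by hand from the printed tibble above; with the package loaded, starwars$height[1:10] should give the same values.

```r
# first 10 heights (in cm), copied from the printed tibble above
# assumption: with "dplyr" loaded this matches starwars$height[1:10]
ht10 <- c(172, 167, 96, 202, 150, 178, 165, 97, 183, 182)
ht10
```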
The values of height (and ht10) are expressed in centimeters, but what if we
wanted to obtain values in inches? The conversion formula is 1 cm = 0.3937 in.
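A direct way to apply the formula is to multiply the vector by the conversion factor. This is only a sketch; ht10 is entered by hand here (values taken from the printed tibble) so the snippet runs on its own, and the name ht10_in is just an illustrative choice.

```r
# heights in cm (first 10 values of the height column, entered by hand)
ht10 <- c(172, 167, 96, 202, 150, 178, 165, 97, 183, 182)

# convert centimeters to inches: 1 cm = 0.3937 in
ht10_in <- ht10 * 0.3937
ht10_in
```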
Multiplying ht10 by 0.3937 works. But what if you had more data sets, all of them
containing height values in cm, and you needed to convert those values into inches?
Wouldn’t it be nice to have a dedicated function cm2in()? I know this is a toy
example, but I would find it convenient to have cm2in() because I can never
remember the conversion value: 1 cm = 0.3937 in.
R does not have a built-in function cm2in() but we can create one. Let’s see
how to do it “logically” step by step.
24.1.1 Writing a simple function
So, how do you create a function? The first step is to write code and make sure that it works. We recommend that you start with a small and concrete example, such as ht10 * 0.3937.
The next step is to make the code more general. Instead of working with the
object ht10 that refers to just the first 10 height values, we can give it
a more algebraic name such as x, and store the result of x * 0.3937 in an object y.
This also allows us to identify what is the input or inputs (x in this case),
what are the required computations, and what is the output (y in this case).
Then, you encapsulate the code that will make the body of the function within curly braces to form a compound R expression. This is definitely more of an abstract step, but here’s the code:
# 3) encapsulate code with "an R expression"
# i.e. wrapping code in curly braces
{
  y <- x * 0.3937
}
Finally, declare the function with function() and assign it a name. This
involves choosing a name for the function, choosing name(s) for the argument(s), and determining the output with a return() statement.
Here are all the steps previously described, plus some more steps after you
create cm2in():
# 1) concrete example
ht10 * 0.3937
# 2) make it more general
x <- ht10
y <- x * 0.3937
# 3) encapsulate code with "an R expression"
# i.e. wrapping code in curly braces
{
  y <- x * 0.3937
}
# 4) create function
cm2in <- function(x) {
  y <- x * 0.3937
  return(y)
}
# 5) test it
cm2in(ht10)
# 6) keep testing
cm2in(starwars$height)
If you want to convert 100 cm to inches, you simply call the function again with a different argument:
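For instance, a call with a single value, followed by a call with a short vector of hypothetical values, might look like this:

```r
# cm2in() as defined in the steps above
cm2in <- function(x) {
  y <- x * 0.3937
  return(y)
}

# a single value
cm2in(100)
#> [1] 39.37

# several values at once (hypothetical inputs)
cm2in(c(1, 10, 100))
```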
Notice that the function is vectorized. This is because its body uses an arithmetic operator (multiplication), and arithmetic operators in R (e.g. addition, subtraction, multiplication, division) work element by element on vectors.
Sometimes it is recommended to add a default value to one (or more) of the
arguments. In this case, we can give a default value of x = 1. When the
user executes the function without any input, cm2in() returns
the value of 1 cm to inches:
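A sketch of cm2in() with the default value x = 1 described above:

```r
# cm2in() with a default value for its argument
cm2in <- function(x = 1) {
  y <- x * 0.3937
  return(y)
}

# no input: the default x = 1 is used
cm2in()
#> [1] 0.3937

# an explicit input still overrides the default
cm2in(100)
#> [1] 39.37
```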
In Summary
To define a new function in R you use the function function().
You need to specify a name for the function, and then assign function() to the chosen name.
You also define its arguments, if any (i.e. the inputs of the function).
And of course, you must write the code (i.e. the body) so the function does something when you use it.
24.2 Anatomy of a function
To define a new function in R you use the function function().
You need to specify a name for the function, and then assign function()
to the chosen name. You also define its arguments, if any (i.e. the inputs).
And of course, you must write the code (i.e. the body) so the function does
something when you use it:
- Usually, you give a name to a function (although there are also anonymous functions).
- A function takes one or more inputs (or none), known as arguments.
- The expressions forming the operations comprise the body of the function.
- You wrap the body of a function with curly braces.
- A function returns a single value (i.e. a single object).
A slightly less abstract function could have the following structure:
- the name of this hypothetical function is some_name
- it uses several arguments: arg1, arg2, and arg3
- the body is wrapped within braces {...}
- in general, the last expression expression_n would be the returned output
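The structure described in this list can be sketched as follows. The arithmetic in the body is purely hypothetical and only stands in for the expressions of a real function; some_name, arg1, arg2, and arg3 are the placeholder names used in the list.

```r
# hypothetical function following the structure described above
some_name <- function(arg1, arg2, arg3) {
  total <- arg1 + arg2     # first expression (placeholder computation)
  scaled <- total * arg3   # second expression (placeholder computation)
  scaled                   # last expression: its value is the returned output
}

some_name(1, 2, 3)
#> [1] 9
```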
24.2.1 Scale Transformations
Let’s see another example. Often, we need to transform the scale of one or more variables. Perhaps the most common type of transformation is when we standardize a variable, that is: subtract its mean, and divide by its standard deviation:
\[ z = \frac{x - \mu}{\sigma} \]
R has the function scale() that can be used to perform this operation, but
let’s pretend for a minute that there’s no function in R to calculate standard
scores. Here are the primary steps to compute such a score:
- compute the mean \(\mu\)
- compute the standard deviation \(\sigma\)
- calculate deviations from the mean \(x - \mu\)
- divide deviations-from-mean by standard deviation \((x - \mu) / \sigma\)
x <- ht10
x_mean <- mean(x) # compute the mean
x_sd <- sd(x) # compute std dev
x_devs <- x - x_mean # deviations from the mean
z <- x_devs / x_sd # normalize by std dev
z
#> [1] 0.358 0.218 -1.770 1.199 -0.258 0.526 0.162 -1.742 0.666 0.638
Having the code of the function’s body, we can encapsulate it with a function assignment:
# first round
standardize <- function(x) {
  x_mean <- mean(x)
  x_sd <- sd(x)
  x_devs <- x - x_mean
  z <- x_devs / x_sd
  return(z)
}
And now we can test it:
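A quick check could be to call the function on ht10 and confirm that the result matches the values computed step by step earlier. In this sketch ht10 is entered by hand (values taken from the printed tibble) so the snippet runs on its own.

```r
# first 10 heights (in cm), entered by hand
ht10 <- c(172, 167, 96, 202, 150, 178, 165, 97, 183, 182)

# first round of standardize(), as defined above
standardize <- function(x) {
  x_mean <- mean(x)
  x_sd <- sd(x)
  x_devs <- x - x_mean
  z <- x_devs / x_sd
  return(z)
}

z <- standardize(ht10)
z

# standardized values have mean 0 and standard deviation 1
# (up to floating point error)
mean(z)
sd(z)
```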
24.2.2 The return() command
As you can tell, the last line in the body of standardize() uses the return()
function. More often than not, the return() command is included to explicitly
indicate the output of a function:
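To illustrate, here is a sketch that contrasts an explicit return() with relying on the last evaluated expression. The names cm2in_explicit and cm2in_implicit are made up for this comparison.

```r
# explicit: return() states what the output is
cm2in_explicit <- function(x) {
  y <- x * 0.3937
  return(y)
}

# implicit: the value of the last evaluated expression is returned
cm2in_implicit <- function(x) {
  x * 0.3937
}

# both versions give the same output
identical(cm2in_explicit(100), cm2in_implicit(100))
#> [1] TRUE
```

Both forms work; the explicit return() simply makes the output easier to spot when reading the body.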
I’ve seen that many users with previous programming experience in other languages
prefer to use print(). The main reason is that other programming languages
tend to use some sort of print statement to indicate the output of a function.
However, the dedicated function in R to specify the output of a function
is return(). You could use print(), but I strongly suggest that you use
return() instead. The reason is that print() is a generic function in R,
which means that print() is not a single function but a family of methods
that behave differently depending on the class of the object they are
printing. In addition, print() displays the value on the console every time
the function is called, even when the result is assigned to an object.
So, to play it safe, stick with return().
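The difference can be seen in a small sketch. The names cm2in_print and cm2in_return are hypothetical and exist only for this comparison.

```r
# variant that ends with print()
cm2in_print <- function(x) {
  y <- x * 0.3937
  print(y)
}

# variant that ends with return()
cm2in_return <- function(x) {
  y <- x * 0.3937
  return(y)
}

a <- cm2in_print(100)   # the value is displayed even though it is being assigned
b <- cm2in_return(100)  # nothing is displayed; the value is only stored in b

# print() hands back its argument invisibly, so the stored values agree
identical(a, b)
```

The stored values are the same, but the print() variant writes to the console on every call, which is rarely what you want inside a function.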
24.2.3 More Testing
What about applying standardize() on the entire column height?
standardize(starwars$height)
#> [1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
#> [26] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
#> [51] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
#> [76] NA NA NA NA NA NA NA NA NA NA NA NA
Oops! Because starwars$height contains missing values, mean() and sd() both
return NA, and so every value computed by our standardize() function is NA as well.
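The behavior is easy to reproduce on a tiny vector. The values below are hypothetical and chosen only to include one missing value.

```r
# hypothetical heights with one missing value
x <- c(150, 160, NA, 180)

mean(x)       # NA: by default the missing value propagates
sd(x)         # NA for the same reason
x - mean(x)   # every deviation is NA, so every standard score is NA too
```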
24.2.4 Dealing with missing values
How do we deal with NA’s? Many functions in R like sum(), mean(), and
median() have the so-called na.rm argument to specify whether missing values
should be removed before carrying out certain computations. If a function has this
argument, you can take advantage of it by using na.rm = TRUE:
# second round
standardize <- function(x) {
  x_mean <- mean(x, na.rm = TRUE)
  x_sd <- sd(x, na.rm = TRUE)
  x_devs <- x - x_mean
  z <- x_devs / x_sd
  return(z)
}
standardize(ht10)
#> [1] 0.358 0.218 -1.770 1.199 -0.258 0.526 0.162 -1.742 0.666 0.638
standardize(starwars$height)
#> [1] -0.0678 -0.2116 -2.2536 0.7950 -0.7005 0.1047 -0.2691 -2.2248 0.2485
#> [10] 0.2198 0.3923 0.1623 1.5427 0.1623 -0.0391 0.0185 -0.1253 0.1623
#> [19] -3.1164 -0.1253 0.2485 0.7375 0.4499 0.0760 0.0185 0.1623 -0.7005
#> [28] NA -2.4837 -0.4129 0.5361 0.4786 -0.1253 0.6224 1.4277 0.9100
#> [37] 0.2485 -1.0744 -1.7934 0.2485 -0.3267 0.0185 0.1623 0.1047 -2.3111
#> [46] -1.5058 -0.3267 0.3923 0.6799 0.6224 -0.0966 0.2773 0.3923 2.5781
#> [55] 0.3923 0.6224 0.3061 -0.4992 0.2485 0.2485 -0.1253 -0.2404 -0.2691
#> [64] 0.5361 0.4786 0.2485 -0.1829 0.6799 1.5715 1.1113 -0.2116 -2.7425
#> [73] -2.2536 0.5361 0.4786 0.1047 1.1976 1.7153 0.3923 0.1047 0.9100
#> [82] NA NA NA NA NA -0.2691
Now standardize() is able to return a more useful output by removing missing
values. However, we should let the user decide if NA’s must be removed. We
can include an argument na.rm in standardize() to indicate whether missing
values are to be removed:
# third round
standardize <- function(x, na.rm = FALSE) {
  x_mean <- mean(x, na.rm = na.rm)
  x_sd <- sd(x, na.rm = na.rm)
  x_devs <- x - x_mean
  z <- x_devs / x_sd
  return(z)
}
Notice that standardize() uses an argument na.rm that is set to FALSE
by default. Likewise, we pass the value of na.rm to the
arguments of the same name in mean() and sd().
# default call
standardize(starwars$height)
#> [1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
#> [26] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
#> [51] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
#> [76] NA NA NA NA NA NA NA NA NA NA NA NA
# removing NAs
standardize(starwars$height, na.rm = TRUE)
#> [1] -0.0678 -0.2116 -2.2536 0.7950 -0.7005 0.1047 -0.2691 -2.2248 0.2485
#> [10] 0.2198 0.3923 0.1623 1.5427 0.1623 -0.0391 0.0185 -0.1253 0.1623
#> [19] -3.1164 -0.1253 0.2485 0.7375 0.4499 0.0760 0.0185 0.1623 -0.7005
#> [28] NA -2.4837 -0.4129 0.5361 0.4786 -0.1253 0.6224 1.4277 0.9100
#> [37] 0.2485 -1.0744 -1.7934 0.2485 -0.3267 0.0185 0.1623 0.1047 -2.3111
#> [46] -1.5058 -0.3267 0.3923 0.6799 0.6224 -0.0966 0.2773 0.3923 2.5781
#> [55] 0.3923 0.6224 0.3061 -0.4992 0.2485 0.2485 -0.1253 -0.2404 -0.2691
#> [64] 0.5361 0.4786 0.2485 -0.1829 0.6799 1.5715 1.1113 -0.2116 -2.7425
#> [73] -2.2536 0.5361 0.4786 0.1047 1.1976 1.7153 0.3923 0.1047 0.9100
#> [82] NA NA NA NA NA -0.2691
24.2.5 Simplifying the body
So far we have a working function standardize() that does the job and takes
care of potential missing values. We can take a further step and review the
code of the body. Let’s go back to the initial code:
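For reference, these are the step-by-step computations from the Scale Transformations section. Here ht10 is entered by hand (values taken from the printed tibble) so the snippet runs on its own.

```r
# first 10 heights (in cm), entered by hand
ht10 <- c(172, 167, 96, 202, 150, 178, 165, 97, 183, 182)

x <- ht10
x_mean <- mean(x)      # compute the mean
x_sd <- sd(x)          # compute std dev
x_devs <- x - x_mean   # deviations from the mean
z <- x_devs / x_sd     # normalize by std dev
z
```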
The code above works, but it is somewhat “verbose”. We can take advantage of R’s functional behavior to shorten the computation of the standard scores in one line:
x <- ht10
z <- (x - mean(x)) / sd(x)
z
#> [1] 0.358 0.218 -1.770 1.199 -0.258 0.526 0.162 -1.742 0.666 0.638
Having simplified the code, we can simplify our standardize() function:
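One possible simplified version is sketched below. It assumes we want to keep the na.rm argument introduced in the third round, and the name standardize_verbose is used here only to compare against the longer body.

```r
# simplified version: the standard scores are computed in one line
standardize <- function(x, na.rm = FALSE) {
  z <- (x - mean(x, na.rm = na.rm)) / sd(x, na.rm = na.rm)
  return(z)
}

# third-round version, kept under another name for comparison
standardize_verbose <- function(x, na.rm = FALSE) {
  x_mean <- mean(x, na.rm = na.rm)
  x_sd <- sd(x, na.rm = na.rm)
  x_devs <- x - x_mean
  z <- x_devs / x_sd
  return(z)
}

# hypothetical test vector with a missing value
heights <- c(172, 167, 96, NA, 150)
all.equal(standardize(heights, na.rm = TRUE),
          standardize_verbose(heights, na.rm = TRUE))
#> [1] TRUE
```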