1 Introduction

At its heart, computing involves working with numbers. That’s the main reason why computers were invented: to facilitate mathematical operations around numbers; from basic arithmetic to more complex operations (e.g. trigonometry, algebra, calculus, etc.) Nowadays, however, we use computers to work with data that are not just numbers. We use them to write a variety of documents, we use them to create and edit images and videos, to manipulate sound, among many other tasks. Learning to manipulate those data types is fundamental to programming.

Today, there is a considerable amount of information and data in the form of text. Look at any website: pretty much the contents are text and images, with some videos here and there, and maybe some tables and/or a list of numbers.

Likewise, most of the times you are going to be working with text files: script files, reports, data files, source code files, etc. All the R script files that you use are essentially plain text files. I bet you have a csv file or any other field delimited format (or even in HTML, XML, JSON, etc), with some fields containing characters. In all of these cases what you are working with is essentially a bunch of characters.

At the end of the day all the data that is passed to the computer is converted to binary format (zeros and ones) so computers can process it. But no one can deny the fact that a lot of what we do with computers is working with text and character strings.

And then inside R you also have text. Things like row names and column names of matrices, data frames, tables, and any other rectangular data structure. Lists and vectors may also contain names. And what about the text in graphics? Things like titles, subtitles, axis labels, legends, colors, displayed text in a plot, etc. Text is omnipresent: whether you are we are surrounded by it.

1.1 About this book

This book aims to help you get started with handling strings in R. It provides an overview of several resources that you can use for string manipulation. It covers useful functions, general topics, common operations, and other tricks.

This book is NOT about textual data analysis, linguistic analysis, text mining, or natural language processing (NLP). For those purposes, I highly recommend taking a look at the CRAN Task View on Natural Language Processing (NLP):

http://cran.r-project.org/web/views/NaturalLanguageProcessing.html

However, even if you don’t plan to do text analysis, text mining, or natural language processing, I bet you have some dataset that contains data as characters: names of rows, names of columns, dates, monetary quantities, longitude and latitude, etc. I’m sure that you have encountered one or more of the following cases:

  • You want to remove a given character in the names of your variables
  • You want to replace a given character in your data
  • Maybe you wanted to convert labels to upper case (or lower case)
  • You’ve been struggling with xml (or html) files
  • You’ve been modifying text files in excel changing labels, categories, one cell at a time, or doing one thousand copy-paste operations

Hopefully after reading this book, you will have the necessary tools in your toolbox for dealing with these (and many) other situations that involve handling and processing strings in R.

1.1.1 Structure of the book

The content of the book is divided in five major parts:

  1. Getting started with character strings
  2. Formatting and printing text and numbers
  3. Input and output
  4. Basic string manipulations
  5. Working with Regular Expressions

If you have minimal or none experience with R, the best place to start is Chapter 2: Characters. If you are already familiar with the basics of vectors and character objects, you can quickly skim this chapter, or skip it, and then go to another chapter of your interest.

Chapter 2 describes different ways to format text and numbers. These are useful tools for when you want to produce output that will either be displayed on screen, or that will be exported to a file.

The third major component of the book has to do with reading in information from text files, as well as exporting output to text to files.

The fourth part of the book deals with basic string manipulations. By “basic” I mean any type of manipulation and transformation that does not require the use of regular expressions.

The fifth part comprises working with regular expressions. Here you will learn about the basic concepts around regular expressions (regex), the intricacies when working with regex in R, and becoming familiar with the regex functions in the R package stringr.

Last but not least, the last chapters of the book present a couple of case studies and extended practical examples that cover the main topics covered in the book.

1.1.2 Main Resources

Documentation on how to manipulate strings and text data in R is very scarce. This is mostly because R is not perceived as a scripting language (like Python or Java, among others). However, I seriously think that we need to have more available resources about this indispensable topic.

There is not much documentation on how to manipulate character strings and text data in R. There are great R books for an enormous variety of statistical methods, graphics and data visualization, as well as applications in a wide range of fields such as ecology, genetics, psychology, finance, economics, etc. But not for manipulating strings and text data.

Perhaps the main reason for this lack of resources is that R is not considered to be qualified as a “scripting” language: R is primarily perceived as a language for computing and programming with (mostly numeric) data. Quoting Hadley Wickham

“R provides a solid set of string operations, but because they have grown organically over time, they can be inconsistent and a little hard to learn. Additionally, they lag behind the string operations in other programming languages, so that some things that are easy to do in languages like Ruby or Python are rather hard to do in R”

Most introductory books about R have small sections that briefly cover string manipulation without going further down. That is why I don’t have many books for recommendation, if anything the book by Phil Spector Data Manipulation with R.

If published material is not abundant, we still have the online world. The good news is that the web is full of hundreds of references about processing character strings. The bad news is that they are very spread and uncategorized.

For specific topics and tasks, a good place to start with is Stack Overflow. This is a questions-and-answers site for programmers that has a lot of questions related with R. Just look for those questions tagged with "r": http://stackoverflow.com/questions/tagged/r. There is a good number of posts related with handling characters and text, and they can give you a hint on how to solve a particular problem. There is also R-bloggers, http://www.r-bloggers.com, a blog aggregator for R enthusiasts in which is also possible to find contributed material about processing strings as well as text data analysis.

You can also check the following resources that have to do with string manipulations. It is a very short list of resources but I’ve found them very useful:

1.1.3 Acknowledgements

This book is a major iteration on a previous ebook that I wrote in 2013: Handling and Processing Strings in R. As you can tell, I’ve shorten the title to just Handling Strings with R. I’ve also expanded the content to include many more examples, code snippets, and material about regular expressions.

As always, many thanks to Jessica who patiently accepted my occupying of the dinning table as my workbench (that’s over now).

1.1.4 Colophon

The source of the book is available in the github repository:

https://github.com/gastonstat/r4strings.

The book is powered by https://bookdown.org.

This book was built with:

#> R version 3.3.3 (2017-03-06)
#> Platform: x86_64-apple-darwin13.4.0 (64-bit)
#> Running under: OS X Yosemite 10.10.5
#> 
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  base     
#> 
#> loaded via a namespace (and not attached):
#>  [1] backports_1.1.2 magrittr_1.5    bookdown_0.7    rprojroot_1.3-2
#>  [5] htmltools_0.3.6 tools_3.3.3     rstudioapi_0.7  yaml_2.1.16    
#>  [9] Rcpp_0.12.15    stringi_1.1.6   rmarkdown_1.8   knitr_1.18     
#> [13] methods_3.3.3   stringr_1.2.0   digest_0.6.14   xfun_0.1       
#> [17] evaluate_0.10.1