1 Introduction

The Web is full of information and resources that can serve as sources of data. Statisticians, data analysts, data scientists, researchers, and data users in general are increasingly working on projects that depend on various data sources, many of which come from, or are available on, the Web.

We can use a wide array of approaches to get data from the Web. For instance, we can scrape data from human-readable webpages. We can also use application programming interfaces (APIs) to request data sets. The data may come in a markup language such as HTML (the most common case) or XML, but it can also come as a JSON document or in some other self-describing format. Consequently, you need to be prepared to deal with data from the Web in any of these forms.
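To make these formats a little more concrete, here is a minimal sketch of reading a small HTML fragment and a small JSON document in R. It assumes the "xml2", "rvest", and "jsonlite" packages (listed in section 1.2) are installed. The HTML and JSON strings are made up for illustration and do not come from any real website or API.

```r
library(xml2)
library(rvest)
library(jsonlite)

# A tiny, made-up HTML page (hypothetical content, not from a real site)
html_string <- '<html><body>
  <h1>Planets</h1>
  <ul><li>Mercury</li><li>Venus</li></ul>
</body></html>'

# Parse the HTML and extract the text of every <li> element
html_doc <- read_html(html_string)
planets <- html_text(html_elements(html_doc, "li"))

# A tiny, made-up JSON document (hypothetical content)
json_string <- '{"name": "Venus", "moons": 0}'

# Convert the JSON text into an R list
planet <- fromJSON(json_string)

print(planets)
print(planet$name)
```

In practice the input would usually be a URL or the body of an HTTP response rather than a literal string, but the parsing step looks much the same.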

1.1 Suggested Tools

To enjoy the content of this text, and also to be able to replicate the examples discussed in subsequent chapters, you will need the following tools:

  • A fairly recent version of R

  • A fairly recent version of RStudio

  • A web browser (e.g. Chrome, Safari, Firefox, Opera)

  • A good Internet connection

1.2 Suggested R Packages

The code and examples shown in this book are based on the following packages:

  • "tidyverse", which contains, among other packages:

    • "dplyr": for manipulation of data tables
    • "stringr": for manipulation of strings and text data

  • "xml2": tools for parsing XML and HTML documents

  • "httr": tools for working with HTTP requests

  • "rvest": for harvesting or scraping web data in an easy way

  • "jsonlite": functions for handling JSON data
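One way to check whether these packages are already available on your machine is sketched below. This is only a suggestion, not a required setup step. The variable names are illustrative, and the `install.packages()` call is left commented out so that nothing is downloaded unless you choose to run it.

```r
# Packages used throughout the text
pkgs <- c("tidyverse", "xml2", "httr", "rvest", "jsonlite")

# TRUE for each package that can be loaded, FALSE otherwise
is_installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)

# Report anything that is missing
missing_pkgs <- pkgs[!is_installed]
if (length(missing_pkgs) > 0) {
  message("Not installed yet: ", paste(missing_pkgs, collapse = ", "))
  # install.packages(missing_pkgs)  # uncomment to install from CRAN
}
```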

R has a large collection of packages for interacting with the Web. A comprehensive list of packages for dealing with Web technologies is available in the following CRAN Task View (curated by Mauricio Vargas Sepulveda):

https://cran.r-project.org/web/views/WebTechnologies.html

1.3 Some Acronyms

As you’ll see later in this book, there are a number of acronyms commonly used around all things Web. I will define and explain each acronym in its corresponding chapter. In the meantime, I would like to give you a first exposure to the following terms:

  • WWW: World Wide Web

  • URL: Uniform Resource Locator

  • HTTP: HyperText Transfer Protocol

  • XML: Extensible Markup Language

  • HTML: HyperText Markup Language

  • JSON: JavaScript Object Notation

  • API: Application Programming Interface