16 BMC Journals Data

16.1 Introduction

In this example we will analyze some text data: an old catalog of journals from BioMed Central (BMC), a scientific publisher that specializes in open access journal publication. You can find more information about BMC at: https://www.biomedcentral.com/journals

The journal catalog is no longer available on BioMed Central’s website, but you can find a copy in the book’s GitHub repository:

https://raw.githubusercontent.com/gastonstat/r4strings/master/data/biomedcentral.txt

To download a copy of the text file to your working directory, run the following code:
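One way to sketch the download step is with base R's download.file(). Only the URL comes from the text above; the names biomed_url and fetch_biomed() are illustrative choices, not the book's own code.

```r
# URL of the copy stored in the book's GitHub repository (from the text above)
biomed_url <- "https://raw.githubusercontent.com/gastonstat/r4strings/master/data/biomedcentral.txt"

# Hypothetical helper: download the file unless a local copy already exists
fetch_biomed <- function(destfile = "biomedcentral.txt") {
  if (!file.exists(destfile)) {
    download.file(url = biomed_url, destfile = destfile)
  }
  invisible(destfile)
}

# Uncomment to download into the working directory:
# fetch_biomed()
```

Checking file.exists() first avoids fetching the file again every time the script is re-run.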

To import the data into R you can read the file with read.table():
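The call below is a minimal sketch of the import, run on a three-line toy input so that it works without the downloaded file. The comma separator, the column names, and the values are assumptions made for illustration; check them against the actual file before relying on them.

```r
# Toy stand-in for the file contents; the real file has 7 columns.
# Separator, column names, and values are assumptions (values are invented).
toy_lines <- c(
  "Publisher,Journal.name,Start.Date",
  "BioMed Central Ltd,AIDS Research and Therapy,2004",
  "BioMed Central Ltd,Addiction Science & Clinical Practice,2002"
)

# For the real data, replace `text = toy_lines` with `file = "biomedcentral.txt"`
biomed_toy <- read.table(
  text = toy_lines,
  header = TRUE,             # first line holds the column names
  sep = ",",                 # assumed field separator
  stringsAsFactors = FALSE   # keep text columns as character, not factor
)

biomed_toy$Journal.name
```

Setting stringsAsFactors = FALSE matters here because the string functions used later expect character vectors.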

You can check the structure of the data with the function str():
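As a small illustration of what str() reports, here is a sketch on a hand-built data frame. The object biomed_toy, its column names, and its values are invented stand-ins for the real biomed data.

```r
# Toy data frame standing in for biomed (names and values are invented)
biomed_toy <- data.frame(
  Journal.name = c("AMB Express", "Acta Veterinaria Scandinavica"),
  Start.Date = c(2011L, 1959L),
  stringsAsFactors = FALSE
)

# Compact overview: dimensions, column names, types, and first values
str(biomed_toy)

# The same type information, one entry per column
col_types <- sapply(biomed_toy, class)
col_types
```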

As you can see, the data frame biomed has 336 observations and 7 variables. All the variables except Start.Date are in character mode.

16.2 Analyzing Journal Names

We will do a simple analysis of the journal names. The goal is to find out which terms are most common in the journal titles. We are going to keep things at a basic level, but for a more formal (and sophisticated) analysis you can check the "tm" (text mining) package by Ingo Feinerer.

To have a better idea of what the data looks like, let’s check the first journal names.

As you can tell, the fifth journal "Addiction Science & Clinical Practice" has an ampersand symbol (&). Whether to keep the ampersand and other punctuation symbols depends on the objectives of the analysis. In our case, we will remove those elements.

16.2.1 Preprocessing

The preprocessing step consists of getting rid of the punctuation symbols. For convenience, I recommend that you start working with a small subset of the data. In this way you can experiment at a small scale until you are confident that the manipulations are right. Let’s take the first 10 journals:
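Taking the subset can be sketched as follows. The vector journal_names is a hand-typed stand-in for the column of journal names in biomed (assumed here to be called Journal.name), and titles10 is an illustrative name.

```r
# Hand-typed stand-in for biomed$Journal.name (the real column has 336 titles)
journal_names <- c(
  "AIDS Research and Therapy", "AMB Express",
  "Acta Neuropathologica Communications", "Acta Veterinaria Scandinavica",
  "Addiction Science & Clinical Practice", "Agriculture & Food Security",
  "Algorithms for Molecular Biology"
)

# Keep at most the first 10 titles. head() returns all 7 toy titles here;
# on the 336 real titles it returns the first 10.
titles10 <- head(journal_names, n = 10)
titles10
```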

We want to get rid of the ampersand signs &, as well as other punctuation marks. This can be done with str_replace_all(), replacing the pattern [[:punct:]] with empty strings "" (don’t forget to load the "stringr" package).
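A minimal sketch of this replacement, assuming the "stringr" package is installed; titles_toy and no_punct are illustrative names.

```r
library(stringr)

# A few titles typed in by hand for illustration
titles_toy <- c(
  "Addiction Science & Clinical Practice",
  "Agriculture & Food Security",
  "AMB Express"
)

# Replace every punctuation character with an empty string
no_punct <- str_replace_all(titles_toy, pattern = "[[:punct:]]", replacement = "")
no_punct
```

Note that deleting the ampersand leaves the two blank spaces that surrounded it side by side.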

We successfully replaced the punctuation symbols with empty strings, but now we have extra whitespace. To remove it we will use str_replace_all() again, this time replacing any run of one or more whitespace characters, matched by the pattern \\s+, with a single blank space " ".
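The whitespace cleanup can be sketched like this, again assuming "stringr" is installed. The input mimics what is left after the punctuation has been removed; the object names are illustrative.

```r
library(stringr)

# Titles as they look after punctuation removal (note the double space)
no_punct <- c("Addiction Science  Clinical Practice", "AMB Express")

# Collapse each run of one or more whitespace characters into one blank
clean_titles10 <- str_replace_all(no_punct, pattern = "\\s+", replacement = " ")
clean_titles10
```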

Once we have a better idea of how to preprocess the journal names, we can proceed with all the 336 titles.
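One possible way to apply both steps to every title is to wrap them in a small function. The helper clean_titles() is hypothetical, and the column name Journal.name in the comment is an assumption about the data frame.

```r
library(stringr)

# Hypothetical helper: remove punctuation, then collapse extra whitespace
clean_titles <- function(x) {
  x <- str_replace_all(x, "[[:punct:]]", "")
  str_replace_all(x, "\\s+", " ")
}

# With the real data this would be: all_titles <- clean_titles(biomed$Journal.name)
all_titles <- clean_titles(c(
  "Addiction Science & Clinical Practice",
  "BMC Ear, Nose and Throat Disorders"
))
all_titles
```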

The next step is to split each title into its individual terms (the output is a list).
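A sketch of the splitting step with str_split() from "stringr", using a blank space as the separator; the two cleaned titles and the object names are illustrative.

```r
library(stringr)

# Two already-cleaned titles, typed in by hand
all_titles <- c("AMB Express", "Addiction Science Clinical Practice")

# One list element per title, each holding that title's words
all_titles_list <- str_split(all_titles, pattern = " ")
all_titles_list
```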

16.3 Summary statistics

So far we have a list that contains the words of each journal name. Wouldn’t it be interesting to know more about the distribution of the number of terms in each title? This means that we need to calculate how many words are in each title. To get these numbers let’s use length() within sapply(), and then let’s tabulate the obtained frequencies:
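The counting and tabulating can be sketched on a toy list. Base R's strsplit() is used here only to build a small stand-in for the list of title words; the object names are illustrative.

```r
# Toy stand-in for the list of words per title
all_titles_list <- strsplit(
  c("AMB Express", "Acta Veterinaria Scandinavica",
    "Genome Biology", "AIDS Research and Therapy"),
  split = " "
)

# Number of words in each title
words_per_title <- sapply(all_titles_list, length)

# How many titles have 2 words, 3 words, and so on
table(words_per_title)
```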

We can also express the distribution as percentages, and we can get some summary statistics with summary():
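A minimal sketch of both computations, using invented word counts rather than the real ones. prop.table() is one way to turn the frequencies into proportions; dividing the table by the number of titles would work as well.

```r
# Invented word counts for five titles (not the real data)
words_per_title <- c(2L, 3L, 2L, 4L, 3L)

# Distribution of title lengths as percentages
pct <- 100 * prop.table(table(words_per_title))
pct

# Minimum, quartiles, mean, and maximum
summary(words_per_title)
```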

Looking at the summary statistics, we can say that around 30% of journal names have 2 words, and that the median number of words per title is 3.

Interestingly, the maximum value is 9 words. What is the journal with 9 terms in its title? We can find the longest journal name as follows:
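One way to look up the longest title is to compare each word count against the maximum. The sketch below uses three hand-typed titles, so the result is not the 9-word journal from the real data.

```r
# Toy cleaned titles and their word counts
all_titles <- c("AMB Express", "AIDS Research and Therapy", "Genome Biology")
words_per_title <- sapply(strsplit(all_titles, split = " "), length)

# Title(s) whose word count equals the maximum
longest <- all_titles[which(words_per_title == max(words_per_title))]
longest
```

Comparing against max() returns every title tied for the maximum, whereas which.max() would return only the first one.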

16.4 Common words

Remember that our main goal with this example is to find out which words are the most common in the journal titles. To answer this question we first need to create something like a dictionary of words. How do we get such a dictionary? Easy, we just have to obtain a vector containing all the words in the titles:
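Flattening the list into a single vector can be sketched with unlist(); the toy list stands in for the list of words per title.

```r
# Toy list of words per title
all_titles_list <- list(
  c("AMB", "Express"),
  c("Genome", "Biology"),
  c("Genome", "Medicine")
)

# One long character vector with every word, repeats included
title_words <- unlist(all_titles_list)
title_words
```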

Applying unique() to the vector title_words we get the desired dictionary of terms, which has a total of 441 words.
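On a toy vector the step looks like the sketch below; unique_words and num_unique_words are illustrative names, and the count of 5 refers to the toy input, not to the 441 words of the real data.

```r
# Toy vector of title words, with one repeated word
title_words <- c("AMB", "Express", "Genome", "Biology", "Genome", "Medicine")

# Each distinct word once, in order of first appearance
unique_words <- unique(title_words)

# Size of the dictionary
num_unique_words <- length(unique_words)
num_unique_words
```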

Once we have the unique words, we need to count how many times each of them appears in the titles. Here’s a way to do that:
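A loop-based count can be sketched as follows on invented words. Naming the elements of count_words is an addition made here for readability.

```r
# Invented title words (not the real data)
title_words <- c("Genome", "Biology", "Genome", "Medicine", "BMC",
                 "Biology", "BMC", "Genomics", "BMC")
unique_words <- unique(title_words)

# Start with zero counts, then fill in one word at a time
count_words <- rep(0, length(unique_words))
for (i in seq_along(unique_words)) {
  count_words[i] <- sum(title_words == unique_words[i])
}

# Label each count with its word
names(count_words) <- unique_words
count_words
```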

An alternative, simpler way to count the number of word occurrences is to use the table() function on title_words:
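The same counts in a single call, sketched on invented words:

```r
# Invented title words (not the real data)
title_words <- c("Genome", "Biology", "Genome", "Medicine", "BMC",
                 "Biology", "BMC", "Genomics", "BMC")

# table() returns one count per distinct word, with names sorted alphabetically
count_words_alt <- table(title_words)
count_words_alt
```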

In either case (count_words or count_words_alt), we can examine the obtained frequencies with a simple table:
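A sketch of that inspection on invented words. The first table shows how many words occur once, twice, and so on. The sort() call, which lists the most frequent words first, is an addition made here.

```r
# Invented title words (not the real data)
title_words <- c("Genome", "Biology", "Genome", "Medicine", "BMC",
                 "Biology", "BMC", "Genomics", "BMC")
count_words_alt <- table(title_words)

# Frequency of frequencies: how many words appear 1, 2, 3, ... times
freq_of_counts <- table(as.vector(count_words_alt))
freq_of_counts

# Most common words first
top_words <- sort(count_words_alt, decreasing = TRUE)
head(top_words, n = 3)
```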