18 Matching HTML Tags
In this chapter we review an example that deals with some basic handling of HTML tags. The data for this practical application is the webpage for the R mailing lists: http://www.r-project.org/mail.html (see screenshot below).
If you visit the previous webpage you will see that there are five general mailing lists devoted to R:
R-announce is where major announcements about the development of R and the availability of new code are made.
R-help is the main R mailing list for discussion about problems and solutions using R.
R-package-devel is a list for getting help with package development in R.
R-devel is a list intended for questions and discussion about code development in R.
R-packages is a list of announcements on the availability of new or enhanced contributed packages.
Additionally, there are several specific Special Interest Group (SIG) mailing lists. Here’s a screenshot with some of the special groups:
18.1 Attributes href
As a simple example, suppose we wanted to get the href attributes of all the SIG links. For instance, the href attribute of the R-SIG-Mac link is:
https://stat.ethz.ch/mailman/listinfo/r-sig-mac
In turn, the href attribute of the R-SIG-DB link is:
https://stat.ethz.ch/mailman/listinfo/r-sig-db
If we take a peek at the HTML source code of the webpage, we’ll see that each link sits on a line like this one (shown here wrapped, but a single line in the source):
"<li><p><a href=\"https://stat.ethz.ch/mailman/listinfo/r-sig-mac\">
<code>R-SIG-Mac</code></a>: R Special Interest Group on Mac ports of R</p></li>"
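The backslashes in the line above are not part of the HTML; they are how R prints double quotes inside a character string. A minimal sketch, using a shortened copy of that line (the name html_line is only for illustration), makes the distinction visible:

```r
# a shortened copy of the R-SIG-Mac line, typed as an R string
html_line <- "<li><p><a href=\"https://stat.ethz.ch/mailman/listinfo/r-sig-mac\">"

# print() shows the escaped form, cat() shows the raw characters
print(html_line)
cat(html_line, "\n")

# the string contains plain double quotes and no backslashes
has_quote <- grepl('href="https', html_line, fixed = TRUE)
has_backslash <- grepl("\\", html_line, fixed = TRUE)
has_quote
has_backslash
```

This is why a regex pattern only has to match a plain double quote after href=, not a backslash followed by a quote.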
18.1.1 Getting SIG links
The first step is to create a vector of character strings that will contain the lines of the mailing lists webpage. We can create this vector by simply passing the URL to readLines():
# read html content
mail_lists = readLines("http://www.r-project.org/mail.html")
In case you have problems downloading the HTML file, you can also find a copy in the GitHub repository for the data sets of this book. You can use the code below to download a copy of the file to your working directory:
# download file
github <- "https://raw.githubusercontent.com/gastonstat/strings-data"
textfile <- "/main/data/mail.html"
download.file(url = paste0(github, textfile), destfile = "mail.html")
Once you have the file in your working directory, you can import it into R with readLines():
mail_lists <- readLines("mail.html")
The first elements in mail_lists are:
head(mail_lists)
[1] "<!DOCTYPE html>"
[2] "<html lang=\"en\">"
[3] " <head>"
[4] " <meta charset=\"utf-8\">"
[5] " <meta http-equiv=\"X-UA-Compatible\" content=\"IE=edge\">"
[6] " <meta name=\"viewport\" content=\"width=device-width, initial-scale=1\">"
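To see what readLines() returns without depending on a network connection, the following sketch writes three lines of HTML to a temporary file and reads them back. The file contents and the names tmp and toy_lines are made up for illustration; they are not part of the mailing lists page.

```r
# write a tiny, made-up HTML file to a temporary location
tmp <- tempfile(fileext = ".html")
writeLines(c("<!DOCTYPE html>", "<html lang=\"en\">", "</html>"), tmp)

# readLines() returns a character vector with one element per line
toy_lines <- readLines(tmp)
length(toy_lines)
toy_lines[2]

# clean up the temporary file
unlink(tmp)
```

Because every line of the file becomes one element of the vector, a pattern anchored with ^ and $ is applied to one line of HTML at a time.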
Once we’ve read the HTML content of the R mailing lists webpage, the next step is to define our regex pattern that matches the SIG links.
'^.*<p><a href="(https.*)">.*$'
Let’s examine the proposed pattern. By using the caret ^ and dollar sign $ we anchor the pattern so that it describes an entire line. Next to the caret we match anything zero or more times, followed by a <p> tag. Immediately after it comes an anchor tag with its href attribute. Note that we are using double quotation marks to match the value of the href attribute ("(https.*)"), and that the parentheses define a capture group around the link. Moreover, the entire regex pattern is surrounded by single quotation marks ' ', so the double quotes inside it need no escaping. Here is how we can get the SIG links:
# SIG's href pattern
sig_pattern = '^.*<p><a href="(https.*)">.*$'
# find SIG href attributes
sig_hrefs = grep(sig_pattern, mail_lists, value = TRUE)
# let's see first 5 elements
head(sig_hrefs, n = 5)
[1] "<li><p><a href=\"https://stat.ethz.ch/mailman/listinfo/r-sig-mac\"><code>R-SIG-Mac</code></a>: R Special Interest Group on Mac ports of R</p></li>"
[2] "<li><p><a href=\"https://stat.ethz.ch/mailman/listinfo/r-sig-db\"><code>R-SIG-DB</code></a>: R SIG on Database Interfaces</p></li>"
[3] "<li><p><a href=\"https://stat.ethz.ch/mailman/listinfo/r-sig-debian\"><code>R-SIG-Debian</code></a>: R Special Interest Group for Debian ports of R</p></li>"
[4] "<li><p><a href=\"https://stat.ethz.ch/mailman/listinfo/r-sig-dynamic-models\"><code>R-SIG-dynamic-models</code></a>: Special Interest Group for Dynamic Simulation Models in R</p></li>"
[5] "<li><p><a href=\"https://stat.ethz.ch/mailman/listinfo/r-sig-ecology\"><code>R-SIG-ecology</code></a>: Using R in ecological data analysis</p></li>"
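It can help to try the pattern on a few lines before running it on the whole page. The sketch below uses two of the lines shown above plus one line from the head of the file; the vector sample_lines is an illustrative stand-in for mail_lists.

```r
sig_pattern <- '^.*<p><a href="(https.*)">.*$'

# two SIG lines and one line that should not match
sample_lines <- c(
  "<li><p><a href=\"https://stat.ethz.ch/mailman/listinfo/r-sig-mac\"><code>R-SIG-Mac</code></a>: R Special Interest Group on Mac ports of R</p></li>",
  " <meta charset=\"utf-8\">",
  "<li><p><a href=\"https://stat.ethz.ch/mailman/listinfo/r-sig-db\"><code>R-SIG-DB</code></a>: R SIG on Database Interfaces</p></li>"
)

# grepl() returns one logical value per line
is_sig <- grepl(sig_pattern, sample_lines)
is_sig

# grep() returns the positions of the matching lines by default ...
sig_positions <- grep(sig_pattern, sample_lines)
sig_positions

# ... and the matching lines themselves when value = TRUE
matched <- grep(sig_pattern, sample_lines, value = TRUE)
length(matched)
```

Using value = TRUE, as in the code above, skips the extra step of subsetting the vector by position.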
We need to get rid of the extra HTML tags. We can easily extract the links using the sub() function (since there is only one link per line, we don’t need to use gsub(), although we could).
# get first matched group
sigs = sub(sig_pattern, '\\1', sig_hrefs)
sigs
[1] "https://stat.ethz.ch/mailman/listinfo/r-sig-mac"
[2] "https://stat.ethz.ch/mailman/listinfo/r-sig-db"
[3] "https://stat.ethz.ch/mailman/listinfo/r-sig-debian"
[4] "https://stat.ethz.ch/mailman/listinfo/r-sig-dynamic-models"
[5] "https://stat.ethz.ch/mailman/listinfo/r-sig-ecology"
[6] "https://stat.ethz.ch/mailman/listinfo/r-sig-epi"
[7] "https://stat.ethz.ch/mailman/listinfo/r-sig-fedora"
[8] "https://stat.ethz.ch/mailman/listinfo/r-sig-finance"
[9] "https://stat.ethz.ch/mailman/listinfo/r-sig-geo"
[10] "https://stat.ethz.ch/mailman/listinfo/r-sig-gr"
[11] "https://stat.ethz.ch/mailman/listinfo/r-sig-gui"
[12] "https://stat.ethz.ch/mailman/listinfo/r-sig-hpc"
[13] "https://stat.ethz.ch/mailman/listinfo/r-sig-insurance"
[14] "https://stat.ethz.ch/mailman/listinfo/r-sig-jobs"
[15] "https://stat.ethz.ch/mailman/listinfo/r-sig-mediawiki"
[16] "https://stat.ethz.ch/mailman/listinfo/r-sig-meta-analysis"
[17] "https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models"
[18] "https://stat.ethz.ch/mailman/listinfo/r-sig-networks"
[19] "https://stat.ethz.ch/mailman/listinfo/r-sig-phylo"
[20] "https://stat.ethz.ch/mailman/listinfo/r-sig-qa"
[21] "https://stat.ethz.ch/mailman/listinfo/r-sig-robust"
[22] "https://stat.ethz.ch/mailman/listinfo/r-sig-teaching"
As you can see, we are using the backreference \\1 as the replacement in the sub() function. Generally speaking, \\N is replaced with the N-th group specified in the regular expression. The first matched group is referenced by \\1. In our example, the first group is everything that is contained in the parentheses, that is: (https.*), which is in fact the link we are looking for. Because the pattern matches the entire line, replacing the match with \\1 leaves only the link.
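The same mechanism extends to more than one group. As a sketch, the pattern below adds a second pair of parentheses around the text inside the <code> tag, so that \\1 refers to the link and \\2 to the list name. The pattern two_groups and the variable sig_line are assumptions made for this illustration, not part of the original example.

```r
# one SIG line, copied from the output shown earlier
sig_line <- "<li><p><a href=\"https://stat.ethz.ch/mailman/listinfo/r-sig-mac\"><code>R-SIG-Mac</code></a>: R Special Interest Group on Mac ports of R</p></li>"

# group 1: the link; group 2: the name inside <code>...</code>
two_groups <- '^.*<p><a href="(https.*)"><code>(.*)</code>.*$'

sig_url <- sub(two_groups, '\\1', sig_line)
sig_name <- sub(two_groups, '\\2', sig_line)

# both groups can also be combined in a single replacement
sig_both <- sub(two_groups, '\\2 -> \\1', sig_line)

sig_url
sig_name
sig_both
```

Groups are numbered by the position of their opening parenthesis, from left to right, so adding a group before (https.*) would shift the link to \\2.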