18 Matching HTML Tags
18.1 Introduction
In this example we will deal with some basic handling of HTML tags. The data for this practical application is the webpage for the R mailing lists: http://www.r-project.org/mail.html (see screenshot below)
If you visit the previous webpage you will see that there are five general mailing lists devoted to R:
- R-announce is where major announcements about the development of R and the availability of new code.
- R-help is the main R mailing list for discussion about problems and solutions using R.
- R-package-devel is to get help about package development in R
- R-devel is a list intended for questions and discussion about code development in R.
- R-packages is a list of announcements on the availability of new or enhanced contributed packages.
Additionally, there are several specific (SIG) mailing lists. Here’s a screenshot with some of the special groups:
18.2 Attributes href
As a simple example, suppose we wanted to get the href
attributes of all the SIG links. For instance, the href
attribute of the R-SIG-Mac link is: https://stat.ethz.ch/mailman/listinfo/r-sig-mac
In turn the href
attribute of the R-sig-DB link is: https://stat.ethz.ch/mailman/listinfo/r-sig-db
If we take a peek at the html source-code of the webpage, we’ll see that all the links can be found on lines like this one:
"<li><p><a href=\"https://stat.ethz.ch/mailman/listinfo/r-sig-mac\"><code>R-SIG-Mac</code></a>: R Special Interest Group on Mac ports of R</p></li>"
18.2.1 Getting SIG links
The first step is to create a vector of character strings that will contain the lines of the mailing lists webpage. We can create this vector by simply passing the URL name to readLines()
:
The first elements in mail_lists
are:
head(mail_lists)
#> [1] "<!DOCTYPE html>"
#> [2] "<html lang=\"en\">"
#> [3] " <head>"
#> [4] " <meta charset=\"utf-8\">"
#> [5] " <meta http-equiv=\"X-UA-Compatible\" content=\"IE=edge\">"
#> [6] " <meta name=\"viewport\" content=\"width=device-width, initial-scale=1\">"
Once we’ve read the HTML content of the R mailing lists webpage, the next step is to define our regex pattern that matches the SIG links.
'^.*<p><a href="(https.*)">.*$'
Let’s examine the proposed pattern. By using the caret ^
and dollar sign $
we can describe our pattern as an entire line. Next to the caret we match anything zero or more times followed by a <td>
tag. Then there is a blank space matched zero or more times, followed by an anchor tag with its href
attribute. Note that we are using double quotation marks to match the href
attribute ("(https.*)"
). Moreover, the entire regex pattern is surrounded by single quotations marks ' '
. Here is how we can get the SIG links:
# SIG's href pattern
sig_pattern = '^.*<p><a href="(https.*)">.*$'
# find SIG href attributes
sig_hrefs = grep(sig_pattern, mail_lists, value = TRUE)
# let's see first 5 elements
head(sig_hrefs, n = 5)
#> [1] "<li><p><a href=\"https://stat.ethz.ch/mailman/listinfo/r-sig-mac\"><code>R-SIG-Mac</code></a>: R Special Interest Group on Mac ports of R</p></li>"
#> [2] "<li><p><a href=\"https://stat.ethz.ch/mailman/listinfo/r-sig-db\"><code>R-SIG-DB</code></a>: R SIG on Database Interfaces</p></li>"
#> [3] "<li><p><a href=\"https://stat.ethz.ch/mailman/listinfo/r-sig-debian\"><code>R-SIG-Debian</code></a>: R Special Interest Group for Debian ports of R</p></li>"
#> [4] "<li><p><a href=\"https://stat.ethz.ch/mailman/listinfo/r-sig-dynamic-models\"><code>R-SIG-dynamic-models</code></a>: Special Interest Group for Dynamic Simulation Models in R</p></li>"
#> [5] "<li><p><a href=\"https://stat.ethz.ch/mailman/listinfo/r-sig-ecology\"><code>R-SIG-ecology</code></a>: Using R in ecological data analysis</p></li>"
We need to get rid of the extra html tags. We can easily extract the names of the note files using the sub()
function (since there is only one link per line, we don’t need to use gsub()
, although we could).
# get first matched group
sigs = sub(sig_pattern, '\\1', sig_hrefs)
sigs
#> [1] "https://stat.ethz.ch/mailman/listinfo/r-sig-mac"
#> [2] "https://stat.ethz.ch/mailman/listinfo/r-sig-db"
#> [3] "https://stat.ethz.ch/mailman/listinfo/r-sig-debian"
#> [4] "https://stat.ethz.ch/mailman/listinfo/r-sig-dynamic-models"
#> [5] "https://stat.ethz.ch/mailman/listinfo/r-sig-ecology"
#> [6] "https://stat.ethz.ch/mailman/listinfo/r-sig-epi"
#> [7] "https://stat.ethz.ch/mailman/listinfo/r-sig-fedora"
#> [8] "https://stat.ethz.ch/mailman/listinfo/r-sig-finance"
#> [9] "https://stat.ethz.ch/mailman/listinfo/r-sig-geo"
#> [10] "https://stat.ethz.ch/mailman/listinfo/r-sig-gr"
#> [11] "https://stat.ethz.ch/mailman/listinfo/r-sig-gui"
#> [12] "https://stat.ethz.ch/mailman/listinfo/r-sig-hpc"
#> [13] "https://stat.ethz.ch/mailman/listinfo/r-sig-insurance"
#> [14] "https://stat.ethz.ch/mailman/listinfo/r-sig-jobs"
#> [15] "https://stat.ethz.ch/mailman/listinfo/r-sig-mediawiki"
#> [16] "https://stat.ethz.ch/mailman/listinfo/r-sig-meta-analysis"
#> [17] "https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models"
#> [18] "https://stat.ethz.ch/mailman/listinfo/r-sig-networks"
#> [19] "https://stat.ethz.ch/mailman/listinfo/r-sig-phylo"
#> [20] "https://stat.ethz.ch/mailman/listinfo/r-sig-qa"
#> [21] "https://stat.ethz.ch/mailman/listinfo/r-sig-robust"
#> [22] "https://stat.ethz.ch/mailman/listinfo/r-sig-teaching"
As you can see, we are using the regex pattern \\1
in the sub()
function. Generally speaking \\N
is replaced with the N
-th group specified in the regular expression. The first matched group is referenced by \\1
. In our example, the first group is everything that is contained in the curved brackets, that is: (https.*)
, which are in fact the links we are looking for.
Make a donation
If you find this resource useful, please consider making a one-time donation in any amount. Your support really matters.