Converting HTML files to PDF
Posted on November 10, 2017
Using wkhtmltopdf to convert HTML files into PDF format.
For one of the courses I’m teaching this semester (Stat 133: Concepts in Computing with Data), I asked students to write a blog post. To be more precise, I asked them to write a report, in the form of a blog post, about one or more of the central topics covered in the course such as:
- Data Visualization
- Data Manipulation (reshaping, wrangling, formatting, tidying)
- Programming for data analysis
- Data Technologies
- Reporting Tools
The posts were submitted in HTML format. To keep things simple I
asked them to use an R Markdown (Rmd) file, and knit it as an HTML
document (the default knitting option), which they then uploaded to a BOX folder.
This means I ended up with about 270 HTML files that I wanted to share
with all the students (and with the rest of the world). The issue was that
HTML files don’t get rendered nicely in BOX or in GitHub.
So my problem became: How do I convert 270 HTML files into PDF format … in an efficient way? I knew I could manually open each file, and then save it as PDF. But I didn’t want to repeat a handful of steps 270 times!
Luckily, I found a command line tool called wkhtmltopdf, which is exactly what I needed.
wkhtmltopdf is an open source command line tool to render HTML into PDF
format using the Qt WebKit rendering engine. To convert an HTML file
to PDF you simply run the wkhtmltopdf command, giving it the input followed
by the name of the output file. The homepage of wkhtmltopdf shows an example
that converts the Google home page and saves it as a PDF.
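A minimal sketch of such a call is shown here. It assumes wkhtmltopdf is installed and on the PATH, and the output name google.pdf is only illustrative. The guard around the call is an addition for safety and is not required by the tool.

```shell
#!/bin/sh
# General form: wkhtmltopdf <input URL or HTML file> <output PDF>
if command -v wkhtmltopdf >/dev/null 2>&1; then
  # Render the page at the given URL and save it as google.pdf
  wkhtmltopdf http://google.com google.pdf || echo "conversion failed" >&2
else
  echo "wkhtmltopdf not found; install it first" >&2
fi
```

The same two-argument form works with a local HTML file in place of the URL.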
To convert all the files at once I wrote a shell script called
convert2pdf.sh (see code below). The script relies on
the following assumptions:
- all the HTML files are in a folder called htmls
- all HTML files have the extension .html
- the converted PDFs will be stored in a separate folder
Here’s what the assumed file structure would look like:
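For experimenting, a throwaway layout consistent with these assumptions can be created with a few commands. This is only a sketch: the names sandbox, pdfs, post1.html and post2.html are hypothetical placeholders, and placing convert2pdf.sh next to the two folders is an assumption.

```shell
#!/bin/sh
# Hypothetical sandbox: only htmls and convert2pdf.sh are names from the post
mkdir -p sandbox/htmls sandbox/pdfs

# Two empty placeholder HTML files and an empty placeholder script
touch sandbox/htmls/post1.html sandbox/htmls/post2.html
touch sandbox/convert2pdf.sh

# Show the resulting layout
find sandbox | sort
```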
Here’s the content of the convert2pdf.sh script:
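A minimal sketch of such a script, put together from the description that follows, might look like this. The output folder name pdfs, the exact sed expression and the wording of the echo message are assumptions, and the pdfs folder is assumed to exist already.

```shell
#!/bin/bash

# Names of the HTML files in htmls/, with the .html extension removed
files=$(ls htmls | sed 's/\.html$//')

# Convert each HTML file into a PDF with the same base name
for file in $files
do
  # Informative message: which file is being converted right now
  echo "converting $file"
  # pdfs/ is an assumed output folder name
  wkhtmltopdf --dpi 1000 "htmls/$file.html" "pdfs/$file.pdf"
done
```

Because the loop splits the list on whitespace, this sketch only works for file names without spaces.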
What’s going on?
The first command creates a variable files that contains
the names of the HTML files (without the file extension). More specifically,
I use ls to list the contents of the htmls directory, and then
I pipe the output to a sed command. The sed command basically
replaces the file extension with nothing (i.e. removes the file extension).
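To see what the ls and sed combination produces, here is a small self-contained sketch that uses a temporary folder. The file names post1.html and post2.html are hypothetical, and the sed expression is one possible way to strip the extension.

```shell
#!/bin/sh
# Throwaway folder with hypothetical file names
demo=$(mktemp -d)
mkdir "$demo/htmls"
touch "$demo/htmls/post1.html" "$demo/htmls/post2.html"

# List the folder and strip the trailing .html from each name
files=$(ls "$demo/htmls" | sed 's/\.html$//')
echo "$files"
# prints:
# post1
# post2
```

Anchoring the pattern with $ makes sed remove .html only at the end of a name, so a file such as my.html.notes.html keeps the inner part intact.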
The second part of the script consists of a for loop. At each iteration
of the loop two commands are invoked:
The echo command is not that important; it’s just an informative message
that displays the name of the file that is being converted at that iteration.
In case things go wrong and the file conversion stops, you may want to
know which file failed to be converted.
Then we have the wkhtmltopdf command that takes an input HTML file
and converts it into an output PDF file. As you can tell, this command
also uses the option --dpi 1000 (dots-per-inch). I found that I needed
to use this option to avoid generating a PDF file with microscopic
content. You probably want to try different dpi values and see which one
works best for you.
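One way to compare dpi values is to convert a single file several times and look at the results side by side. The sketch below assumes a hypothetical input file htmls/sample.html; the list of dpi values and the output names are arbitrary choices. The conversion is skipped when wkhtmltopdf or the input file is missing.

```shell
#!/bin/sh
# Hypothetical input file used only for the comparison
input="htmls/sample.html"

for dpi in 300 600 1000
do
  # One output file per dpi value, e.g. sample-dpi300.pdf
  output="sample-dpi${dpi}.pdf"
  echo "dpi $dpi -> $output"
  # Convert only if the tool and the input file are both available
  if command -v wkhtmltopdf >/dev/null 2>&1 && [ -f "$input" ]; then
    wkhtmltopdf --dpi "$dpi" "$input" "$output"
  fi
done
```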
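One option to run the shell script is to invoke it from the command line with a shell interpreter such as bash (for example, bash convert2pdf.sh).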
Once the loop is completed, the file structure should now look like this:
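With 270 files it can be handy to confirm that every HTML file ended up with a matching PDF. The following is a sketch of such a check, not part of the original script. The function name missing_pdfs and the folder name pdfs are hypothetical.

```shell
#!/bin/sh
# Print the base name of every HTML file in $1 that has no PDF in $2
missing_pdfs() {
  in_dir=$1
  out_dir=$2
  for f in "$in_dir"/*.html
  do
    # Skip the unexpanded pattern when the folder has no HTML files
    [ -e "$f" ] || continue
    name=$(basename "$f" .html)
    [ -f "$out_dir/$name.pdf" ] || echo "$name"
  done
}

# Check the assumed folders; prints nothing when no PDF is missing
missing_pdfs htmls pdfs
```

Any name printed by the check points to a file whose conversion failed or was skipped.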
In case you are curious (and have some free time), you can find the blog posts prepared by the Stat 133 students in the following GitHub repository:
Happy file conversion!