Visually Enforced

a blog by Gaston Sanchez


Converting HTML files to PDF

Posted on November 10, 2017

Using wkhtmltopdf to convert HTML files into PDF format.

Motivation

For one of the courses I’m teaching this semester (Stat 133: Concepts in Computing with Data), I asked students to write a blog post. To be more precise, I asked them to write a report, in the form of a blog post, about one or more of the central topics covered in the course such as:

  • Data Visualization
  • Data Manipulation (reshaping, wrangling, formatting, tidying)
  • Programming for data analysis
  • Data Technologies
  • Reporting Tools

The submission format of the posts was HTML. To keep things simple I asked them to write an R Markdown (Rmd) file, knit it as an HTML document (the default knitting option), and upload the result to a BOX folder. This means I ended up with about 270 HTML files that I wanted to share with all the students (and with the rest of the world). The issue was that HTML files don’t get rendered nicely in BOX or in GitHub.

So my problem became: How do I convert 270 HTML files into PDF format … in an efficient way? I knew I could manually open each file, and then save it as a PDF. But I didn’t want to repeat a handful of steps 270 times!

Luckily, I found a command line tool called wkhtmltopdf, which is exactly what I needed.

wkhtmltopdf is an open source command line tool that renders HTML into PDF format using the Qt WebKit rendering engine. To convert an HTML file to PDF you simply run the wkhtmltopdf command, followed by the input (a URL or an HTML file) and the name of the output PDF. Here’s the example used on the wkhtmltopdf homepage, which saves the Google homepage as a PDF:

wkhtmltopdf http://google.com google.pdf
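Before converting anything, it may be worth confirming that the tool is actually installed. The following is a minimal sketch, assuming a POSIX shell; the variable name tool_status is only illustrative:

```shell
#!/bin/sh

# check that wkhtmltopdf is on the PATH before attempting a conversion
if command -v wkhtmltopdf >/dev/null 2>&1
then
    tool_status="found"
    # print the installed version
    wkhtmltopdf --version
else
    tool_status="missing"
    echo "wkhtmltopdf not found; install it first" >&2
fi
```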

Shell script

To convert all the files at once I wrote a shell script called convert2pdf.sh (see code below). The script relies on the following assumptions:

  • all the HTML files are in a folder called htmls
  • all HTML files have extension .html
  • the converted PDFs will be stored in a folder called pdfs

Here’s what the assumed file structure would look like:

mydir/
    convert2pdf.sh
    htmls/
        post01-deb-nolan.html
        post01-ani-adhikari.html
        post01-bin-yu.html
        post01-phil-stark.html
        post01-fernando-perez.html
    pdfs/
        ...

Here’s the content of the convert2pdf.sh script:

#!/bin/sh

# names of files (without extension)
files=$(ls -1 htmls | sed -e 's/\.html$//')

# convert files
for file in $files
do
	echo "converting ${file}.html to ${file}.pdf"
	wkhtmltopdf --dpi 1000 "htmls/${file}.html" "pdfs/${file}.pdf"
done

What’s going on?

The first command creates a variable, files, that contains the names of the HTML files (without the file extension). More specifically, I’m using ls to list the contents of the htmls directory, and then I pipe the output to a sed command. The sed command replaces a trailing .html with nothing (i.e. it removes the file extension).
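To see what the sed expression does in isolation, here is a small sketch that feeds it a few made-up file names with printf instead of ls. The third name is hypothetical and is there only to show that the $ anchor restricts the match to the end of the name:

```shell
#!/bin/sh

# made-up file names, passed through the same sed expression as in the script
names=$(printf '%s\n' \
    post01-deb-nolan.html \
    post01-bin-yu.html \
    my.html-notes.html | sed -e 's/\.html$//')

# only the trailing ".html" is removed from each name
echo "$names"
# prints:
# post01-deb-nolan
# post01-bin-yu
# my.html-notes
```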

The second part of the script consists of a for loop. At each iteration of the loop two commands are invoked: echo and wkhtmltopdf.

The echo command is not that important; it just prints an informative message with the name of the file being converted at that iteration. In case things go wrong and the file conversion stops, you may want to know which file failed to be converted.
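If you want more than a message on screen, one possible variation is to check the exit status of each conversion and count the failures. This is only a sketch, not the script I actually used: the function name convert_all and its arguments are hypothetical, it loops over a glob instead of the ls output, and it assumes that the converter returns a non-zero exit status when a conversion fails.

```shell
#!/bin/sh

# convert_all IN_DIR OUT_DIR [CONVERTER COMMAND...]
# Converts every IN_DIR/*.html into OUT_DIR/*.pdf and reports failures.
# (hypothetical helper; the converter defaults to the command from the post)
convert_all() {
    in_dir=$1
    out_dir=$2
    shift 2
    # remaining arguments form the converter command
    [ "$#" -gt 0 ] || set -- wkhtmltopdf --dpi 1000
    failed=0
    for path in "$in_dir"/*.html
    do
        # if nothing matches, the glob stays literal: skip it
        [ -e "$path" ] || continue
        name=$(basename "$path" .html)
        echo "converting ${name}.html to ${name}.pdf"
        if ! "$@" "$path" "$out_dir/${name}.pdf"
        then
            echo "FAILED: ${name}.html" >&2
            failed=$((failed + 1))
        fi
    done
    echo "$failed file(s) failed"
    # return success only when every conversion succeeded
    [ "$failed" -eq 0 ]
}
```

With the file structure assumed above, the call would be convert_all htmls pdfs. Because the loop quotes every path, file names containing spaces should also work.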

Then we have the wkhtmltopdf command that takes an input HTML file and converts it into an output PDF file. As you can tell, this command also uses the option --dpi 1000 (dots-per-inch). I found that I needed to use this option to avoid generating a PDF file with microscopic content. You probably want to try different dpi values and see which one works best for you.
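One way to compare dpi values is to convert a single file several times, writing each result to a differently named PDF. The sketch below only prints the commands so you can inspect them first; the function name dpi_commands, the dpi values, and the -dpiN suffix in the output names are all illustrative choices, and it assumes file names without spaces.

```shell
#!/bin/sh

# dpi_commands INPUT DPI...
# Prints one wkhtmltopdf command per dpi value for a single input file.
# (hypothetical helper; output names get a -dpiN suffix)
dpi_commands() {
    input=$1
    shift
    name=$(basename "$input" .html)
    for dpi in "$@"
    do
        echo "wkhtmltopdf --dpi ${dpi} ${input} pdfs/${name}-dpi${dpi}.pdf"
    done
}

# show the commands for three candidate values
dpi_commands htmls/post01-deb-nolan.html 96 300 1000
```

Once the printed commands look right, piping the output to sh would run them, and you can then open the resulting PDFs side by side.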

One option to run the shell script is with the sh command:

sh convert2pdf.sh

Once the loop is completed, the file structure should now look like this:

mydir/
    convert2pdf.sh
    htmls/
        post01-deb-nolan.html
        post01-ani-adhikari.html
        post01-bin-yu.html
        post01-phil-stark.html
        post01-fernando-perez.html
    pdfs/
        post01-deb-nolan.pdf
        post01-ani-adhikari.pdf
        post01-bin-yu.pdf
        post01-phil-stark.pdf
        post01-fernando-perez.pdf

Et voilà!

In case you are curious (and have some free time), you can find the blog posts prepared by the Stat 133 students in the following GitHub repository:

https://github.com/ucb-stat133/stat133-posts-fall17

Happy file conversion!


Published in categories how-to  Tagged with convert  html  wkhtmltopdf