Using Python to Convert PDFs to Images

Python is like that annoying kid in school who thinks he can do anything, but in this case, he can! With Python you can build web applications, AI models, chatbots and just about anything you can imagine.

Python is also really good at acting as a wrapper for complex tasks. Often, these can be core level libraries that have bindings in Python. One common task that most developers perform is file manipulation—either writing or reading from a file, or manipulation of file types.

In this article, I’ll compare and contrast various Python libraries that can convert PDFs to images.

Installing Python

If you already have Python installed, you can skip this step. However, for those who haven’t, read on.

To get started, you can:

  1. Create a free ActiveState Platform account
  2. Download:
    1. ActivePython, which is a pre-built version of Python containing hundreds of packages that can help you solve your common tasks
    2. The “PDF To JPG” runtime, which contains a version of Python and most of the tools listed in this post so you can test them out for yourself.

NOTE: the simplest way to install the PDF to JPG environment is to first install the ActiveState Platform’s command line interface (CLI), the State Tool.

  • If you’re on Windows, you can use Powershell to install the State Tool:
    IEX(New-Object Net.WebClient).downloadString('https://platform.activestate.com/dl/cli/install.ps1')
  • If you’re on Linux / Mac, you can use curl to install the State Tool:
    sh <(curl -q https://platform.activestate.com/dl/cli/install.sh)

Once the State Tool is installed, just run the following command to download the build and automatically install it into a virtual environment.

state activate Pizza-Team/PDF-TO-JPG

And that’s it! You now have installed Python in a virtual environment.

Using Python to Convert PDFs to Images: Ghostscript for Manipulating PDFs

A very popular tool for manipulating PDF and PostScript formats is Ghostscript. It’s a C library that has bindings in Python in order to provide for easy access from various applications.

Ghostscript has been around since 1988, and the last release happened a few months ago (April 2019 as of this writing). It’s safe to say that this library is not only proven, but actively managed. However, be aware that it’s licensed with the GNU Affero General Public License (AGPL), which may prevent it from being a good fit for enterprise applications.

To get started, install the Python Ghostscript package:

"`
pip install ghostscript
"`

Let’s look at the code to convert a PDF file to an image. This is straightforward, and you will find most of the code in the PyPI documentation page.

import ghostscript
import locale

def pdf2jpeg(pdf_input_path, jpeg_output_path):
    args = ["pef2jpeg", # actual value doesn't matter
            "-dNOPAUSE",
            "-sDEVICE=jpeg",
            "-r144",
            "-sOutputFile=" + jpeg_output_path,
            pdf_input_path]

    encoding = locale.getpreferredencoding()
    args = [a.encode(encoding) for a in args]

    ghostscript.Ghostscript(*args)

pdf2jpeg(
    "...Fixate/ActiveState/pdf/a.pdf",
    "...Fixate/ActiveState/pdf/a.jpeg",
)

To execute the file, run:

"`
python gh.py
"`

You will encounter this error:

The last line says:

"`
RuntimeError: Can not find Ghostscript library (libgs)
"`

This means that the Ghostscript Python library we installed isn’t able to find the Ghostscript C library on the development machine. The Python package is just a wrapper around the C library that actually does all the work. So we need to do a second install in order to deploy the C library on our machine.

If you’re on a Mac with brew installed, you can just run:

"`
brew install ghostscript
"`

To see installation steps for other platforms, please visit the Ghostscript installation page.

Executing the script gh.py again will now perform the conversion of a PDF file named a.pdf into a graphic file named a.jpeg.

Ghostscript was first introduced to manage PostScript files, a file format used by printers and fax machines (yes, fax!). But even in the publishing industry, PostScript files have almost entirely been replaced by PDFs. Originally, PDFs were just compiled PostScript files, but since PDF v1.4, Adobe no longer uses PostScript as the basis of the PDF format. Even so, Ghostscript still includes both PDF and PostScript manipulation capabilities.

Advantages of Ghostscript:

  • Has been around for more than 30 years, and is still consistently maintained.
  • Has easy bindings for Python.
  • Has an extensive feature list.

Disadvantages of Ghostscript:

  • Needs the C library to be installed first, as the Python package is just a wrapper for the core C library that does the actual conversion.
  • AGPL-licensed, which may limit usage in commercial applications.

Using Python to Convert PDFs to Images: Poppler and pdf2image for PDF Conversion

Poppler is an open-source software utility built using C++ for rendering PDF documents. It is commonly used across Linux, GNOME and KDE systems. Its development is supported by freedesktop.org.

Poppler was initially launched in 2005 and is still actively supported. The Python package pdf2image is a Python wrapper for Poppler.

Since ActiveState’s Python already contains the pdf2image Python wrapper, all we need to install is the Poppler C++ library:

"`
brew install poppler
"`

Now, it’s extremely straightforward to convert a PDF to an image:

from pdf2image import convert_from_path

pages = convert_from_path('...Fixate/ActiveState/pdf/a.pdf', 500)
for page in pages:
    page.save('p2ijpg', 'JPEG')

Both Poppler and Ghostscript have the advantage of being mature software utility tools.  However, Ghostscript was created primarily to manage Postscript files, while Poppler—from its inception—was only meant to be a PDF manipulation tool. With Poppler, you can perform any action on PDF files, including creation, merging, and even converting. It pays to be built 15 years after your competition!

Advantages of pdf2image:

  • Has been around for almost 15 years, and is still consistently maintained.
  • Has easy bindings for Python.
  • pdf2image features an MIT license, which is generally acceptable for enterprise/commercial use.

Disadvantages of pdf2image:

  • It requires a C++ library to be installed, as the Python package is just a wrapper.

Using Python to Convert PDFs to Images: Extracting Data from PDF Files with PyPDF2

All the examples we’ve spoken about so far are Python wrappers for a much larger C or C++ codebase. With PyPDF2, the entire PDF manipulation logic is written only in Python. This means there is no need to install any other any other dependent libraries. However, this also means that while PyPDF2 is great at creating, adding and removing pages, it struggles to convert and extract textual data from a PDF file.

Let’s look at how text can be extracted from a PDF:

import PyPDF2

pdfFileObj = open('...Fixate/ActiveState/pdf/a.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)

print(pdfReader.numPages)

pageObj = pdfReader.getPage(0)
print(pageObj.extractText())

pdfFileObj.close()

With PyPDF2, it is quite simple to manipulate PDFs programmatically. The Python syntax is extremely intuitive. This would be useful in scenarios where information needs to be extracted and then processed in a larger workflow.

However, it’s important to note that text extraction is only possible when a PDF is programmatically created. If the PDF is just a scanned image of a document, PyPDF2 has nothing to extract other than the image file itself.

PyPDF2 also doesn’t have any capabilities to convert a PDF file into an image, which is understandable since it does not use any core PDF libraries. So if you want to convert your PDF to an image file, the best you can do is extract text and write it to an image file.

Advantages of PyPDF2:

  • Written entirely in Python, so there’s no “helper” library to install.
  • pdf2image features a BSD-3 license, which is generally acceptable for enterprise/commercial use.

Disadvantages of PyPDF2:

  • Very limited functionality for scanned PDF files.
  • Much slower compared to Ghostscript and pdf2image, since the code is pure Python.

Using Python to Convert PDFs to Images: Conclusions

Python is loaded with packages that make large, complex tasks achievable with just a few lines, and PDF manipulation is no different. Although a full-featured, Python-only package has yet to be released, solutions that act as wrappers around C/C++ libraries work great for converting PDF files directly to images. In this case. it’s really a toss up between Ghostscript and pdf2image unless your company frowns on AGPL-licensed code. But if you’re looking to just extract specific data from PDF files, PyPDF2 is a great Python-only solution.

Download ActivePython - call to action

Use ActivePython and accelerate your Python projects.

  • The #1 Python solution used by innovative enterprise teams
  • Comes pre-bundled with top Python packages
  • Spend less time resolving dependencies and more time on quality coding

Take a look at ActivePython

Swaathi Kakarla

Swaathi Kakarla

Guest blogger: Swaathi Kakarla is the co-founder and CTO at Skcript. She enjoys talking and writing about code efficiency, performance, and startups. In her free time, she finds solace in yoga, bicycling and contributing to open source.