Can Python Read PDF Files?


Python is a great tool for task automation: it makes working with text files and spreadsheets really easy. But can you use Python to read PDF files?

There are plenty of great Python libraries that can be used to parse PDF files, for example: PDFMiner, PyPDF2, tabula-py, slate, PDFQuery, xpdf_python, pdflib and PyMuPDF.

In this brief tutorial I’ll show you how to install and use each of these libraries to read PDFs.

1. Reading PDF File Contents With PDFMiner

PDFMiner is a library for extracting text from PDF documents. You can import it as a module in your Python scripts, but it also comes with a command-line interface, so you can invoke it directly from the shell as well.

Attention: The original pdfminer package is deprecated, as the repo has been abandoned by the original author. Make sure to install its community fork, pdfminer.six instead!

pip install pdfminer.six

CLI usage:

pdf2txt.py /path/to/your/file.pdf

If you want to use it in your Python script you can simply do:

import pdfminer.high_level
contents = pdfminer.high_level.extract_text("/path/to/your/file.pdf")
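The string returned by extract_text covers the whole document. pdfminer.six normally ends each page with a form feed character, so the text can be split back into pages. Below is a minimal sketch, assuming that behaviour holds for your version; split_pages and read_pdf_pages are hypothetical helper names, not part of the library.

```python
def split_pages(text):
    """Split extracted text into pages on the form feed character."""
    # Empty chunks (blank pages and the piece after the last form feed) are dropped.
    return [page.strip() for page in text.split("\f") if page.strip()]


def read_pdf_pages(path):
    """Extract a PDF's text with pdfminer.six and return one string per page."""
    # Imported lazily so split_pages stays usable without pdfminer.six installed.
    from pdfminer.high_level import extract_text
    return split_pages(extract_text(path))
```

Calling read_pdf_pages("/path/to/your/file.pdf") should then give a list with one string per non-empty page.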

2. Extracting Text With PyPDF2

PyPDF2 is a feature-rich Python library that makes manipulating PDF files easier. It can extract metadata, text and images, and can also modify PDF files by cropping, merging and splitting them.

You can install it by running:

pip install pypdf2

To read text from PDF files you can use the PdfReader class (called PdfFileReader in releases before PyPDF2 3.0), like so:

from PyPDF2 import PdfReader

contents = ""
with open("/path/to/your/file.pdf", 'rb') as f:
    reader = PdfReader(f)
    # Walk through every page and append its text
    for page in reader.pages:
        contents += page.extract_text()

This little snippet iterates through all the pages of the document and extracts the text content from each page one-by-one.
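If you would rather keep the pages separate instead of building one long string, the same loop can collect a list. This is only a sketch, assuming PyPDF2 3.x, where the reader class is PdfReader and pages are exposed through its pages attribute. collect_page_text and read_pdf are hypothetical helper names.

```python
def collect_page_text(pages):
    """Return the text of each page, using '' for pages that yield no text."""
    texts = []
    for page in pages:
        # extract_text() may come back empty, for example on scanned pages.
        texts.append(page.extract_text() or "")
    return texts


def read_pdf(path):
    """Open a PDF with PyPDF2 and return a list with one string per page."""
    # Imported lazily so collect_page_text can be used without PyPDF2 installed.
    from PyPDF2 import PdfReader
    with open(path, "rb") as f:
        return collect_page_text(PdfReader(f).pages)
```

Keeping one entry per page, even when it is empty, means the list index always matches the page position in the document.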

3. Importing Tabular Data Into Pandas With Tabula-py

Tabula-py is a more specialized tool: it focuses on reading tables from PDF files. It returns the data as pandas DataFrames, but you can also export it into TSV or CSV format.

Installation is simple with pip:

pip install tabula-py

Using it is pretty straightforward as well:

import tabula
dfs = tabula.read_pdf("/path/to/your/file.pdf", pages='all')

dfs will be a list of pandas DataFrames, one for each table that tabula-py manages to find inside the input file.

4. Slate

Slate is a wrapper around PDFMiner. It provides roughly the same feature set, but with a much cleaner, pythonic interface.

Installation:

pip install slate

Usage:

import slate

# PDF files are binary, so open the file in 'rb' mode
with open("/path/to/your/file.pdf", 'rb') as input_file:
    contents = slate.PDF(input_file)

contents will be a list of strings, where each element is the text of one page.

5. Scraping And Querying PDF Files With PDFQuery

If you need to do some more sophisticated manipulation of PDF data besides just dumping all the contents of the file as raw text, your best bet would be PDFQuery. It allows you to traverse the document tree, just like you would with an XML or HTML document.

PDFQuery supports both XPath and jQuery syntax for querying.

Installation:

pip install pdfquery

Usage:

import pdfquery
pdf = pdfquery.PDFQuery("/path/to/your/file.pdf")
# Parse the document; this has to happen before running any query
pdf.load()

The pdf variable will now contain a traversable and searchable representation of the PDF document. Contents of this document can be exported in an arbitrary, user-defined format.

You can also search the contents of the document, for example:

element = pdf.pq(':contains("text to find")')
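Besides :contains, the PDFQuery README also describes an :in_bbox selector that matches elements lying inside a bounding box. Here is a small sketch of how such a query might be put together, assuming a document that has already been loaded with pdf.load(). bbox_selector and text_in_box are hypothetical helper names, and the coordinates are whatever units the document tree reports in its x0/y0/x1/y1 attributes.

```python
def bbox_selector(x0, y0, x1, y1, element="LTTextLineHorizontal"):
    """Build a PDFQuery selector for elements fully inside a bounding box."""
    return '%s:in_bbox("%s, %s, %s, %s")' % (element, x0, y0, x1, y1)


def text_in_box(pdf, x0, y0, x1, y1):
    """Return the text found inside the box of a loaded PDFQuery document."""
    # pdf is expected to be a pdfquery.PDFQuery instance after pdf.load().
    return pdf.pq(bbox_selector(x0, y0, x1, y1)).text()
```

A common pattern is to find a label with :contains first, read its x0 and y0 attributes, and then query a box next to it for the value you are after.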

6. Xpdf_python

xpdf_python is a wrapper for xpdf. It can export PDF files to text format.

As always, installation is easy with pip:

pip install xpdf_python

To get the contents of a PDF file as a string:

from xpdf_python import to_text
contents = to_text("/path/to/your/file.pdf")

7. Pdflib

Pdflib provides Python bindings for the Poppler PDF library. Pdflib can be installed by running:

pip install pdflib

Parsing PDF files is pretty easy using pdflib:

from pdflib import Document
doc = Document("/path/to/your/file.pdf")
content = [line for page in doc for line in page.lines]

The above snippet will gather all the text in the PDF in the content variable line-by-line.

8. PyMuPDF

PyMuPDF provides Python bindings for MuPDF, a lightweight PDF/e-book viewer.

Installation:

pip install pymupdf

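Reading a PDF file into a variable:

import fitz  # PyMuPDF is imported under the name fitz

doc = fitz.open("/path/to/your/file.pdf")
# get_text() replaced the older getText() spelling in recent releases
content = [page.get_text() for page in doc]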

content will be a list of pages, containing the content of each page as a string element.
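Once you have a list like this, plain Python is enough for simple lookups. As an illustration, the sketch below returns the page numbers on which a keyword appears. pages_containing is a hypothetical helper, and it works on any list of page strings, not only the one produced by PyMuPDF.

```python
def pages_containing(content, keyword):
    """Return 1-based numbers of the pages whose text contains keyword."""
    needle = keyword.lower()  # compare case-insensitively
    return [number for number, text in enumerate(content, start=1)
            if needle in text.lower()]


# Example with made-up page texts standing in for a real document
sample = ["Intro", "Invoice total: 42", "invoice appendix"]
print(pages_containing(sample, "invoice"))  # prints [2, 3]
```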

Summary

Those were the 8 most popular Python libraries that can be used to read PDF data. So which one should you pick?

If you need to parse data tables, I’d definitely recommend tabula-py, as it exports directly to a pandas DataFrame.

If you want to programmatically search in a PDF file, or extract only parts of it, you should choose PDFQuery.

However, if you need nothing fancy and just want to dump the contents of the file, any of the others will do, but I’d probably go with pdflib or PyMuPDF. They are actively maintained, fast, robust, easy to install, and provide a clean interface to work with.
