Python is a great tool for task automation: it makes working with text files and data sheets really easy. But can you use Python to read PDF files?
There are plenty of Python libraries that can parse PDF files, for example PDFMiner, PyPDF2, tabula-py, slate, PDFQuery, xpdf_python, pdflib and PyMuPDF.
In this brief tutorial I’ll show you how to install and use each of these libraries to read PDFs.
1. Reading PDF File Contents With PDFMiner
PDFMiner is a library for extracting text from PDF files. It can be used as an importable module in your Python scripts, but it also comes with a command-line interface, so you can invoke pdfminer directly from the shell as well.
Attention: The original pdfminer package is deprecated, as the repo has been abandoned by the original author. Make sure to install its community fork, pdfminer.six:
pip install pdfminer.six
The package installs a pdf2txt.py script, which you can run from the command line:
pdf2txt.py /path/to/your/file.pdf
If you want to use it in your Python script you can simply do:
import pdfminer.high_level

contents = pdfminer.high_level.extract_text("/path/to/your/file.pdf")
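The string you get back covers the whole document. If you need the text page by page, one option is to split it afterwards. The sketch below assumes that pages are separated by a form feed character ("\f"), which is what the plain-text output of pdfminer.six appears to do; check this against your own files. The split_pages helper and the sample string are made up for illustration.

```python
def split_pages(text):
    """Split extracted text into per-page strings on form feed characters."""
    pages = text.split("\f")
    # A trailing form feed leaves an empty final chunk, so drop it.
    if pages and pages[-1].strip() == "":
        pages.pop()
    return [page.strip() for page in pages]


# Hypothetical sample standing in for the return value of extract_text().
sample = "Title page\n\fSecond page, line 1\nSecond page, line 2\n\f"
for number, page in enumerate(split_pages(sample), start=1):
    print(number, repr(page))
```

With a real file you would pass the contents variable from the snippet above instead of the sample string.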
2. Extracting Text With PyPDF2
PyPDF2 is a feature-rich Python library that makes manipulating PDF files easier. It can extract metadata, text and images, and can also modify PDF files by cropping, merging and splitting them.
You can install it by running:
pip install pypdf2
To read text from PDF files you can use the PdfReader class (named PdfFileReader in older releases; the old name was removed in PyPDF2 3.0), like so:
from PyPDF2 import PdfReader

contents = ""
with open("/path/to/your/file.pdf", "rb") as f:
    pdf = PdfReader(f)
    # Visit every page by index and append its text
    for page_num in range(len(pdf.pages)):
        page = pdf.pages[page_num]
        contents += page.extract_text()
This little snippet gets the number of pages from the metadata, then iterates through all the pages, and extracts the text content from each page one-by-one.
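One thing to watch out for: scanned or image-only pages contain no text layer, so text extraction returns an empty string for them. A minimal sketch of a loop that keeps track of such pages is shown below. It assumes each page object has an extract_text() method (the spelling used by recent PyPDF2 releases; older ones call it extractText()). The extract_pages helper and the FakePage class are hypothetical and exist only so the example runs without a PDF file.

```python
def extract_pages(pages):
    """Return (texts, empty_page_numbers) for an iterable of page objects.

    Each page only needs an extract_text() method (an assumption of this sketch).
    """
    texts = []
    empty = []
    for number, page in enumerate(pages, start=1):
        text = (page.extract_text() or "").strip()
        if not text:
            empty.append(number)  # probably a scanned or image-only page
        texts.append(text)
    return texts, empty


class FakePage:
    """Hypothetical stand-in for a real page object, used for the demo only."""

    def __init__(self, text):
        self._text = text

    def extract_text(self):
        return self._text


texts, empty = extract_pages([FakePage("First page"), FakePage(""), FakePage("Third page")])
print("Pages without text:", empty)
```

With a real document you would pass the reader's pages attribute to extract_pages instead of the fake pages.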
3. Importing Tabular Data Into Pandas With Tabula-py
Tabula-py is a more specialized tool: it reads tables from PDF files. It returns the data as pandas DataFrames, but you can also export it into TSV or CSV format.
Installation is simple with pip (tabula-py is a wrapper around the Java tool tabula-java, so you also need a Java runtime installed):
pip install tabula-py
Using it is pretty straightforward as well:
import tabula

dfs = tabula.read_pdf("/path/to/your/file.pdf", pages="all")
dfs will be a list of pandas DataFrames, one for each table that tabula-py manages to find inside the input file. (Versions of tabula-py before 2.0 returned a single DataFrame by default.)
4. Extracting Text With Slate
Slate is a wrapper around PDFMiner. It provides roughly the same feature set, but with a much cleaner, more Pythonic interface.
pip install slate
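import slate

# PDF files are binary, so open the file in 'rb' mode
with open("/path/to/your/file.pdf", "rb") as input_file:
    contents = slate.PDF(input_file)
contents will be a list of strings, where each element holds the text of one page.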
5. Scraping And Querying PDF Files With PDFQuery
If you need to do more sophisticated manipulation of PDF data than just dumping all the contents of the file as raw text, your best bet is PDFQuery. It allows you to traverse the document tree, just like you would with an XML or HTML document.
PDFQuery supports both XPath and jQuery syntax for querying.
pip install pdfquery
import pdfquery

pdf = pdfquery.PDFQuery("/path/to/your/file.pdf")
pdf.load()  # parse the file; this must run before any query
You can also search the contents of the document, for example:
element = pdf.pq(':contains("text to find")')
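To get a feel for what "traversing the document tree" means, here is a small illustration that uses only Python's standard library rather than PDFQuery itself. The XML below is a hand-written, simplified stand-in for a parsed PDF; its tag and attribute names are assumptions made for the example, not real PDFQuery output. The find_lines_containing helper is likewise hypothetical.

```python
import xml.etree.ElementTree as ET

# Hand-written, simplified stand-in for the element tree of a parsed PDF.
# Tag and attribute names are illustrative assumptions, not real output.
SAMPLE_TREE = """
<pdfxml>
  <LTPage page_index="0">
    <LTTextLineHorizontal x0="72.0" y0="700.0">Invoice number: 1234</LTTextLineHorizontal>
    <LTTextLineHorizontal x0="72.0" y0="680.0">Total due: 99.50</LTTextLineHorizontal>
  </LTPage>
  <LTPage page_index="1">
    <LTTextLineHorizontal x0="72.0" y0="700.0">Terms and conditions</LTTextLineHorizontal>
  </LTPage>
</pdfxml>
"""


def find_lines_containing(root, needle):
    """Return (page_index, text) for every text line that contains needle."""
    matches = []
    for page in root.findall("./LTPage"):
        # XPath-style query: every text line anywhere below this page
        for line in page.findall(".//LTTextLineHorizontal"):
            if needle in (line.text or ""):
                matches.append((int(page.get("page_index")), line.text))
    return matches


root = ET.fromstring(SAMPLE_TREE.strip())
print(find_lines_containing(root, "Total due"))
```

The :contains() selector shown earlier does the same kind of filtering in a single query.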
6. Converting PDF To Text With xpdf_python
xpdf_python is a wrapper for xpdf. It can export PDF files to text format.
As always, installation is easy with pip:
pip install xpdf_python
To get the contents of a pdf file as a string:
from xpdf_python import to_text

contents = to_text("/path/to/your/file.pdf")
7. Reading PDF Files With pdflib
Pdflib provides Python bindings for the Poppler PDF library. Pdflib can be installed by running:
pip install pdflib
Parsing pdf files is pretty easy using pdflib:
from pdflib import Document

pdf = Document("/path/to/your/file.pdf")
# Collect every line of every page into one flat list
content = [line for page in pdf for line in page.lines]
The above snippet will gather all the text in the PDF in the content variable line-by-line.
8. Reading PDF Files With PyMuPDF
PyMuPDF provides Python bindings for MuPDF, a lightweight PDF/e-book viewer.
pip install pymupdf
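Reading a PDF file into a variable:
import fitz  # PyMuPDF is imported under the name fitz

doc = fitz.open("/path/to/your/file.pdf")
# get_text() is the current spelling of the older getText() method
content = [page.get_text() for page in doc]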
content will be a list of pages, containing the content of each page as a string element.
Those were the 8 most popular Python libraries that can be used to read PDF data. So which one should you pick?
If you need to parse data tables, I’d definitely recommend tabula-py, as it exports directly to a pandas DataFrame.
If you want to programmatically search in a PDF file, or extract only parts of it, you should choose PDFQuery.
However, if you need nothing fancy and just want to dump the contents of the file, any of the others will do, but I’d probably go with pdflib or PyMuPDF. They are actively maintained, fast, robust, easy to install, and provide a clean interface to work with.
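Since several of these libraries are interchangeable for plain text dumps, a script can also try them in order of preference and use whichever one is installed. The sketch below shows one way to do that. The extract_with_fallback, with_pymupdf and with_pdfminer helpers are hypothetical names, and the sketch assumes that a missing library shows up as an ImportError when its wrapper is called.

```python
def extract_with_fallback(path, extractors):
    """Try each (name, function) pair in order; return (name, text) from the first that works."""
    errors = []
    for name, extract in extractors:
        try:
            return name, extract(path)
        except ImportError as exc:
            # The library behind this extractor is not installed; try the next one.
            errors.append(f"{name}: {exc}")
    raise RuntimeError("no PDF library available: " + "; ".join(errors))


def with_pymupdf(path):
    import fitz  # imported lazily so a missing package only fails here

    with fitz.open(path) as doc:
        return "\n".join(page.get_text() for page in doc)


def with_pdfminer(path):
    from pdfminer.high_level import extract_text

    return extract_text(path)


# Order of preference: first entry is tried first.
PREFERRED = [("PyMuPDF", with_pymupdf), ("pdfminer.six", with_pdfminer)]
```

Calling extract_with_fallback("/path/to/your/file.pdf", PREFERRED) then returns the name of the library that was used together with the extracted text.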