PyPDF

Language: Python

Data Science

PyPDF was originally created to provide an easy-to-use library for PDF manipulation in Python without relying on external tools. It has evolved over time to support modern PDF features and remains widely used for automating PDF-related workflows.

PyPDF is a pure Python library for working with PDF documents. It allows you to read, write, merge, split, and manipulate PDF files programmatically.

Installation

pip: pip install pypdf
conda: conda install -c conda-forge pypdf

Usage

PyPDF allows you to read text from PDFs, merge multiple PDFs into one, split PDFs into pages, rotate pages, encrypt/decrypt PDFs, and extract metadata. It integrates easily into Python scripts for automating document handling.

Reading a PDF

from pypdf import PdfReader
reader = PdfReader('document.pdf')
for page in reader.pages:
    print(page.extract_text())

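Opens a PDF and prints the text extracted from each page. Pages that contain only scanned images have no text layer, so `extract_text()` returns an empty string for them.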

Merging PDFs

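from pypdf import PdfWriter

# PdfMerger was deprecated and then removed in pypdf 5.0; PdfWriter.append replaces it
writer = PdfWriter()
writer.append('file1.pdf')  # appends every page of file1.pdf
writer.append('file2.pdf')
writer.write('merged.pdf')
writer.close()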

Combines multiple PDFs into a single merged PDF.

Splitting a PDF

from pypdf import PdfReader, PdfWriter
reader = PdfReader('document.pdf')
writer = PdfWriter()
writer.add_page(reader.pages[0])
writer.write('page1.pdf')

Extracts the first page of a PDF and saves it as a new PDF.

Rotating pages

from pypdf import PdfReader, PdfWriter
reader = PdfReader('document.pdf')
writer = PdfWriter()
page = reader.pages[0]
page.rotate(90)
writer.add_page(page)
writer.write('rotated.pdf')

Rotates the first page of a PDF 90 degrees clockwise and saves that page as a new PDF. The angle passed to `rotate()` must be a multiple of 90.

Adding metadata

from pypdf import PdfReader, PdfWriter
reader = PdfReader('document.pdf')
writer = PdfWriter()
writer.append_pages_from_reader(reader)
writer.add_metadata({'/Author': 'John Doe', '/Title': 'Sample PDF'})
writer.write('document_with_metadata.pdf')

Copies pages from an existing PDF and adds author and title metadata.

Encrypting a PDF

from pypdf import PdfReader, PdfWriter
reader = PdfReader('document.pdf')
writer = PdfWriter()
writer.append_pages_from_reader(reader)
writer.encrypt('password')
writer.write('encrypted.pdf')

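Encrypts the output PDF with a user password. By default pypdf uses RC4-128; passing `algorithm='AES-256'` to `encrypt()` selects stronger encryption, which needs the optional crypto dependency (`pip install pypdf[crypto]`).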

Error Handling

FileNotFoundError: Ensure the PDF file path is correct.
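PdfReadError (from `pypdf.errors`): The PDF may be corrupted or malformed. Check the file integrity. If the file is encrypted, call `reader.decrypt(password)` before accessing its pages.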
PermissionError: Ensure you have write permissions when creating or modifying PDFs.

Best Practices

Always close files or use context managers to avoid resource leaks.

Validate PDF input files before processing.

Use `PdfWriter` to create or modify PDFs instead of modifying `PdfReader` objects directly.

Keep backups when manipulating important PDFs.

Use metadata to improve document organization and searchability.