Language: Python
PyPDF was originally created to provide an easy-to-use library for PDF manipulation in Python without relying on external tools. It has evolved over time to support modern PDF features and remains widely used for automating PDF-related workflows.
PyPDF is a pure Python library for working with PDF documents. It allows you to read, write, merge, split, and manipulate PDF files programmatically.
pip install pypdf
conda install -c conda-forge pypdf
PyPDF allows you to read text from PDFs, merge multiple PDFs into one, split PDFs into pages, rotate pages, encrypt/decrypt PDFs, and extract metadata. It integrates easily into Python scripts for automating document handling.
from pypdf import PdfReader
reader = PdfReader('document.pdf')
for page in reader.pages:
    print(page.extract_text())
Opens a PDF and extracts text from each page.
from pypdf import PdfWriter
# PdfMerger was deprecated and then removed in pypdf 5.0; PdfWriter.append() replaces it.
writer = PdfWriter()
writer.append('file1.pdf')
writer.append('file2.pdf')
writer.write('merged.pdf')
writer.close()
Combines multiple PDFs into a single merged PDF.
from pypdf import PdfReader, PdfWriter
reader = PdfReader('document.pdf')
writer = PdfWriter()
writer.add_page(reader.pages[0])
writer.write('page1.pdf')
Extracts the first page of a PDF and saves it as a new PDF.
from pypdf import PdfReader, PdfWriter
reader = PdfReader('document.pdf')
writer = PdfWriter()
page = reader.pages[0]
page.rotate(90)
writer.add_page(page)
writer.write('rotated.pdf')
Rotates the first page of a PDF by 90 degrees clockwise and saves it.
from pypdf import PdfReader, PdfWriter
reader = PdfReader('document.pdf')
writer = PdfWriter()
writer.append_pages_from_reader(reader)
writer.add_metadata({'/Author': 'John Doe', '/Title': 'Sample PDF'})
writer.write('document_with_metadata.pdf')
Copies pages from an existing PDF and adds author and title metadata.
from pypdf import PdfReader, PdfWriter
reader = PdfReader('document.pdf')
writer = PdfWriter()
writer.append_pages_from_reader(reader)
writer.encrypt('password')
writer.write('encrypted.pdf')
Encrypts a PDF file with a password.
Always close files or use context managers to avoid resource leaks.
Validate PDF input files before processing.
Use `PdfWriter` to create or modify PDFs instead of modifying `PdfReader` objects directly.
Keep backups when manipulating important PDFs.
Use metadata to improve document organization and searchability.