Automating PDF Error Scanning with Python

Elisha Antunes
Jun 15

Updated: Jun 24

In any organization that processes large volumes of daily transactions—whether they're financial, operational, or policy-driven—it's common to generate automated reports summarizing those activities. These reports often span dozens or even hundreds of pages and are typically reviewed to confirm that all records processed correctly.

In my role, I was responsible for reviewing a set of these reports on a daily basis. The reports were generated in PDF format and included numerous records—each representing some form of transaction (e.g., record creation, update, cancellation, or financial activity).

While most records processed without issue, occasionally the system would generate an error message embedded within the report. It was my job to identify these errors and investigate them.

The challenge? These error messages were scattered within massive, text-dense documents, and the only way to find them was to open each PDF and scan through it manually, every day.

The Problem

Manual error detection in PDF reports is:

Time-consuming: Reviewing each document line by line to look for a small set of potential issues is inefficient.
Inconsistent: Human attention varies; errors could be overlooked or misread.
Non-scalable: As the volume of reports grew, so did the time required to review them.

What I needed was a way to programmatically check these reports for known error messages and surface them instantly, without reading each report by hand.

The Solution: Python-Powered PDF Error Scanner

To reduce this manual burden, I built a lightweight Python program that automates error detection in PDF reports. It identifies known error messages, captures related details (such as the policy number and page), and exports the findings to an Excel file for easy triage.

How It Works:

Text Extraction with PyPDF2
- Each PDF is read using PyPDF2, which parses and extracts plain text from every page.
- The script skips blank pages and any page where text extraction fails, then continues with the rest of the document.
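A minimal sketch of this extraction step, not the code from the repository. It assumes PyPDF2 2.x or later (where `PdfReader` and `page.extract_text()` are available); the function names `iter_page_texts` and `read_pdf_pages` are hypothetical.

```python
from typing import Iterable, Iterator, List, Tuple


def iter_page_texts(pages: Iterable) -> Iterator[Tuple[int, str]]:
    """Yield (1-based page number, text) for every page that contains text."""
    for page_number, page in enumerate(pages, start=1):
        try:
            # extract_text() can return an empty string or None on blank pages.
            text = page.extract_text() or ""
        except Exception:
            # Extraction failed for this page; skip it and keep going.
            continue
        if text.strip():
            yield page_number, text


def read_pdf_pages(pdf_path: str) -> List[Tuple[int, str]]:
    """Open a PDF with PyPDF2 and return its non-blank pages."""
    # Imported here so the helper above can be used without PyPDF2 installed.
    from PyPDF2 import PdfReader  # assumption: PyPDF2 2.x or later

    reader = PdfReader(pdf_path)
    return list(iter_page_texts(reader.pages))
```

Keeping the page loop separate from the file handling makes the skip logic easy to check with stand-in page objects, without needing a real PDF.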
Keyword Detection
- A set of known error phrases is stored in a configurable list. These include terms like POLICY NOT ADDED, DUPLICATE CARRIER POL#, and PAYMENT ERROR, among others.
- The script uses case-insensitive regular expressions to search each page’s text for these phrases.
Pattern Matching for Context
- Alongside the error phrases, the script searches for policy numbers using a defined regex pattern (e.g., a specific alphanumeric format).
- If a keyword is found, it records the filename, page number, keyword(s), and the policy number, or "Not Found" when no policy number appears on the page.
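The keyword and policy-number checks could look roughly like the sketch below. The three phrases come from the list above; the policy-number pattern (three letters followed by seven digits, as in ABC2500123) is an assumption made for illustration, and `scan_page` is a hypothetical name.

```python
import re
from typing import Dict, Optional

# Example phrases from the article; the real list is configurable and longer.
ERROR_KEYWORDS = ["POLICY NOT ADDED", "DUPLICATE CARRIER POL#", "PAYMENT ERROR"]

# One case-insensitive pattern per phrase. re.escape keeps characters such
# as "#" literal instead of treating them as regex syntax.
KEYWORD_PATTERNS = {
    keyword: re.compile(re.escape(keyword), re.IGNORECASE)
    for keyword in ERROR_KEYWORDS
}

# Assumed format for illustration only: three letters then seven digits.
POLICY_PATTERN = re.compile(r"\b[A-Z]{3}\d{7}\b")


def scan_page(filename: str, page_number: int, text: str) -> Optional[Dict[str, object]]:
    """Return one result row if the page contains a known error phrase."""
    found = [kw for kw, pattern in KEYWORD_PATTERNS.items() if pattern.search(text)]
    if not found:
        return None  # nothing to report for this page
    policy_match = POLICY_PATTERN.search(text)
    return {
        "Filename": filename,
        "Page Number": page_number,
        "Keywords Found": ", ".join(found),
        "Policy Number": policy_match.group(0) if policy_match else "Not Found",
    }
```

Because the phrases live in a plain list, adding a new error message means appending one string; the matching logic does not change.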
Folder-Based Batch Processing
- The program scans all .pdf files in a given folder. There's no need to point it at individual reports—it picks up everything automatically.
Export to Excel
- Results are consolidated into a structured DataFrame and exported to Excel with labeled columns. This makes follow-up investigation quick and clear.
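The batch and export steps might be sketched as follows. This is an illustration rather than the repository code: `find_pdfs` and `export_findings` are hypothetical names, the rows are assumed to be dictionaries keyed by the column labels shown in the sample output, and writing .xlsx files with pandas assumes the openpyxl package is installed.

```python
from pathlib import Path
from typing import Dict, List

# Column labels matching the sample output table.
COLUMNS = ["Filename", "Page Number", "Keywords Found", "Policy Number"]


def find_pdfs(folder: str) -> List[Path]:
    """Return every .pdf file directly inside the folder, sorted by path."""
    return sorted(
        path
        for path in Path(folder).iterdir()
        if path.is_file() and path.suffix.lower() == ".pdf"
    )


def export_findings(rows: List[Dict[str, object]], output_path: str):
    """Write the collected result rows to an Excel file with labeled columns."""
    # Imported here so folder scanning works even without pandas installed.
    import pandas as pd

    df = pd.DataFrame(rows, columns=COLUMNS)
    df.to_excel(output_path, index=False)  # .xlsx output requires openpyxl
    return df
```

A driver script would loop over `find_pdfs("reports")`, collect one row per flagged page, and pass the list to `export_findings(rows, "errors.xlsx")`; both the folder name and the output filename here are placeholders.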

Sample Output

Filename	Page Number	Keywords Found	Policy Number
report1.pdf	4	POLICY NOT ADDED, PAYMENT ERROR	ABC2500123
report2.pdf	2	DUPLICATE CARRIER POL#, CONTACT BROKER	Not Found

Want to see it in action?

Access my GitHub repository and follow the steps in the README to run the tool yourself using the sample PDF.

Takeaways

This tool delivered immediate, measurable benefits:

Time saved: What previously took 20–30 minutes daily now runs in seconds.
Cognitive relief: No more fatigue from repetitive text scanning.
Accuracy: Flags every occurrence of a configured error phrase in the text extracted from each page, so known errors are no longer missed through inattention.
Reusable: New keywords or policy-number formats can be added by editing the keyword list or the regex pattern, without altering the core logic.