Intelligent Document Processing Pipeline
A lending platform was spending 15 hours a day on manual document review. We built a six-stage pipeline that classifies, extracts, validates, and routes documents — reducing processing time by 95%.
The Situation
Every loan application came with a stack of supporting documents — pay stubs, bank statements, tax returns, identity documents, proof of address — arriving in every format imaginable. PDFs, scanned images, photos taken on phones, and the occasional Word document.
Their operations team of three was opening each document individually, reading through it, typing key data into their system, cross-referencing it against the application, and flagging anything that didn't match. At around 60 applications per day, it was consuming roughly 15 hours of combined staff time daily.
Error rates were climbing. Processing times were stretching. And the team was burning out.
"We needed a system that could handle the messy reality of how documents actually arrive — not a solution that only works when everything is a perfectly formatted PDF."
What We Built
A six-stage pipeline that takes documents from arrival to validated, structured data — with humans only stepping in when the system isn't confident enough to proceed alone.
How It Works
Ingestion
Watchers on every inbound channel — email inbox, upload portal, shared Drive folder. Every document gets normalised: format conversion, image preprocessing (straighten, contrast correction, shadow removal), then queued for processing.
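The normalisation step can be sketched as a minimal dispatcher. The channel names, converter labels, and `Document` shape are illustrative, and the actual image preprocessing (straightening, contrast correction, shadow removal) is elided here:

```python
from dataclasses import dataclass
from pathlib import PurePath
from queue import Queue

# Illustrative mapping from inbound file type to normalisation step.
# The real pipeline also runs image preprocessing on scans and photos.
CONVERTERS = {
    ".pdf":  "pdf-native",
    ".docx": "pdf-converted",
    ".jpg":  "image-preprocessed",
    ".jpeg": "image-preprocessed",
    ".png":  "image-preprocessed",
}

@dataclass
class Document:
    source: str          # e.g. "email", "portal", "drive"
    path: str
    normalised_as: str = ""

processing_queue: "Queue[Document]" = Queue()

def ingest(doc: Document) -> Document:
    """Normalise an inbound document and queue it for classification."""
    suffix = PurePath(doc.path).suffix.lower()
    doc.normalised_as = CONVERTERS.get(suffix, "manual-review")
    processing_queue.put(doc)
    return doc
```

Unknown formats fall through to `"manual-review"` rather than failing silently, mirroring the principle that the system routes to a human when it can't proceed alone.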
Custom image preprocessing
Classification
AI vision model identifies document type (pay stub, bank statement, tax return, ID, etc.), language, and whether it's a digital original, scan, or photograph. Classification determines which extraction template runs next.
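The classification-to-extraction handoff might look like the following sketch. The document types mirror those above, but the template names and the `capture` field are hypothetical:

```python
# Hypothetical template registry; in production the doc_type comes
# from an AI vision model call, which is not shown here.
EXTRACTION_TEMPLATES = {
    "pay_stub": "pay_stub_v3",
    "bank_statement": "bank_statement_v2",
    "tax_return": "tax_return_v1",
    "id_document": "id_v2",
}

def select_template(classification: dict) -> str:
    """Pick the extraction template for a classified document."""
    doc_type = classification["doc_type"]
    template = EXTRACTION_TEMPLATES.get(doc_type)
    if template is None:
        raise ValueError(f"No extraction template for {doc_type!r}")
    # Scans and photos use a variant tuned for OCR noise.
    if classification.get("capture") in ("scan", "photo"):
        template += "-ocr"
    return template
```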
Intelligent Extraction
Each document type has a custom extraction schema. The AI returns structured JSON with every field plus a confidence score (0 to 1). Post-processing validates outputs — real dates, parseable amounts, internal consistency checks.
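A minimal version of that post-processing pass, assuming extracted fields arrive as `{value, confidence}` pairs and that date and amount fields are identified by a name suffix (both assumptions of this sketch):

```python
from datetime import datetime
from decimal import Decimal, InvalidOperation

def validate_extraction(fields: dict) -> list[str]:
    """Return a list of problems found in one extracted record.

    Field names and shapes are illustrative; each document type
    has its own schema in the real pipeline.
    """
    problems = []
    for name, entry in fields.items():
        # Confidence must be a number in [0, 1].
        conf = entry.get("confidence")
        if not (isinstance(conf, (int, float)) and 0.0 <= conf <= 1.0):
            problems.append(f"{name}: confidence out of range")
        value = entry.get("value")
        # Dates must actually parse, not just look date-shaped.
        if name.endswith("_date"):
            try:
                datetime.strptime(value, "%Y-%m-%d")
            except (TypeError, ValueError):
                problems.append(f"{name}: not a real date")
        # Amounts must be parseable as exact decimals.
        if name.endswith("_amount"):
            try:
                Decimal(value)
            except (InvalidOperation, TypeError):
                problems.append(f"{name}: unparseable amount")
    return problems
```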
Custom prompts, validation & confidence scoring
Validation & Cross-Referencing
Extracted data checked against the loan application: name matches, income alignment, employer verification, address consistency, duplicate detection. Each check produces pass/warning/fail.
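Two of those checks, sketched in code. The 10%/25% income tolerances and the field names are illustrative, not the client's actual thresholds:

```python
def cross_reference(extracted: dict, application: dict) -> dict:
    """Run hypothetical cross-checks; each yields 'pass', 'warning' or 'fail'."""
    results = {}
    # Applicant name must match the application (case-insensitive).
    results["name_match"] = (
        "pass" if extracted["name"].lower() == application["name"].lower()
        else "fail"
    )
    # Income within 10% of stated -> pass, within 25% -> warning, else fail.
    stated = application["monthly_income"]
    found = extracted["monthly_income"]
    ratio = abs(found - stated) / stated
    results["income_alignment"] = (
        "pass" if ratio <= 0.10 else "warning" if ratio <= 0.25 else "fail"
    )
    return results
```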
Custom validation layer
Smart Routing
High confidence + all validations passed → auto-approved, data populates the lending system. Medium confidence → Slack review queue (avg 90 seconds per review). Critical issues → escalated to senior reviewer.
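The routing rule reduces to a few lines. The 0.95 cutoff here is illustrative; as noted below, the real thresholds were tuned to the client's risk tolerance:

```python
def route(confidence: float, checks: dict) -> str:
    """Route a document based on extraction confidence and validation results."""
    if "fail" in checks.values():
        return "escalate_senior"      # critical issue -> senior reviewer
    if confidence >= 0.95 and all(v == "pass" for v in checks.values()):
        return "auto_approve"         # data populates the lending system
    return "slack_review_queue"       # medium confidence -> human review
```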
Learning Loop
Every document logged with processing time, confidence scores, and human corrections. Weekly reports surface patterns. Prompts and thresholds tuned continuously — auto-approval rate improved from 48% to 79% in three months.
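A minimal weekly-report aggregation over that log, assuming each entry records its routing decision and whether a human corrected the output (a sketch, not the production reporting job):

```python
from collections import Counter

def weekly_report(log: list[dict]) -> dict:
    """Summarise a week's routing decisions and human corrections."""
    routes = Counter(entry["route"] for entry in log)
    total = len(log)
    corrected = sum(1 for e in log if e.get("human_correction"))
    return {
        "auto_approval_rate": routes["auto_approve"] / total if total else 0.0,
        "correction_rate": corrected / total if total else 0.0,
        "routes": dict(routes),
    }
```

Tracking the correction rate per document type is what makes prompt and threshold tuning targeted rather than guesswork.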
Why This Couldn't Be Drag-and-Drop
Standard workflow tools (n8n, Zapier, Make) handled about 30% of this pipeline — the triggers, routing, notifications, and scheduling. The remaining 70% required custom work:
Image preprocessing to clean up photos of crumpled documents taken under bad lighting. Extraction prompt engineering, iterated over weeks for each document type. Post-processing logic that catches when line items on a pay stub don't add up to the stated total. Multi-page handling for bank statements whose transaction tables span pages. Confidence scoring tuned to the client's risk tolerance: for a lending platform, a wrong number isn't just an inconvenience, it's a compliance risk.
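The pay-stub consistency check mentioned above is simple to express once amounts parse cleanly; the tolerance and input shapes are assumptions of this sketch:

```python
from decimal import Decimal

def line_items_consistent(items: list[str], stated_total: str,
                          tolerance: str = "0.01") -> bool:
    """Flag pay stubs whose line items don't sum to the stated total."""
    total = sum(Decimal(x) for x in items)
    return abs(total - Decimal(stated_total)) <= Decimal(tolerance)
```

Using `Decimal` rather than floats avoids rounding artifacts that would otherwise trigger false mismatch flags on currency values.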
We used off-the-shelf tools where they made sense and wrote custom code where the problem demanded it. That's the difference between a demo and a production system.
Results
The Tech Stack
No new platforms for the team to learn. The system plugs into what they already use.
What Made This Work
Starting with the messy reality. We didn't build for perfectly formatted PDFs and hope for the best. We started with the worst-case documents — blurry photos, handwritten notes, multi-page scans — and built the system to handle those first. Everything else became easy by comparison.
Confidence scoring as the safety net. The system never guesses and hopes. Every extracted field carries a confidence score, and routing thresholds are tuned to the client's risk tolerance. For a lending platform, a wrong number isn't an inconvenience — it's a compliance risk.
Hybrid approach. Off-the-shelf workflow tools where they made sense. Custom code where the problem demanded it. This kept the build lean while handling complexity that drag-and-drop tools can't touch.
The feedback loop. The system got meaningfully better every week because we built measurement in from day one. Without tracking what humans corrected, there's no way to improve prompts and thresholds over time.
Engagement Timeline
Drowning in manual processes?
Let's talk about what a custom automation pipeline could free up for your team.
This case study represents a composite engagement based on real automation work. Client details have been anonymised.