PaperCut

🧠 PDF Matching Assistant – Hackathon Project

💡 Inspiration

At the TUM Hackathon, we collaborated with Beconex, a company that builds interfaces to SAP systems. They challenged us with a real-world business case: matching scanned bill pages from a multi-page PDF to structured goods receipt data in JSON—a task that’s deceptively complex due to OCR noise and inconsistent formatting.

This challenge inspired us to develop a lightweight yet powerful solution that blends classical text processing with modern AI techniques.

⸻

🔧 How We Built It

1. Text Extraction: We used PyPDF2 to extract searchable text from the OCR-processed PDF and transform it into a JSON structure.
2. Matching Strategy:
   • We applied Regular Expressions to identify relevant key fields.
   • Combined this with fuzzy search to handle OCR inconsistencies.
   • Each potential match received a confidence score based on string similarity.
3. Decision Logic:
   • If the score exceeds a threshold, we accept the match.
   • If it’s too low, we discard it and return an empty result to avoid false positives.
4. LLM Enhancement: To further increase accuracy, we integrated a locally hosted Large Language Model (LLM) to evaluate edge cases and handle document ambiguity.
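As a rough illustration of steps 2 and 3, here is a minimal sketch of regex candidate extraction followed by similarity scoring and a threshold check. It is not the project's actual code: the field pattern, the `delivery_note` key, the `best_match` helper, and the 0.85 threshold are all assumptions made for the example, and the standard library's `difflib.SequenceMatcher` stands in for the fuzzy search library so the snippet runs without extra dependencies.

```python
import re
from difflib import SequenceMatcher

# Hypothetical key field: two letters, optional dash, six characters.
# The tail allows letters as well as digits so OCR misreads (5 -> S) still match.
FIELD_PATTERN = re.compile(r"\b[A-Z]{2}-?[A-Z0-9]{6}\b")

# Assumed cutoff; a real value would be tuned on sample documents.
THRESHOLD = 0.85


def similarity(a: str, b: str) -> float:
    """Return a 0..1 string similarity score (case-insensitive)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_match(page_text: str, receipts: list, key: str = "delivery_note"):
    """Return (receipt, score) for the best candidate, or (None, score) if too weak."""
    best, best_score = None, 0.0
    for candidate in FIELD_PATTERN.findall(page_text):
        for receipt in receipts:
            score = similarity(candidate, receipt[key])
            if score > best_score:
                best, best_score = receipt, score
    if best_score >= THRESHOLD:
        return best, best_score
    # Below the threshold: return an empty result rather than risk a false positive.
    return None, best_score


receipts = [{"delivery_note": "LS-123456"}, {"delivery_note": "LS-987654"}]
# "LS-1234S6" simulates an OCR error where a 5 was read as an S.
match, score = best_match("Lieferschein LS-1234S6 vom 01.02.2025", receipts)
print(match, round(score, 2))
```

Returning the score alongside the empty result keeps the door open for a later stage, such as the LLM check in step 4, to look at near misses instead of discarding them silently.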

⸻

🚧 Challenges We Faced

• OCR Noise: Many PDFs contained errors due to low scan quality, making key fields hard to detect reliably.
• Varying Data Formats: Not every document shared the same identifiers, so we had to create a flexible and fault-tolerant matching approach.
• LLM Optimization: Hosting and tuning the local LLM came with resource limitations and performance tradeoffs.
• Multi-Page Invoices: Bills spanning multiple pages were tricky to associate with a single JSON entry, so we had to carefully stitch and evaluate combined text sections.
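One possible way to approach the multi-page problem is sketched below. It is a simple heuristic, not necessarily what the team implemented: the `INV-1234` identifier format and the `stitch_pages` helper are hypothetical, and the sketch assumes that a page without a recognizable identifier continues the bill on the previous page.

```python
import re
from typing import Dict, List

# Hypothetical identifier format, used only for this illustration.
INVOICE_ID = re.compile(r"\bINV-\d{4}\b")


def stitch_pages(pages: List[str]) -> List[Dict]:
    """Group consecutive page texts that appear to belong to the same bill."""
    groups: List[Dict] = []
    for number, text in enumerate(pages, start=1):
        found = INVOICE_ID.search(text)
        doc_id = found.group(0) if found else None
        # Start a new group on the first page, or when a different identifier appears.
        starts_new = not groups or (doc_id is not None and doc_id != groups[-1]["id"])
        if starts_new:
            groups.append({"id": doc_id, "pages": [number], "text": text})
        else:
            # Same identifier or none at all: treat the page as a continuation.
            groups[-1]["pages"].append(number)
            groups[-1]["text"] += "\n" + text
    return groups


pages = [
    "Invoice INV-0001 item list",
    "continued items and totals",
    "Invoice INV-0002 item list",
    "INV-0002 totals",
]
for group in stitch_pages(pages):
    print(group["id"], group["pages"])
```

The combined `text` of each group can then be matched against the JSON entries as a single unit instead of page by page.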

⸻

🧠 What We Learned

• How to combine traditional tools like Regex with modern AI solutions effectively.
• Practical experience in handling real-world, messy data from scanned documents.
• The importance of confidence scoring and fallback logic in automated document processing.
• A deeper understanding of enterprise software needs, especially in the SAP ecosystem.

Built With

fuzzysearch
llm
pypdf2
python
regex

Updates

Paul Kao started this project — Jun 22, 2025 04:58 AM EDT
