Hitachi DocGuard AI

Inspiration

In today’s data-driven enterprise, organizations are drowning in unstructured documents: PDFs, slide decks, and mixed-media files. Manual classification for compliance is slow, expensive, and prone to human error. A single sensitive chart buried on page 42 of a "public" report can lead to catastrophic data leaks or severe regulatory fines.

We realized that standard OCR and keyword matching aren't enough. We needed an intelligent "gatekeeper"—a system that doesn't just read text, but understands visual context, nuances in language, and complex business rules. Hitachi DocGuard AI was born from the need to automate this critical security layer, ensuring that no sensitive data leaves the organization unchecked, while keeping compliant workflows moving efficiently.

What it does

Hitachi DocGuard AI is an intelligent, multi-modal document security platform that automatically analyzes and classifies files into Public, Confidential, Highly Sensitive, or Unsafe categories.

Key capabilities include:

Multi-Modal Analysis: It reads text, interprets charts, and analyzes images across multi-page documents to understand complete context.

Dynamic Classification: Uses a configurable prompt library to generate "prompt trees" that adapt their questioning based on initial findings.

Evidence-Based Reporting: It doesn’t just give a label; it provides citations, pointing users to the exact page, paragraph, or image that triggered a specific classification.

Human-in-the-Loop (HITL): Low-confidence classifications are flagged for human review, and verified feedback is used to refine future prompt trees.

Dual-LLM Verification (Optional): For ultra-high security settings, two separate LLMs can cross-verify classifications to minimize false positives before human intervention is needed.

Pre-processing & Batching: Handles large-scale ingestion with automated checks for file legibility and real-time status updates.
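The prompt-tree, citation, and human-review capabilities above can be illustrated with a small sketch. It is not the project's actual implementation: the names `PromptNode`, `Answer`, `classify`, and the `REVIEW_THRESHOLD` value of 0.75 are hypothetical, and `fake_model` is a stub standing in for a real multi-modal model call.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

REVIEW_THRESHOLD = 0.75  # hypothetical cutoff; below it a human reviews the result


@dataclass
class Answer:
    yes: bool          # the model's yes/no answer to one question
    confidence: float  # the model's self-reported confidence, 0.0 to 1.0
    page: int          # citation: page where the evidence was found
    snippet: str       # citation: quoted text or a description of the image


@dataclass
class PromptNode:
    question: str
    label_if_yes: Optional[str] = None      # label assigned when the answer is yes
    if_yes: Optional["PromptNode"] = None   # follow-up question after a yes
    if_no: Optional["PromptNode"] = None    # follow-up question after a no


@dataclass
class Result:
    label: str
    confidence: float
    citations: list = field(default_factory=list)
    needs_review: bool = False


def classify(root: PromptNode, ask: Callable[[str], Answer]) -> Result:
    """Walk the tree, choosing the next question from each answer."""
    node, label, confidence, citations = root, "Public", 1.0, []
    while node is not None:
        answer = ask(node.question)
        # The overall confidence is that of the least certain step.
        confidence = min(confidence, answer.confidence)
        if answer.yes:
            citations.append((answer.page, answer.snippet))
            if node.label_if_yes:
                label = node.label_if_yes
            node = node.if_yes
        else:
            node = node.if_no
    return Result(label, confidence, citations, confidence < REVIEW_THRESHOLD)


# A two-question tree: check for personal data first, then for strategy content.
tree = PromptNode(
    "Does the document contain personal data?",
    label_if_yes="Highly Sensitive",
    if_no=PromptNode("Does it discuss internal strategy?", label_if_yes="Confidential"),
)


def fake_model(question: str) -> Answer:
    """Stub model: no personal data, but a low-confidence strategy hit on page 15."""
    if "personal data" in question:
        return Answer(False, 0.9, 0, "")
    return Answer(True, 0.6, 15, "screenshot of roadmap spreadsheet")


result = classify(tree, fake_model)
print(result.label, result.needs_review, result.citations)
```

In this run the second question is only asked because the first came back negative, and the 0.6 confidence on the strategy hit pushes the document into the review queue with its page-15 citation attached.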

How we built it

We built DocGuard AI with a focus on modularity and auditability.

Frontend: We designed a business-friendly UI using [e.g., React/Next.js] focused on clear visualizations of classification confidence and easy access to audit trails.

Core Engine: The backend is powered by [e.g., Python/FastAPI], orchestrating document processing pipelines.

AI Layer: We leveraged [e.g., LangChain/LlamaIndex] to manage our dynamic prompt trees. We utilize state-of-the-art multi-modal models (like [e.g., GPT-4o, Gemini 1.5 Pro, or Claude 3]) to analyze both visual and textual data simultaneously.

Data & Security: All processing metadata and audit logs are stored securely in [e.g., PostgreSQL/MongoDB], ensuring strict data privacy compliance.

Challenges we ran into

Context Window Limitations: Analyzing massive 100+ page PDFs with mixed media initially strained standard context windows. We had to implement smart chunking strategies that preserved semantic context across page breaks.

Defining "Nuance": Teaching the AI the difference between "Confidential" internal strategy and "Highly Sensitive" PII was difficult. We solved this by building the "Dynamic Prompt Tree," allowing the AI to ask follow-up questions to itself when it detected borderline content.

Balancing Speed vs. Accuracy: Dual-LLM verification is highly accurate but slow. We had to engineer an async batch processing system so users weren't stuck watching loading screens for large uploads.
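Two of the challenges above, chunking across page breaks and slow dual-model verification, can be sketched together. This is an illustration under stated assumptions, not the project's code: overlapping page windows are one simple way to keep context across page breaks, `model_a` and `model_b` are hypothetical stubs in place of real LLM calls, and escalating disagreements to human review is an assumed policy.

```python
import asyncio


def chunk_pages(pages, size=4, overlap=1):
    """Group pages into windows that share `overlap` pages with the previous one."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(pages), step):
        chunks.append(pages[start:start + size])
        if start + size >= len(pages):
            break  # the last page is covered; avoid a redundant tail chunk
    return chunks


async def model_a(chunk):
    """Stub for the first model: flags only SSN mentions."""
    await asyncio.sleep(0)  # stands in for network latency
    return "Highly Sensitive" if any("SSN" in p for p in chunk) else "Public"


async def model_b(chunk):
    """Stub for the second model: flags SSN and salary mentions."""
    await asyncio.sleep(0)
    hit = any("SSN" in p or "salary" in p for p in chunk)
    return "Highly Sensitive" if hit else "Public"


async def verify_chunk(chunk):
    # Query both models concurrently instead of one after the other.
    a, b = await asyncio.gather(model_a(chunk), model_b(chunk))
    # Assumed policy: agreement yields a label, disagreement goes to a human.
    return {"label": a if a == b else None, "needs_review": a != b}


async def verify_document(pages, size=4, overlap=1):
    chunks = chunk_pages(pages, size, overlap)
    # All chunks are verified concurrently as one batch.
    return await asyncio.gather(*(verify_chunk(c) for c in chunks))


pages = [f"page {i}" for i in range(1, 7)]
pages[2] += " SSN 000-00-0000"   # placeholder value, not real data
pages[5] += " salary table"
results = asyncio.run(verify_document(pages, size=3, overlap=1))
for r in results:
    print(r)
```

Because page 3 appears in both the first and second windows, content that straddles that boundary is seen in full by at least one chunk. The third window is the only one where the two stubs disagree, so it alone is marked for review.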

Accomplishments that we're proud of

The "Citation Engine": We are incredibly proud of the feature that highlights exactly why a document was flagged. Seeing the AI point to a specific blurry screenshot of a spreadsheet on page 15 as evidence for a "Highly Sensitive" tag feels like magic.

Effective HITL Workflow: We successfully built a feedback loop where human corrections actually update the operational parameters, making the system smarter over time without requiring full code redeployments.

Enterprise-Grade UI: Moving beyond a simple tech demo to a dashboard that looks and feels like a tool a compliance officer would actually use daily.
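The feedback loop described above, where corrections change behavior without a redeploy, could work by keeping tunable parameters in data instead of code. The sketch below assumes a JSON parameter file and a single `review_threshold` parameter; the file layout, the step sizes, and the `apply_feedback` helper are hypothetical and chosen only to show the idea.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical defaults; a real deployment would choose its own parameters.
DEFAULTS = {"review_threshold": 0.75, "step": 0.02}


def load_params(path: Path) -> dict:
    """Read parameters from disk, falling back to defaults on first use."""
    return json.loads(path.read_text()) if path.exists() else dict(DEFAULTS)


def apply_feedback(path: Path, model_label: str, human_label: str) -> dict:
    """Adjust the review threshold from one reviewer decision and persist it."""
    params = load_params(path)
    if model_label != human_label:
        # The model was wrong: raise the bar so more documents reach a human.
        new_value = min(0.99, params["review_threshold"] + params["step"])
    else:
        # The model was right: relax the bar slowly.
        new_value = max(0.50, params["review_threshold"] - params["step"] / 4)
    params["review_threshold"] = round(new_value, 4)
    path.write_text(json.dumps(params))
    return params


with tempfile.TemporaryDirectory() as tmp:
    params_file = Path(tmp) / "params.json"
    after_wrong = apply_feedback(params_file, "Public", "Confidential")["review_threshold"]
    after_right = apply_feedback(params_file, "Confidential", "Confidential")["review_threshold"]
    print(after_wrong, after_right)
```

Since the running service re-reads the file on each decision, a reviewer's correction takes effect on the next document with no code change. The asymmetric step sizes are a design choice in this sketch: errors tighten review quickly, while correct calls loosen it gradually.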

What we learned

Multi-modal is a must: Text-only analysis misses roughly 30% of sensitive context in modern business presentations (like screenshots of dashboards).

Trust requires transparency: Users didn't trust the "black box" AI until we added the detailed evidence citations. Once they could see the why, adoption became much easier.

What's next for Hitachi DocGuard AI

Automated Redaction: Moving from just classifying sensitive data to automatically redacting it upon user approval.

Industry-Specific Modules: Creating pre-tuned prompt libraries for regulated verticals such as Healthcare (HIPAA) and Finance (SOX), along with cross-industry data-protection rules such as GDPR.

Direct Integration: Building connectors for standard enterprise storage (SharePoint, Box, Google Workspace) to scan files already at rest.

Built With

chatgpt
databricks
gemini
python
react

Updates

Ryan Soriano started this project — Nov 09, 2025 01:16 PM EST
