Inspiration
In legal tech, the bottleneck is often the discovery phase, where lawyers must gather and analyze hundreds of documents—depositions, contracts, emails, regulations, internal memos, expert reports—to build a comprehensive understanding of their case. Every research question meant:
- Opening 5-10 different PDFs manually
- Running Command+F searches that miss context
- Cross-referencing between documents in separate windows
- Taking notes by hand to track what's in each file
- Spending hours just to answer simple questions
This could take lawyers weeks just organizing and reading through the document pile before even starting legal analysis. That's insane: roughly 90% of their time was spent on mechanical document retrieval, not actual legal thinking.
What if an AI could instantly answer questions across this entire document collection, but with lawyer-level precision? Not vague summaries, but actual citations with exact page numbers and highlighted text, so you can verify every claim.
Traditional RAG systems weren't good enough. They cite "Document A, page 3" but won't show you where on page 3 or which specific passage supports the answer. Legal work demands precision and zero hallucinations, so I set out to build a system that could:
- Deep-link to exact highlighted passages with bounding boxes
- Handle scanned PDFs and preserve document structure
- Cross-reference across hundreds of documents instantly
- Let lawyers ask follow-up questions like they're talking to a research assistant
ClauseSearch turns weeks of discovery drudgery into minutes of targeted Q&A.
What it does
ClauseSearch is a three-pillar AI workbench for legal document analysis:
1. Discover: Conversational Q&A with Deep Citations
Ask natural language questions and get AI-generated answers with inline citation boxes. Hover over any citation to see:
- Document name & page number
- Exact snippet with context
Example: "What are high-risk AI systems under the EU AI Act?"
→ Answer with [1] [2] citations that deep-link to highlighted paragraphs in the regulation PDF
2. Vault: Smart Document Ingestion Pipeline
Upload PDFs and the system:
- OCR Processing with Google Document AI (handles scanned docs + preserves layout)
- Smart Chunking with token-based overlap while preserving page/bbox metadata
- Hybrid Search - BM25 (keyword) + kNN (semantic) with RRF fusion in Elasticsearch
- Embeddings via Vertex AI (text-embedding-004) for semantic understanding
- Metadata Storage in Firestore + PostgreSQL for traceability
Supports batch uploads and handles large PDFs by automatically splitting into 10-page chunks for processing.
3. Table: AI-Powered Batch Analysis
Transform your document vault into a structured spreadsheet with AI:
- Template Analysis - One-click extraction of metadata across all documents (Date, Document Type, Summary, Author, Persons Mentioned, Language)
- Custom Columns - Ask any question and the AI answers it for every document
- Example: "Does this document mention GDPR Article 6?" → Yes/No + explanation for 100 documents
- Background Jobs with real-time progress tracking
- Export-Ready data for discovery reviews and compliance audits
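The custom-column feature boils down to running one question against every document in the vault. A minimal sketch of that loop, assuming a hypothetical `ask_llm` callable standing in for the real Gemini client:

```python
from typing import Callable


def run_custom_column(documents: dict[str, str], question: str,
                      ask_llm: Callable[[str], str]) -> dict[str, str]:
    """Ask the same question of every document; returns {doc_name: answer}.

    `documents` maps document names to their extracted text; `ask_llm` is
    whatever LLM client the pipeline wires in (hypothetical name here).
    """
    results = {}
    for name, text in documents.items():
        prompt = (f"Answer from this document only.\n\n"
                  f"Document: {text}\n\nQuestion: {question}")
        results[name] = ask_llm(prompt)
    return results
```

In the real system each call runs inside a background job so the table fills in row by row with progress updates.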
How I built it
Architecture: Hybrid Multi-Agent System
Backend (Python + FastAPI)
┌─────────────┐
│ Upload │
└──────┬──────┘
│
▼
┌─────────────────────────────────────┐
│ Document AI Agent (GCP) │
│ - OCR with layout preservation │
│ - Auto-split PDFs >10 pages │
│ - Bounding box extraction │
└──────┬──────────────────────────────┘
│
▼
┌─────────────────────────────────────┐
│ Chunking Agent │
│ - Token-based (1000 tokens/chunk) │
│ - 200-token overlap │
│ - Preserves page + bbox mapping │
└──────┬──────────────────────────────┘
│
▼
┌─────────────────────────────────────┐
│ Embedding Agent (Vertex AI) │
│ - Batch embeddings (768-dim) │
│ - text-embedding-004 model │
└──────┬──────────────────────────────┘
│
▼
┌─────────────────────────────────────┐
│ Indexing Agent (Elasticsearch) │
│ - Dense vector index (kNN) │
│ - Full-text index (BM25) │
│ - Span map storage (Firestore) │
└─────────────────────────────────────┘
Query Flow
─────────
User Query → Embedding → Hybrid Search
(BM25 + kNN + RRF)
↓
Context Assembly
(Top-k chunks with
citations [1][2][3])
↓
LLM Agent (Gemini 1.5 Pro)
System: "You are a legal AI.
Answer ONLY from context.
Use [1][2] citations."
↓
Answer with Citations
+ Bounding Box Metadata
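The "Context Assembly" step above numbers the top-k chunks so the LLM can only cite what it was given. A sketch of that assembly, assuming each retrieved chunk carries `text`, `doc_name`, and `page` (field names illustrative):

```python
def assemble_context(chunks: list[dict]) -> tuple[str, list[dict]]:
    """Number the top-k chunks [1], [2], ... and build the grounded prompt.

    Returns the full prompt plus a citation list the frontend can later
    resolve to bounding boxes via the stored span maps.
    """
    parts, citations = [], []
    for i, chunk in enumerate(chunks, start=1):
        parts.append(f"[{i}] ({chunk['doc_name']}, p.{chunk['page']}) {chunk['text']}")
        citations.append({"n": i, "doc": chunk["doc_name"], "page": chunk["page"]})
    system = ("You are a legal AI. Answer ONLY from the context. "
              f"Cite with [1] through [{len(chunks)}].")
    return system + "\n\n" + "\n\n".join(parts), citations
```

Because the allowed citation range is stated explicitly in the prompt, invalid numbers can also be caught mechanically afterwards.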
Key Services:
- ingestion.py - Orchestrates the 4-step pipeline (OCR → Chunk → Embed → Index)
- elasticsearch.py - Hybrid search with RRF (Reciprocal Rank Fusion)
- vertex_ai.py - Gemini 1.5 Pro + text-embedding-004
- table_analysis.py - Batch processing with background jobs
Frontend (Next.js 15 + TypeScript)
- ShadCN UI + Tailwind for polished components
- Radix UI Tooltips for citation previews
- ReactMarkdown with custom citation rendering
- PostgreSQL (Prisma) for vault/project management
- Real-time updates for batch job progress
Tech Stack
| Layer | Technology |
|---|---|
| LLM | Gemini 1.5 Pro (Vertex AI) |
| Embeddings | text-embedding-004 (768-dim) |
| Search | Elasticsearch (Hybrid BM25 + kNN) |
| OCR | Google Document AI |
| Storage | GCS (raw PDFs) + Firestore (metadata) + PostgreSQL |
| Backend | FastAPI + Python 3.13 |
| Frontend | Next.js 15 + TypeScript + Bun |
| Deployment | Cloud Run (backend) + Vercel (frontend) |
Challenges I ran into
1. Document AI's 10-Page Limit
Google Document AI restricts synchronous processing to 10 pages. For a 50-page regulation PDF, this was a dealbreaker.
Solution: Built automatic PDF splitting with PyPDF2. The ingestion pipeline now:
- Checks page count with PdfReader
- Splits into 10-page chunks
- Processes each chunk with Document AI
- Merges results and adjusts page numbers
- Cleans up temporary files
Code:
from PyPDF2 import PdfReader, PdfWriter

if total_pages > 10:
    for chunk_start in range(0, total_pages, 10):
        # Extract the next batch of up to 10 pages
        writer = PdfWriter()
        for page_num in range(chunk_start, min(chunk_start + 10, total_pages)):
            writer.add_page(reader.pages[page_num])
        # Upload the chunk and process it with Document AI
        chunk_result = self.doc_ai.process_document(chunk_gcs_uri, mime_type)
        # Shift page numbers back to the original doc: page_number += chunk_start
2. Elasticsearch Hybrid Search Tuning
Combining BM25 (keyword) and kNN (semantic) required careful tuning. Pure semantic search missed exact legal terms; pure keyword search missed context.
Solution: Implemented Reciprocal Rank Fusion (RRF):
from collections import defaultdict

def hybrid_search(query_text, query_vector, k=5, num_candidates=50):
    # BM25 (keyword) search
    bm25_results = text_search(query_text, k=num_candidates)
    # kNN (semantic) search
    knn_results = vector_search(query_vector, k=num_candidates)
    # RRF fusion: score each hit by its rank in each result list
    scores = defaultdict(float)
    for rank, hit in enumerate(bm25_results):
        scores[hit.id] += 1 / (rank + 60)  # 60 = RRF rank constant
    for rank, hit in enumerate(knn_results):
        scores[hit.id] += 1 / (rank + 60)
    # Re-rank by combined score
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)[:k]
Testing on the EU AI Act demo showed 40% better precision vs. pure semantic search.
3. Bounding Box Coordinate Mapping
Document AI returns character-level bounding boxes. Chunks span multiple lines/pages, so mapping chunk text back to precise PDF coordinates was complex.
Solution: Built a character-to-location map:
char_map = {}  # {char_idx: {"page": int, "bbox": [x1, y1, x2, y2]}}
for page in layout_data["pages"]:
    for token in page["tokens"]:
        for char_idx in range(token["char_start"], token["char_end"]):
            char_map[char_idx] = {
                "page": page["page_number"],
                "bbox": token["bbox"],
            }

# For each chunk, sample every 10th character to approximate its extent
bbox_list = []
for char_idx in range(chunk_start, chunk_end, 10):
    location = char_map[char_idx]
    bbox_list.append(location["bbox"])
Stored in Firestore as "span maps" for instant retrieval during citation rendering.
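Before rendering, the sampled boxes still need to collapse into something drawable. One way to do that is to union them into a single highlight rectangle per page, assuming the same `{"page": ..., "bbox": [x1, y1, x2, y2]}` shape as the span-map entries; a sketch:

```python
def merge_bboxes(samples: list[dict]) -> dict[int, list[float]]:
    """Union sampled token boxes into one highlight rect per page.

    Each sample looks like {"page": int, "bbox": [x1, y1, x2, y2]}.
    Returns {page_number: [x1, y1, x2, y2]} covering all samples on it.
    """
    rects: dict[int, list[float]] = {}
    for s in samples:
        x1, y1, x2, y2 = s["bbox"]
        if s["page"] not in rects:
            rects[s["page"]] = [x1, y1, x2, y2]
        else:
            r = rects[s["page"]]
            r[0], r[1] = min(r[0], x1), min(r[1], y1)
            r[2], r[3] = max(r[2], x2), max(r[3], y2)
    return rects
```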
Accomplishments that I'm proud of
Correct Citation System
Building a citation system that goes beyond "page 3" to show exact highlighted passages with bounding boxes. This is the feature that makes ClauseSearch ready for legal work: lawyers can always cross-check the source passage to verify the AI hasn't hallucinated.
Sub-2-Second Query Times
Optimized the entire pipeline to answer complex questions in <2 seconds:
- Embedding: ~300ms
- Hybrid search: ~400ms
- LLM generation: ~1000ms
- Total: ~1.7s average
Smart Chunking with Layout Preservation
Most RAG systems chunk text naively and lose page/bbox metadata. My chunking agent:
- Uses token-based chunking (tiktoken) for LLM compatibility
- Preserves character-to-bbox mapping throughout
- Handles overlap intelligently (200-token sliding window)
- Stores span maps for instant citation highlighting
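The winning strategy above is just a sliding window over the token stream. A minimal sketch with the 1000/200 numbers from the pipeline, where a plain token list stands in for tiktoken's token IDs:

```python
def chunk_tokens(tokens: list, size: int = 1000, overlap: int = 200) -> list[list]:
    """Slide a `size`-token window forward with `overlap` tokens of carry-over.

    Consecutive chunks share their last/first `overlap` tokens so no
    sentence falls into a boundary gap.
    """
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break
    return chunks
```

In the real pipeline the chunk's character range is then looked up in the char-to-bbox map so the overlap doesn't break citation highlighting.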
What I learned
Hybrid Search is Non-Negotiable for Legal
I initially built pure semantic search and got burned. Queries like "GDPR Article 6" failed because the embedding model treated "Article 6" generically. Adding BM25 with RRF fusion fixed these precision issues.
Chunking Strategy Makes or Breaks RAG
I tried:
- Sentence-based chunking - Too small, lost context (Failed)
- Fixed character chunks - Split mid-sentence, garbage embeddings (Failed)
- Token-based with overlap - Clean boundaries, LLM-compatible, preserves context (Succeeded)
Key insight: Always use tiktoken for chunking—it matches how the LLM tokenizes text.
LLM Citation Hallucination is Real
Gemini would confidently cite [7] when only [1-5] existed. I fixed this by:
- Using strict system prompts: "Context has citations [1] through [5]. Use ONLY these numbers."
- Post-processing to validate citation numbers against actual context
- Frontend re-numbering to ensure sequential [1][2][3] display
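The post-processing and re-numbering steps can be sketched as one pass over the answer text; this is an illustrative version, not the exact code from the repo:

```python
import re


def validate_citations(answer: str, num_sources: int) -> str:
    """Drop citation markers outside [1..num_sources], then renumber the
    survivors sequentially so the frontend always shows [1][2][3]..."""
    found = [int(n) for n in re.findall(r"\[(\d+)\]", answer)]
    valid = [n for n in found if 1 <= n <= num_sources]
    # Map each distinct valid citation (in order of first use) to a new number
    order = {n: i + 1 for i, n in enumerate(dict.fromkeys(valid))}

    def repl(m: re.Match) -> str:
        n = int(m.group(1))
        return f"[{order[n]}]" if n in order else ""

    return re.sub(r"\[(\d+)\]", repl, answer)
```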
Google Cloud Services Play Together Beautifully
Document AI → GCS → Firestore → Vertex AI formed a seamless pipeline. No data marshaling between cloud providers, built-in auth, and VertexAI's Gemini integration was chef's kiss.
What's next for ClauseSearch
1. Multi-Modal Document Understanding Add vision models (Gemini 1.5 Pro Vision) to:
- Extract data from scanned tables and charts
- Understand diagrams and flowcharts in technical specs
- Process handwritten notes from depositions
2. Comparative Analysis Agent Build an agent that compares documents side-by-side:
- "How does TechNova's AI policy differ from the EU AI Act requirements?"
- Generate diff tables highlighting gaps and compliance issues
- Auto-detect contradictions between internal docs and external regulations
3. Redaction & Privacy Agent Add PII detection and auto-redaction:
- Highlight names, SSNs, emails before export
- One-click redaction for court filings
- GDPR compliance mode (auto-detect personal data)
4. Legal Template Library Pre-built analysis templates:
- Contract Review (obligations, termination clauses, liability)
- Due Diligence (risk factors, compliance gaps)
- eDiscovery (responsive docs, privilege review)
- Regulatory Compliance (requirement mapping)
Built With
- clerk
- cloud-run
- cloud-sql
- cloud-storage
- document-ai
- elastic-search
- firestore
- gemini
- google-cloud
- vertex-ai