Inspiration

In legal tech, the bottleneck is often the discovery phase, where lawyers must gather and analyze hundreds of documents—depositions, contracts, emails, regulations, internal memos, expert reports—to build a comprehensive understanding of their case. Every research question meant:

  • Opening 5-10 different PDFs manually
  • Running Command+F searches that miss context
  • Cross-referencing between documents in separate windows
  • Taking notes by hand to track what's in each file
  • Spending hours just to answer simple questions

This could take lawyers weeks of organizing and reading through the document pile before legal analysis even started. That's insane: roughly 90% of their time went to mechanical document retrieval, not actual legal thinking.

What if an AI could instantly answer questions across this entire document collection, but with lawyer-level precision? Not vague summaries, but actual citations with exact page numbers and highlighted text, so you can verify every claim.

Traditional RAG systems weren't good enough. They cite "Document A, page 3" but won't show you where on page 3, or which specific passage supports the answer. Legal work demands precision and zero hallucinations, so I set out to build a system that could:

  • Deep-link to exact highlighted passages with bounding boxes
  • Handle scanned PDFs and preserve document structure
  • Cross-reference across hundreds of documents instantly
  • Let lawyers ask follow-up questions like they're talking to a research assistant

ClauseSearch turns weeks of discovery drudgery into minutes of targeted Q&A.

What it does

ClauseSearch is a three-pillar AI workbench for legal document analysis:

1. Discover: Conversational Q&A with Deep Citations

Ask natural language questions and get AI-generated answers with inline citation boxes. Hover over any citation to see:

  • Document name & page number
  • Exact snippet with context

Example: "What are high-risk AI systems under the EU AI Act?" → Answer with [1] [2] citations that deep-link to highlighted paragraphs in the regulation PDF

2. Vault: Smart Document Ingestion Pipeline

Upload PDFs and the system runs them through:

  • OCR Processing with Google Document AI (handles scanned docs + preserves layout)
  • Smart Chunking with token-based overlap while preserving page/bbox metadata
  • Hybrid Search - BM25 (keyword) + kNN (semantic) with RRF fusion in Elasticsearch
  • Embeddings via Vertex AI (text-embedding-004) for semantic understanding
  • Metadata Storage in Firestore + PostgreSQL for traceability

Supports batch uploads and handles large PDFs by automatically splitting into 10-page chunks for processing.
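
Conceptually, the whole pipeline is four calls in a row. A minimal sketch, with hypothetical stubs standing in for the real agents described under "How I built it":

from typing import Any

# Hypothetical stand-ins for the real pipeline agents
def run_ocr(pdf_path: str) -> dict[str, Any]: ...                              # Document AI OCR
def chunk_layout(layout: dict[str, Any]) -> list[dict]: ...                    # 1000 tokens, 200 overlap
def embed_chunks(chunks: list[dict]) -> list[list[float]]: ...                 # text-embedding-004
def index_chunks(chunks: list[dict], vectors: list[list[float]]) -> None: ...  # Elasticsearch + Firestore

def ingest(pdf_path: str) -> None:
    layout = run_ocr(pdf_path)       # OCR with layout + bounding boxes
    chunks = chunk_layout(layout)    # token-based chunks, metadata preserved
    vectors = embed_chunks(chunks)   # 768-dim embeddings
    index_chunks(chunks, vectors)    # BM25 + kNN indexes, span maps to Firestore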

3. Table: AI-Powered Batch Analysis

Transform your document vault into a structured spreadsheet with AI:

  • Template Analysis - One-click extraction of metadata across all documents (Date, Document Type, Summary, Author, Persons Mentioned, Language)
  • Custom Columns - Ask any question and the AI answers it for every document
    • Example: "Does this document mention GDPR Article 6?" → Yes/No + explanation for 100 documents
  • Background Jobs with real-time progress tracking
  • Export-Ready data for discovery reviews and compliance audits
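
Under the hood, a custom column is the same question fanned out across every document in the vault. A minimal sketch, where ask_document is a hypothetical helper wrapping per-document retrieval plus the LLM call:

def ask_document(question: str, doc_id: str) -> str: ...  # hypothetical: retrieval + Gemini, scoped to one doc

def run_custom_column(question: str, doc_ids: list[str]) -> dict[str, str]:
    # One answer per document = one cell per row in the resulting table
    return {doc_id: ask_document(question, doc_id) for doc_id in doc_ids}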

How I built it

Architecture: Hybrid Multi-Agent System

Backend (Python + FastAPI)

┌─────────────┐
│   Upload    │
└──────┬──────┘
       │
       ▼
┌─────────────────────────────────────┐
│  Document AI Agent (GCP)            │
│  - OCR with layout preservation     │
│  - Auto-split PDFs >10 pages        │
│  - Bounding box extraction          │
└──────┬──────────────────────────────┘
       │
       ▼
┌─────────────────────────────────────┐
│  Chunking Agent                     │
│  - Token-based (1000 tokens/chunk)  │
│  - 200-token overlap                │
│  - Preserves page + bbox mapping    │
└──────┬──────────────────────────────┘
       │
       ▼
┌─────────────────────────────────────┐
│  Embedding Agent (Vertex AI)        │
│  - Batch embeddings (768-dim)       │
│  - text-embedding-004 model         │
└──────┬──────────────────────────────┘
       │
       ▼
┌─────────────────────────────────────┐
│  Indexing Agent (Elasticsearch)     │
│  - Dense vector index (kNN)         │
│  - Full-text index (BM25)           │
│  - Span map storage (Firestore)     │
└─────────────────────────────────────┘

         Query Flow
         ─────────

User Query → Embedding → Hybrid Search
              (BM25 + kNN + RRF)
                   ↓
            Context Assembly
            (Top-k chunks with
             citations [1][2][3])
                   ↓
           LLM Agent (Gemini 1.5 Pro)
           System: "You are a legal AI.
                   Answer ONLY from context.
                   Use [1][2] citations."
                   ↓
           Answer with Citations
           + Bounding Box Metadata
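
Concretely, the LLM step is a single grounded generation call. A minimal sketch using the Vertex AI SDK, assuming the prompt wiring shown above (the project id is a placeholder):

import vertexai
from vertexai.generative_models import GenerativeModel

vertexai.init(project="your-gcp-project", location="us-central1")

SYSTEM = (
    "You are a legal AI. Answer ONLY from the provided context. "
    "Cite sources with bracketed numbers like [1][2]."
)

def answer(question: str, context_blocks: list[str]) -> str:
    # Number each retrieved chunk so the model can cite it as [1], [2], ...
    context = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(context_blocks))
    model = GenerativeModel("gemini-1.5-pro", system_instruction=SYSTEM)
    response = model.generate_content(f"Context:\n{context}\n\nQuestion: {question}")
    return response.text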

Key Services:

  • ingestion.py - Orchestrates the 4-step pipeline (OCR → Chunk → Embed → Index)
  • elasticsearch.py - Hybrid search with RRF (Reciprocal Rank Fusion)
  • vertex_ai.py - Gemini 1.5 Pro + text-embedding-004
  • table_analysis.py - Batch processing with background jobs

Frontend (Next.js 15 + TypeScript)

  • ShadCN UI + Tailwind for polished components
  • Radix UI Tooltips for citation previews
  • ReactMarkdown with custom citation rendering
  • PostgreSQL (Prisma) for vault/project management
  • Real-time updates for batch job progress

Tech Stack

Layer        Technology
──────────   ─────────────────────────────────────────────────
LLM          Gemini 1.5 Pro (Vertex AI)
Embeddings   text-embedding-004 (768-dim)
Search       Elasticsearch (Hybrid BM25 + kNN)
OCR          Google Document AI
Storage      GCS (raw PDFs) + Firestore (metadata) + PostgreSQL
Backend      FastAPI + Python 3.13
Frontend     Next.js 15 + TypeScript + Bun
Deployment   Cloud Run (backend) + Vercel (frontend)

Challenges I ran into

1. Document AI's 10-Page Limit

Google Document AI restricts synchronous processing to 10 pages. For a 50-page regulation PDF, this was a dealbreaker.

Solution: Built automatic PDF splitting with PyPDF2. The ingestion pipeline now:

  1. Checks page count with PdfReader
  2. Splits into 10-page chunks
  3. Processes each chunk with Document AI
  4. Merges results and adjusts page numbers
  5. Cleans up temporary files

Code:

from PyPDF2 import PdfReader, PdfWriter

reader = PdfReader(local_pdf_path)
total_pages = len(reader.pages)

if total_pages > 10:
    for chunk_start in range(0, total_pages, 10):
        # Extract the next slice of up to 10 pages
        writer = PdfWriter()
        for page_num in range(chunk_start, min(chunk_start + 10, total_pages)):
            writer.add_page(reader.pages[page_num])

        # Write the slice to GCS (chunk_gcs_uri), then OCR it
        chunk_result = self.doc_ai.process_document(chunk_gcs_uri, mime_type)

        # Merge results, shifting page numbers back: page_number += chunk_start

2. Elasticsearch Hybrid Search Tuning

Combining BM25 (keyword) and kNN (semantic) required careful tuning. Pure semantic search missed exact legal terms; pure keyword search missed context.

Solution: Implemented Reciprocal Rank Fusion (RRF):

from collections import defaultdict

def hybrid_search(query_text, query_vector, k=5, num_candidates=50):
    # BM25 (keyword) search
    bm25_results = text_search(query_text, k=num_candidates)

    # kNN (semantic) search
    knn_results = vector_search(query_vector, k=num_candidates)

    # RRF fusion: rank-based scoring, no score normalization needed
    scores = defaultdict(float)
    for rank, hit in enumerate(bm25_results):
        scores[hit.id] += 1 / (rank + 60)  # 60 = RRF rank constant
    for rank, hit in enumerate(knn_results):
        scores[hit.id] += 1 / (rank + 60)

    # Re-rank by combined score
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)[:k]

Testing on the EU AI Act demo showed 40% better precision vs. pure semantic search.

3. Bounding Box Coordinate Mapping

Document AI returns character-level bounding boxes. Chunks span multiple lines/pages, so mapping chunk text back to precise PDF coordinates was complex.

Solution: Built a character-to-location map:

# Build a character-to-location map from the Document AI layout
char_map = {}  # {char_idx: {"page": int, "bbox": [x1, y1, x2, y2]}}

for page in layout_data["pages"]:
    for token in page["tokens"]:
        for char_idx in range(token["char_start"], token["char_end"]):
            char_map[char_idx] = {
                "page": page["page_number"],
                "bbox": token["bbox"],
            }

# For each chunk, sample every 10th character to collect its bboxes
bbox_list = []
for char_idx in range(chunk_start, chunk_end, 10):
    location = char_map.get(char_idx)  # some indices (e.g. whitespace) have no token
    if location:
        bbox_list.append(location["bbox"])

Stored in Firestore as "span maps" for instant retrieval during citation rendering.
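
Persisting a span map is a single Firestore write; a minimal sketch (collection and field names are illustrative):

from google.cloud import firestore

db = firestore.Client()

def store_span_map(chunk_id: str, doc_id: str, page: int, bboxes: list[list[float]]) -> None:
    # One span-map document per chunk; read back at render time to draw highlights
    db.collection("span_maps").document(chunk_id).set({
        "doc_id": doc_id,
        "page": page,
        "bboxes": bboxes,
    })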

Accomplishments that I'm proud of

Correct Citation System

Building a citation system that goes beyond "page 3" to show exact highlighted passages with bounding boxes. This is the feature that makes ClauseSearch viable for legal work: lawyers can always cross-check the source to verify the AI isn't hallucinating.

Sub-2-Second Query Times

Optimized the entire pipeline to answer complex questions in <2 seconds:

  • Embedding: ~300ms
  • Hybrid search: ~400ms
  • LLM generation: ~1000ms
  • Total: ~1.7s average

Smart Chunking with Layout Preservation

Most RAG systems chunk text naively and lose page/bbox metadata. My chunking agent:

  • Uses token-based chunking (tiktoken) for LLM compatibility
  • Preserves character-to-bbox mapping throughout
  • Handles overlap intelligently (200-token sliding window)
  • Stores span maps for instant citation highlighting

What I learned

Hybrid Search is Non-Negotiable for Legal

I initially built pure semantic search and got burned. Queries like "GDPR Article 6" failed because the embedding model treated "Article 6" generically. Adding BM25 with RRF fusion fixed those precision issues.

Chunking Strategy Makes or Breaks RAG

I tried:

  1. Sentence-based chunking - Too small, lost context (Failed)
  2. Fixed character chunks - Split mid-sentence, garbage embeddings (Failed)
  3. Token-based with overlap - Clean boundaries, LLM-compatible, preserves context (Succeeded)

Key insight: Always use tiktoken for chunking—it matches how the LLM tokenizes text.
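
A minimal sketch of the winning strategy, token-based chunking with a sliding overlap (the encoding name here is an assumption; any tokenizer close to the target LLM's will do):

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # assumption: stand-in tokenizer

def chunk_text(text: str, size: int = 1000, overlap: int = 200) -> list[str]:
    tokens = enc.encode(text)
    chunks = []
    step = size - overlap  # advance 800 tokens, so 200 tokens repeat between chunks
    for start in range(0, len(tokens), step):
        chunks.append(enc.decode(tokens[start:start + size]))
        if start + size >= len(tokens):
            break  # the final window already covers the end of the text
    return chunks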

LLM Citation Hallucination is Real

Gemini would confidently cite [7] when only [1-5] existed. I fixed this by:

  • Using strict system prompts: "Context has citations [1] through [5]. Use ONLY these numbers."
  • Post-processing to validate citation numbers against actual context
  • Frontend re-numbering to ensure sequential [1][2][3] display
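
The post-processing step can be a short regex pass; a minimal sketch of the idea:

import re

def validate_citations(answer: str, num_sources: int) -> str:
    # Strip any [n] the context cannot back up, e.g. [7] when only [1]-[5] exist
    def keep_if_valid(match: re.Match) -> str:
        n = int(match.group(1))
        return match.group(0) if 1 <= n <= num_sources else ""
    return re.sub(r"\[(\d+)\]", keep_if_valid, answer)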

Google Cloud Services Play Together Beautifully

Document AI → GCS → Firestore → Vertex AI formed a seamless pipeline. No data marshaling between cloud providers, built-in auth, and Vertex AI's Gemini integration was chef's kiss.

What's next for ClauseSearch

1. Multi-Modal Document Understanding

Add vision capabilities (Gemini 1.5 Pro's multimodal input) to:

  • Extract data from scanned tables and charts
  • Understand diagrams and flowcharts in technical specs
  • Process handwritten notes from depositions

2. Comparative Analysis Agent

Build an agent that compares documents side-by-side:

  • "How does TechNova's AI policy differ from the EU AI Act requirements?"
  • Generate diff tables highlighting gaps and compliance issues
  • Auto-detect contradictions between internal docs and external regulations

3. Redaction & Privacy Agent

Add PII detection and auto-redaction:

  • Highlight names, SSNs, emails before export
  • One-click redaction for court filings
  • GDPR compliance mode (auto-detect personal data)

4. Legal Template Library

Pre-built analysis templates:

  • Contract Review (obligations, termination clauses, liability)
  • Due Diligence (risk factors, compliance gaps)
  • eDiscovery (responsive docs, privilege review)
  • Regulatory Compliance (requirement mapping)
