Inspiration

In legal tech, the bottleneck is often the discovery phase, where lawyers must gather and analyze hundreds of documents—depositions, contracts, emails, regulations, internal memos, expert reports—to build a comprehensive understanding of their case. Every research question meant:

  • Opening 5-10 different PDFs manually
  • Running Command+F searches that miss context
  • Cross-referencing between documents in separate windows
  • Taking notes by hand to track what's in each file
  • Spending hours just to answer simple questions

This could take lawyers weeks of organizing and reading through the document pile before legal analysis even started. That's insane: roughly 90% of their time went to mechanical document retrieval, not actual legal thinking.

What if an AI could instantly answer questions across this entire document collection, but with lawyer-level precision? Not vague summaries, but actual citations with exact page numbers and highlighted text, so you can verify every claim.

Traditional RAG systems weren't good enough. They cite "Document A, page 3" but won't show you where on page 3, or which specific passage supports the answer. Legal work demands precision and zero hallucinations, so I set out to build a system that could:

  • Deep-link to exact highlighted passages with bounding boxes
  • Handle scanned PDFs and preserve document structure
  • Cross-reference across hundreds of documents instantly
  • Let lawyers ask follow-up questions like they're talking to a research assistant

ClauseSearch turns weeks of discovery drudgery into minutes of targeted Q&A.

What it does

ClauseSearch is a three-pillar AI workbench for legal document analysis:

1. Discover: Conversational Q&A with Deep Citations

Ask natural language questions and get AI-generated answers with inline citation boxes. Hover over any citation to see:

  • Document name & page number
  • Exact snippet with context

Example: "What are high-risk AI systems under the EU AI Act?" → Answer with [1] [2] citations that deep-link to highlighted paragraphs in the regulation PDF

2. Vault: Smart Document Ingestion Pipeline

Upload PDFs and the system runs them through:

  • OCR Processing with Google Document AI (handles scanned docs + preserves layout)
  • Smart Chunking with token-based overlap while preserving page/bbox metadata
  • Hybrid Search - BM25 (keyword) + kNN (semantic) with RRF fusion in Elasticsearch
  • Embeddings via Vertex AI (text-embedding-004) for semantic understanding
  • Metadata Storage in Firestore + PostgreSQL for traceability

Supports batch uploads and handles large PDFs by automatically splitting into 10-page chunks for processing.
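
Conceptually, the whole pipeline is four calls in a row. A minimal sketch, with hypothetical stubs standing in for the real agents described under "How I built it":

from typing import Any

# Hypothetical stand-ins for the real pipeline agents
def run_ocr(pdf_path: str) -> dict[str, Any]: ...                              # Document AI OCR
def chunk_layout(layout: dict[str, Any]) -> list[dict]: ...                    # 1000 tokens, 200 overlap
def embed_chunks(chunks: list[dict]) -> list[list[float]]: ...                 # text-embedding-004
def index_chunks(chunks: list[dict], vectors: list[list[float]]) -> None: ...  # Elasticsearch + Firestore

def ingest(pdf_path: str) -> None:
    layout = run_ocr(pdf_path)       # OCR with layout + bounding boxes
    chunks = chunk_layout(layout)    # token-based chunks, metadata preserved
    vectors = embed_chunks(chunks)   # 768-dim embeddings
    index_chunks(chunks, vectors)    # BM25 + kNN indexes, span maps to Firestore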

3. Table: AI-Powered Batch Analysis

Transform your document vault into a structured spreadsheet with AI:

  • Template Analysis - One-click extraction of metadata across all documents (Date, Document Type, Summary, Author, Persons Mentioned, Language)
  • Custom Columns - Ask any question and the AI answers it for every document
    • Example: "Does this document mention GDPR Article 6?" → Yes/No + explanation for 100 documents
  • Background Jobs with real-time progress tracking
  • Export-Ready data for discovery reviews and compliance audits
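
Under the hood, a custom column is the same question fanned out across every document in the vault. A minimal sketch, where ask_document is a hypothetical helper wrapping per-document retrieval plus the LLM call:

def ask_document(question: str, doc_id: str) -> str: ...  # hypothetical: retrieval + Gemini, scoped to one doc

def run_custom_column(question: str, doc_ids: list[str]) -> dict[str, str]:
    # One answer per document = one cell per row in the resulting table
    return {doc_id: ask_document(question, doc_id) for doc_id in doc_ids}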

How I built it

Architecture: Hybrid Multi-Agent System

Backend (Python + FastAPI)

┌─────────────┐
│   Upload    │
└──────┬──────┘
       │
       ▼
┌─────────────────────────────────────┐
│  Document AI Agent (GCP)            │
│  - OCR with layout preservation     │
│  - Auto-split PDFs >10 pages        │
│  - Bounding box extraction          │
└──────┬──────────────────────────────┘
       │
       ▼
┌─────────────────────────────────────┐
│  Chunking Agent                     │
│  - Token-based (1000 tokens/chunk)  │
│  - 200-token overlap                │
│  - Preserves page + bbox mapping    │
└──────┬──────────────────────────────┘
       │
       ▼
┌─────────────────────────────────────┐
│  Embedding Agent (Vertex AI)        │
│  - Batch embeddings (768-dim)       │
│  - text-embedding-004 model         │
└──────┬──────────────────────────────┘
       │
       ▼
┌─────────────────────────────────────┐
│  Indexing Agent (Elasticsearch)     │
│  - Dense vector index (kNN)         │
│  - Full-text index (BM25)           │
│  - Span map storage (Firestore)     │
└─────────────────────────────────────┘

         Query Flow
         ─────────

User Query → Embedding → Hybrid Search
              (BM25 + kNN + RRF)
                   ↓
            Context Assembly
            (Top-k chunks with
             citations [1][2][3])
                   ↓
           LLM Agent (Gemini 1.5 Pro)
           System: "You are a legal AI.
                   Answer ONLY from context.
                   Use [1][2] citations."
                   ↓
           Answer with Citations
           + Bounding Box Metadata
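
Concretely, the LLM step is a single grounded generation call. A minimal sketch using the Vertex AI SDK, assuming the prompt wiring shown above (the project id is a placeholder):

import vertexai
from vertexai.generative_models import GenerativeModel

vertexai.init(project="your-gcp-project", location="us-central1")

SYSTEM = (
    "You are a legal AI. Answer ONLY from the provided context. "
    "Cite sources with bracketed numbers like [1][2]."
)

def answer(question: str, context_blocks: list[str]) -> str:
    # Number each retrieved chunk so the model can cite it as [1], [2], ...
    context = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(context_blocks))
    model = GenerativeModel("gemini-1.5-pro", system_instruction=SYSTEM)
    response = model.generate_content(f"Context:\n{context}\n\nQuestion: {question}")
    return response.text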

Key Services:

  • ingestion.py - Orchestrates the 4-step pipeline (OCR → Chunk → Embed → Index)
  • elasticsearch.py - Hybrid search with RRF (Reciprocal Rank Fusion)
  • vertex_ai.py - Gemini 1.5 Pro + text-embedding-004
  • table_analysis.py - Batch processing with background jobs

Frontend (Next.js 15 + TypeScript)

  • ShadCN UI + Tailwind for polished components
  • Radix UI Tooltips for citation previews
  • ReactMarkdown with custom citation rendering
  • PostgreSQL (Prisma) for vault/project management
  • Real-time updates for batch job progress

Tech Stack

Layer        Technology
──────────   ─────────────────────────────────────────────────
LLM          Gemini 1.5 Pro (Vertex AI)
Embeddings   text-embedding-004 (768-dim)
Search       Elasticsearch (Hybrid BM25 + kNN)
OCR          Google Document AI
Storage      GCS (raw PDFs) + Firestore (metadata) + PostgreSQL
Backend      FastAPI + Python 3.13
Frontend     Next.js 15 + TypeScript + Bun
Deployment   Cloud Run (backend) + Vercel (frontend)

Challenges I ran into

1. Document AI's 10-Page Limit

Google Document AI restricts synchronous processing to 10 pages. For a 50-page regulation PDF, this was a dealbreaker.

Solution: Built automatic PDF splitting with PyPDF2. The ingestion pipeline now:

  1. Checks page count with PdfReader
  2. Splits into 10-page chunks
  3. Processes each chunk with Document AI
  4. Merges results and adjusts page numbers
  5. Cleans up temporary files

Code:

from PyPDF2 import PdfReader, PdfWriter

reader = PdfReader(local_pdf_path)
total_pages = len(reader.pages)

if total_pages > 10:
    for chunk_start in range(0, total_pages, 10):
        # Extract the next slice of up to 10 pages
        writer = PdfWriter()
        for page_num in range(chunk_start, min(chunk_start + 10, total_pages)):
            writer.add_page(reader.pages[page_num])

        # Write the slice to GCS (chunk_gcs_uri), then OCR it
        chunk_result = self.doc_ai.process_document(chunk_gcs_uri, mime_type)

        # Merge results, shifting page numbers back: page_number += chunk_start

2. Elasticsearch Hybrid Search Tuning

Combining BM25 (keyword) and kNN (semantic) required careful tuning. Pure semantic search missed exact legal terms; pure keyword search missed context.

Solution: Implemented Reciprocal Rank Fusion (RRF):

from collections import defaultdict

def hybrid_search(query_text, query_vector, k=5, num_candidates=50):
    # BM25 (keyword) search
    bm25_results = text_search(query_text, k=num_candidates)

    # kNN (semantic) search
    knn_results = vector_search(query_vector, k=num_candidates)

    # RRF fusion: rank-based scoring, no score normalization needed
    scores = defaultdict(float)
    for rank, hit in enumerate(bm25_results):
        scores[hit.id] += 1 / (rank + 60)  # 60 = RRF rank constant
    for rank, hit in enumerate(knn_results):
        scores[hit.id] += 1 / (rank + 60)

    # Re-rank by combined score
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)[:k]

Testing on the EU AI Act demo showed 40% better precision vs. pure semantic search.

3. Bounding Box Coordinate Mapping

Document AI returns character-level bounding boxes. Chunks span multiple lines/pages, so mapping chunk text back to precise PDF coordinates was complex.

Solution: Built a character-to-location map:

# Build a character-to-location map from the Document AI layout
char_map = {}  # {char_idx: {"page": int, "bbox": [x1, y1, x2, y2]}}

for page in layout_data["pages"]:
    for token in page["tokens"]:
        for char_idx in range(token["char_start"], token["char_end"]):
            char_map[char_idx] = {
                "page": page["page_number"],
                "bbox": token["bbox"],
            }

# For each chunk, sample every 10th character to collect its bboxes
bbox_list = []
for char_idx in range(chunk_start, chunk_end, 10):
    location = char_map.get(char_idx)  # some indices (e.g. whitespace) have no token
    if location:
        bbox_list.append(location["bbox"])

Stored in Firestore as "span maps" for instant retrieval during citation rendering.
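
Persisting a span map is a single Firestore write; a minimal sketch (collection and field names are illustrative):

from google.cloud import firestore

db = firestore.Client()

def store_span_map(chunk_id: str, doc_id: str, page: int, bboxes: list[list[float]]) -> None:
    # One span-map document per chunk; read back at render time to draw highlights
    db.collection("span_maps").document(chunk_id).set({
        "doc_id": doc_id,
        "page": page,
        "bboxes": bboxes,
    })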

Accomplishments that I'm proud of

Correct Citation System

Building a citation system that goes beyond "page 3" to show exact highlighted passages with bounding boxes. This is the feature that makes ClauseSearch viable for legal work: lawyers can always cross-check the source to verify the AI isn't hallucinating.

Sub-2-Second Query Times

Optimized the entire pipeline to answer complex questions in <2 seconds:

  • Embedding: ~300ms
  • Hybrid search: ~400ms
  • LLM generation: ~1000ms
  • Total: ~1.7s average

Smart Chunking with Layout Preservation

Most RAG systems chunk text naively and lose page/bbox metadata. My chunking agent:

  • Uses token-based chunking (tiktoken) for LLM compatibility
  • Preserves character-to-bbox mapping throughout
  • Handles overlap intelligently (200-token sliding window)
  • Stores span maps for instant citation highlighting

What I learned

Hybrid Search is Non-Negotiable for Legal

I initially built pure semantic search and got burned. Queries like "GDPR Article 6" failed because the embedding model treated "Article 6" generically. Adding BM25 with RRF fusion fixed those precision issues.

Chunking Strategy Makes or Breaks RAG

I tried:

  1. Sentence-based chunking - Too small, lost context (Failed)
  2. Fixed character chunks - Split mid-sentence, garbage embeddings (Failed)
  3. Token-based with overlap - Clean boundaries, LLM-compatible, preserves context (Succeeded)

Key insight: Always use tiktoken for chunking—it matches how the LLM tokenizes text.
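
A minimal sketch of the winning strategy, token-based chunking with a sliding overlap (the encoding name here is an assumption; any tokenizer close to the target LLM's will do):

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # assumption: stand-in tokenizer

def chunk_text(text: str, size: int = 1000, overlap: int = 200) -> list[str]:
    tokens = enc.encode(text)
    chunks = []
    step = size - overlap  # advance 800 tokens, so 200 tokens repeat between chunks
    for start in range(0, len(tokens), step):
        chunks.append(enc.decode(tokens[start:start + size]))
        if start + size >= len(tokens):
            break  # the final window already covers the end of the text
    return chunks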

LLM Citation Hallucination is Real

Gemini would confidently cite [7] when only [1-5] existed. I fixed this by:

  • Using strict system prompts: "Context has citations [1] through [5]. Use ONLY these numbers."
  • Post-processing to validate citation numbers against actual context
  • Frontend re-numbering to ensure sequential [1][2][3] display
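
The post-processing step can be a short regex pass; a minimal sketch of the idea:

import re

def validate_citations(answer: str, num_sources: int) -> str:
    # Strip any [n] the context cannot back up, e.g. [7] when only [1]-[5] exist
    def keep_if_valid(match: re.Match) -> str:
        n = int(match.group(1))
        return match.group(0) if 1 <= n <= num_sources else ""
    return re.sub(r"\[(\d+)\]", keep_if_valid, answer)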

Google Cloud Services Play Together Beautifully

Document AI → GCS → Firestore → Vertex AI formed a seamless pipeline. No data marshaling between cloud providers, built-in auth, and Vertex AI's Gemini integration was chef's kiss.

What's next for ClauseSearch

1. Multi-Modal Document Understanding

Add vision capabilities (Gemini 1.5 Pro's multimodal input) to:

  • Extract data from scanned tables and charts
  • Understand diagrams and flowcharts in technical specs
  • Process handwritten notes from depositions

2. Comparative Analysis Agent

Build an agent that compares documents side-by-side:

  • "How does TechNova's AI policy differ from the EU AI Act requirements?"
  • Generate diff tables highlighting gaps and compliance issues
  • Auto-detect contradictions between internal docs and external regulations

3. Redaction & Privacy Agent

Add PII detection and auto-redaction:

  • Highlight names, SSNs, emails before export
  • One-click redaction for court filings
  • GDPR compliance mode (auto-detect personal data)

4. Legal Template Library

Pre-built analysis templates:

  • Contract Review (obligations, termination clauses, liability)
  • Due Diligence (risk factors, compliance gaps)
  • eDiscovery (responsive docs, privilege review)
  • Regulatory Compliance (requirement mapping)
