ClearClause — Upload any contract and get an instant AI-powered legal risk briefing
Real-time pipeline progress via Server-Sent Events: watch OCR, PII redaction, and AI analysis happen live
Every clause classified by risk: Rights Given Up, One-Sided Terms, Financial Impact, Missing Protections
Your original document with color-coded clause highlights — click any card to jump to its exact location
Ask questions about your document in plain English — get instant answers with voice input and audio output
Privacy-first: 24 personal data items automatically redacted before AI analysis — the LLM never sees your raw data
Quantifiable fairness scoring — see exactly where your contract falls below industry standard
Download your annotated contract — keep a permanent copy with all flagged clauses highlighted by risk category

ClearClause: Know What You're Signing. Before You Sign It.

Inspiration

In 2023, a family in Texas filed a health insurance claim for their daughter's asthma treatment. It was denied.

The reason: a clause on page 23, paragraph 4(b), buried in a rider appendix — "Coverage excludes any condition diagnosed within the prior 60 months." They had been paying premiums for three years. They never knew.

This isn't an edge case. This is how the system works.

Health insurance policies bury coverage exclusions, pre-authorisation requirements, and out-of-network penalties deep in appendices — discovered only when you're already sick and it's too late to switch.

Employment contracts embed non-compete clauses that block you from working in your field for two years, IP assignment clauses that claim ownership of your side projects, and forced arbitration that strips your right to sue.

Rental leases include 24-hour landlord entry rights, automatic rent escalation, and liability waivers that remove protections tenants assumed they had.

The legal system is designed for people who can afford lawyers. ClearClause is designed for everyone else.

I started building this after signing an internship offer without fully understanding the IP assignment clause. It claimed ownership over anything I built — even on my own time, on my own laptop. I realised it months later. That moment stuck: a 30-second scan by the right tool would have caught it instantly.

What It Does

Upload any PDF — an insurance policy, employment contract, lease, NDA, or terms of service — and ClearClause delivers a complete legal risk briefing in under 60 seconds.

The Pipeline

1. OCR Extraction: Apryse SDK extracts every word with precise bounding-box positions. This enables the annotated viewer to highlight the exact clause in your original document: the actual span, not a keyword match.

2. PII Shield — Privacy Before AI: Before any AI model sees your document, a dual-engine PII detection system (Microsoft Presidio NER + regex fallback) redacts SSNs, emails, phone numbers, credit card numbers, names, addresses, and dates of birth. For insurance and medical documents, this is non-negotiable. The LLM never sees your raw personal data.

3. Clause-by-Clause Risk Classification: Gemini 3.1 Pro analyses every clause and assigns it a category, severity, plain-English explanation, one-sentence summary, and fairness rating against industry norms:

Icon	Category	What it means
🔴	Rights Given Up	Things you're agreeing to surrender
🟠	One-Sided Terms	Clauses that heavily favor the other party
🟡	Financial Impact	Hidden costs, penalties, escalation clauses
🔵	Missing Protections	Standard protections absent from your document
🟢	Standard	Fair, commonly-seen terms

4. Fairness Score (0–100): A computed score tells you at a glance how balanced the document is, with side-by-side "Your Document vs. Industry Standard" comparisons for every non-standard clause, plus actionable negotiation suggestions.

5. Annotated PDF Viewer: Your original document rendered in Apryse WebViewer v10, with programmatic colour-coded highlights per clause category. Click any clause card in the dashboard to jump directly to its page, and vice versa.

6. AI Chat with Dual Retrieval: Ask questions in plain English — "What happens if I file a claim after 30 days?" or "Does this non-compete apply to freelance work?" The system uses two retrieval strategies in parallel:

Keyword + severity scoring — relevance boosted by severity rating (critical ×3, warning ×2)
pgvector semantic search — Gemini embedding cosine similarity for meaning-based clause matching

The best results are merged and sent to Gemini 3 Flash for fast, context-accurate answers streamed in real time via Server-Sent Events.

7. Voice Input & Audio Output: Speak questions via microphone (Deepgram Nova-3 STT). Hear responses and document summaries read aloud (Deepgram Aura-2 TTS — opt-in, never auto-played).
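
The dual-retrieval idea in step 6 can be sketched roughly as follows. This is a minimal illustration, not the project's code: the `Clause` type, the function names, the toy two-dimensional embeddings, the "info" weight, and the merge rule (top results from each strategy, de-duplicated, keyword hits first) are all assumptions. Only the severity multipliers (critical ×3, warning ×2) come from the description above. In the real system the vectors would come from the Gemini embedding model and the similarity search would run inside pgvector.

```python
import math
from dataclasses import dataclass

# Multipliers for critical and warning are from the write-up; "info" = 1 is assumed.
SEVERITY_BOOST = {"critical": 3, "warning": 2, "info": 1}


@dataclass(frozen=True)
class Clause:
    id: int
    text: str
    severity: str
    embedding: tuple  # toy vector standing in for a real embedding


def keyword_score(query: str, clause: Clause) -> float:
    """Count shared words, then boost the count by clause severity."""
    overlap = set(query.lower().split()) & set(clause.text.lower().split())
    return len(overlap) * SEVERITY_BOOST.get(clause.severity, 1)


def cosine(a, b) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def merge_results(query, query_embedding, clauses, k=2):
    """Take the top-k clauses from each strategy and merge without duplicates."""
    keyword_hits = [c for c in clauses if keyword_score(query, c) > 0]
    by_keyword = sorted(keyword_hits, key=lambda c: keyword_score(query, c), reverse=True)[:k]
    by_vector = sorted(clauses, key=lambda c: cosine(query_embedding, c.embedding), reverse=True)[:k]
    merged, seen = [], set()
    for clause in by_keyword + by_vector:
        if clause.id not in seen:
            seen.add(clause.id)
            merged.append(clause)
    return merged
```

The merged list is what would be passed to the chat model as context. If the vector store is unavailable, `by_vector` can simply be left empty and the keyword results still work on their own.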

What ClearClause Catches in the Real World

Document	What Gets Flagged
Health Insurance	Coverage exclusions in riders, pre-authorization traps, out-of-network penalties, waiting period fine print
Life Insurance	Contestability periods, benefit reduction milestones, suicide clause timelines, premium escalation triggers
Employment Contract	Overbroad non-competes, IP assignment on personal projects, forced arbitration, at-will with no severance
Rental Lease	Landlord 24-hr entry rights, auto rent increases, security deposit forfeiture conditions, liability waivers
B2B / Vendor Agreement	Unlimited indemnification, auto-renewal lock-ins, SLA exclusion carve-outs, unilateral amendment clauses

How We Built It

Backend — Python 3.11 + FastAPI

The core is a fully async pipeline: Apryse OCR → PII Redaction → Gemini Analysis → Clause-Position Matching → Vector Embedding Indexing

Each stage pushes real-time progress via SSE — users watch the pipeline run live rather than staring at a spinner
A token-bucket rate limiter with exponential backoff coordinates the Gemini API quota across concurrent sessions without losing requests
A session manager gives each document an isolated pipeline with independent progress state, a 30-minute TTL, and automatic cleanup; PDF bytes are stored in PostgreSQL alongside the session row and deleted on expiry, and are never written to the application server's local disk
Clause embeddings are computed with Gemini text-embedding-004 and indexed in pgvector for semantic chat retrieval, with keyword/severity scoring as a fallback if the vector DB is unavailable
Per-IP token bucket (120 req/min) and concurrent analysis limits prevent abuse
Chat responses are streamed via SSE for real-time token-by-token delivery to the frontend
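
As a rough illustration of the per-IP token bucket mentioned above, here is a minimal sketch. The class names, the burst capacity, and the injectable `clock` parameter are assumptions made for the example; only the 120 req/min rate comes from the list above, and the production limiter may be structured differently.

```python
import time


class TokenBucket:
    """Refills continuously at rate_per_min and allows bursts up to capacity."""

    def __init__(self, rate_per_min: float, capacity: float, clock=time.monotonic):
        self.rate = rate_per_min / 60.0  # tokens added per second
        self.capacity = capacity
        self.tokens = capacity
        self.clock = clock
        self.last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self.clock()
        # Add the tokens earned since the last call, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


class PerIPLimiter:
    """Keeps one independent bucket per client IP address."""

    def __init__(self, rate_per_min: float = 120, capacity: float = 120, clock=time.monotonic):
        self.rate_per_min = rate_per_min
        self.capacity = capacity
        self.clock = clock
        self.buckets = {}

    def allow(self, ip: str) -> bool:
        if ip not in self.buckets:
            self.buckets[ip] = TokenBucket(self.rate_per_min, self.capacity, self.clock)
        return self.buckets[ip].allow()
```

A request handler would call `limiter.allow(client_ip)` and return HTTP 429 when it yields `False`. Passing the clock in as a parameter keeps the limiter testable without real sleeps.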

Frontend — React 18 + Vite

A three-panel analysis view:

Dashboard — severity-sorted clause cards, category breakdown bar, fairness comparison tab with "Your Doc vs. Industry Standard"
PDF Viewer — Apryse WebViewer with programmatic highlight annotations per clause, click-to-jump navigation
AI Chat Panel — text + voice input (MediaRecorder → Deepgram Nova-3), TTS playback toggle, suggested starter questions, chat history persisted for the session.

Plus dark/light theme, first-time onboarding overlay, offline detection banner, and AbortController on every in-flight request for clean cancellation.

Infrastructure

Backend: Dockerized Python 3.11, deployed on Akamai LKE (Kubernetes) with liveness probes, readiness checks, rolling deploys, and Kubernetes Secrets for API keys
Frontend: Deployed on Vercel with edge delivery
Database: PostgreSQL + pgvector for session persistence, document storage, and semantic vector retrieval
CI/CD: GitHub Actions builds the Docker image from scratch (no cache), pushes to Docker Hub, and deploys to LKE automatically on every push to main

Technologies Used

Technology	Role
Python 3.11 + FastAPI	Async backend with SSE streaming
React 18 + Vite	Frontend SPA with context, hooks, routing
Gemini 3.1 Pro Preview	Clause classification and risk analysis
Gemini 3 Flash	Low-latency conversational document Q&A (SSE streamed)
Gemini text-embedding-004	Clause embeddings for semantic retrieval
Apryse SDK	PDF OCR with word-level bounding boxes
Apryse WebViewer v10	Annotated PDF viewer with programmatic highlights
Deepgram Nova-3	Speech-to-text for voice questions
Deepgram Aura-2	Text-to-speech for audio responses
Microsoft Presidio	Named entity recognition for PII detection
pgvector + PostgreSQL	Vector similarity search + session persistence
Docker + Kubernetes (Akamai LKE)	Container orchestration with health probes
GitHub Actions	CI/CD auto-deploy pipeline
Vercel	Frontend hosting and edge delivery

Challenges We Ran Into

Rate Limit Storm. Our frontend polled for status every 2 seconds, or 30 requests per minute, which exactly matched Gemini's rate limit of 30 req/min and left no headroom. Once the bucket drained, every request returned 429, but our error handler tagged all API failures as "session not found," triggering immediate retries in an infinite loop. The fix required three changes at once: differentiate 429 from 404, replace setInterval with recursive setTimeout plus exponential backoff, and raise our limit to 120 req/min for headroom.
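
The retry decision behind that fix can be sketched as a small function. The real frontend is JavaScript and uses setTimeout; the version below is written in Python only for illustration, and the function name, the 2-second base, and the 60-second cap are assumptions rather than the project's actual values.

```python
def next_poll_delay(status: int, consecutive_429s: int,
                    base: float = 2.0, cap: float = 60.0):
    """Return seconds to wait before the next poll, or None to stop polling.

    consecutive_429s counts throttled responses in a row, including this one.
    """
    if status == 404:
        # The session really is gone: retrying cannot help, so stop.
        return None
    if status == 429:
        # Throttled: back off exponentially instead of retrying immediately.
        return min(cap, base * 2 ** consecutive_429s)
    # Normal response: keep the regular polling interval.
    return base
```

Because 429 and 404 take different branches, a throttled client slows down (4 s, 8 s, 16 s, and so on up to the cap) instead of treating the session as lost and hammering the API.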

PII Leaking to the LLM. Insurance and medical documents contain SSNs, health conditions, and financial identifiers. We caught ourselves sending raw document text to Gemini early in development. We rebuilt PII handling as a mandatory pre-processing layer — Gemini operates on placeholders, the original text is restored for viewer output only, and the mapping is tracked across analysis, so highlights still align to the correct document position.
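
A minimal sketch of the placeholder approach, covering only the regex fallback path, might look like this. The patterns, placeholder format, and function names are assumptions for illustration; the real pipeline also runs Microsoft Presidio NER and tracks character offsets so that highlights stay aligned, which this sketch leaves out.

```python
import re

# Illustrative patterns only; real-world PII formats vary far more than this.
PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}


def redact(text: str):
    """Replace each match with a numbered placeholder and remember the original."""
    mapping = {}
    for label, pattern in PATTERNS.items():
        def substitute(match, label=label):
            placeholder = f"<{label}_{len(mapping) + 1}>"
            mapping[placeholder] = match.group(0)
            return placeholder
        text = pattern.sub(substitute, text)
    return text, mapping


def restore(text: str, mapping: dict) -> str:
    """Put the original values back, for viewer output only."""
    for placeholder, original in mapping.items():
        text = text.replace(placeholder, original)
    return text
```

Only the redacted string would be sent to the model. The mapping stays on the server, so model output that mentions a placeholder can be resolved back to the original value when rendering for the user.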

Deepgram SDK v6 Breaking Changes. The v3→v6 SDK upgrade removed PrerecordedOptions entirely and made transcribe_file() keyword-only. The published docs hadn't caught up — we had to inspect the SDK source directly to discover the new calling convention.

Accomplishments We're Proud Of

Zero PII exposure. The LLM never sees raw personal data. For insurance documents containing health histories and SSNs, this is an ethical requirement, not a feature checkbox.

Quantifiable fairness. Every document gets a 0–100 score. For the first time, you can see whether your health insurance policy is above or below industry standard on every individual clause, instead of just being told it's "complicated."

Dual-retrieval chat that actually works. Combining keyword/severity scoring with pgvector cosine similarity makes the chat system accurate on document-specific edge cases that pure semantic search misses and pure keyword search can't reason about.

End-to-end voice. Microphone → Deepgram Nova-3 → Gemini Flash → Deepgram Aura-2 → speaker. Ask a question about your lease with your voice, and hear the answer read back. No typing required.

What We Learned

Pre-processing PII redaction before LLM calls is both a technical discipline and an ethical responsibility. It can't be a last-minute addition — it has to be architected into the pipeline from the start.

Token-bucket rate limiting is non-negotiable with metered APIs. A naive polling implementation can burn an entire API quota in seconds and then enter a retry loop that makes it worse.

SSE streaming produces a qualitatively different user experience than polling for multi-stage pipelines. Watching each step complete live — OCR, redaction, analysis — turns a 30-second wait into an engaging process.
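
For reference, the wire format behind those live updates is simple. The sketch below frames one event per pipeline stage; the event names, the payload fields, and the `pipeline_events` generator are assumptions for illustration, not the project's actual schema.

```python
import json


def format_sse(event: str, data: dict) -> str:
    """Frame one Server-Sent Event: an event line, a data line, a blank line."""
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"


def pipeline_events(stages):
    """Yield a progress event per stage, then a final 'done' event."""
    total = len(stages)
    for step, stage in enumerate(stages, start=1):
        # In a real pipeline the stage's work would run here before reporting.
        yield format_sse("progress", {"stage": stage, "step": step, "total": total})
    yield format_sse("done", {})
```

In FastAPI, a generator like this could be returned through `StreamingResponse` with `media_type="text/event-stream"`, and the browser would consume it with `EventSource`.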

What's Next for ClearClause

Clause negotiation templates: AI-generated alternative language you can propose back to your insurer, landlord, or employer. Not just "this is risky" but "here's a fairer version."

Insurance plan comparison: Upload three health insurance policies side by side and see which one actually covers what you need across directly comparable clauses.

Browser extension: Analyze terms of service inline before you click "I Agree." Flag the risk level before consent is given.

Multi-language support: Contracts in Spanish, French, German, and other languages. The people most likely to sign without understanding are also the ones least served by English-first legal tools.

AI Disclosure

ClearClause uses the following AI models and external services:

Model / Service	Usage
Google Gemini 3.1 Pro Preview	Document clause analysis and risk classification
Google Gemini 3 Flash	Conversational document Q&A (SSE streamed)
Google Gemini text-embedding-004	Clause embeddings for vector retrieval
Deepgram Nova-3	Speech-to-text transcription
Deepgram Aura-2	Text-to-speech audio generation
Apryse SDK	PDF OCR and WebViewer rendering
Microsoft Presidio	Named entity recognition for PII detection