- ClearClause: upload any contract and get an instant AI-powered legal risk briefing
- Real-time pipeline progress via Server-Sent Events: watch OCR, PII redaction, and AI analysis happen live
- Every clause classified by risk: Rights Given Up, One-Sided Terms, Financial Impact, Missing Protections
- Your original document with color-coded clause highlights: click any card to jump to its exact location
- Ask questions about your document in plain English and get instant answers with voice input and audio output
- Privacy-first: 24 personal data items automatically redacted before AI analysis; the LLM never sees your raw data
- Quantifiable fairness scoring: see exactly where your contract falls below industry standard
- Download your annotated contract and keep a permanent copy with all flagged clauses highlighted by risk category
ClearClause: Know What You're Signing. Before You Sign It.
Inspiration
In 2023, a family in Texas filed a health insurance claim for their daughter's asthma treatment. It was denied.
The reason: a clause on page 23, paragraph 4(b), buried in a rider appendix — "Coverage excludes any condition diagnosed within the prior 60 months." They had been paying premiums for three years. They never knew.
This isn't an edge case. This is how the system works.
Health insurance policies bury coverage exclusions, pre-authorisation requirements, and out-of-network penalties deep in appendices — discovered only when you're already sick and it's too late to switch.
Employment contracts embed non-compete clauses that block you from working in your field for two years, IP assignment clauses that claim ownership of your side projects, and forced arbitration that strips your right to sue.
Rental leases include 24-hour landlord entry rights, automatic rent escalation, and liability waivers that remove protections tenants assumed they had.
The legal system is designed for people who can afford lawyers. ClearClause is designed for everyone else.
I started building this after signing an internship offer without fully understanding the IP assignment clause. It claimed ownership over anything I built — even on my own time, on my own laptop. I realised it months later. That moment stuck: a 30-second scan by the right tool would have caught it instantly.
What It Does
Upload any PDF — an insurance policy, employment contract, lease, NDA, or terms of service — and ClearClause delivers a complete legal risk briefing in under 60 seconds.
The Pipeline
1. OCR Extraction: Apryse SDK extracts every word with precise bounding-box positions. This lets the annotated viewer highlight the exact clause in your original document: the actual span, not a keyword match.
2. PII Shield — Privacy Before AI: Before any AI model sees your document, a dual-engine PII detection system (Microsoft Presidio NER + regex fallback) redacts SSNs, emails, phone numbers, credit card numbers, names, addresses, and dates of birth. For insurance and medical documents, this is non-negotiable. The LLM never sees your raw personal data.
3. Clause-by-Clause Risk Classification: Gemini 3.1 Pro analyses every clause and assigns it a category, severity, plain-English explanation, one-sentence summary, and fairness rating against industry norms:
| Icon | Category | What it means |
|---|---|---|
| 🔴 | Rights Given Up | Things you're agreeing to surrender |
| 🟠 | One-Sided Terms | Clauses that heavily favor the other party |
| 🟡 | Financial Impact | Hidden costs, penalties, escalation clauses |
| 🔵 | Missing Protections | Standard protections absent from your document |
| 🟢 | Standard | Fair, commonly-seen terms |
4. Fairness Score (0–100): A computed score tells you at a glance how balanced the document is, with side-by-side "Your Document vs. Industry Standard" comparisons for every non-standard clause, plus actionable negotiation suggestions.
5. Annotated PDF Viewer: Your original document rendered in Apryse WebViewer v10, with programmatic colour-coded highlights per clause category. Click any clause card in the dashboard to jump directly to its page, and vice versa.
6. AI Chat with Dual Retrieval: Ask questions in plain English — "What happens if I file a claim after 30 days?" or "Does this non-compete apply to freelance work?" The system uses two retrieval strategies in parallel:
- Keyword + severity scoring — relevance boosted by severity rating (critical ×3, warning ×2)
- pgvector semantic search — Gemini embedding cosine similarity for meaning-based clause matching
The best results are merged and sent to Gemini 3 Flash for fast, context-accurate answers streamed in real-time via Server-Sent Events.
7. Voice Input & Audio Output: Speak questions via microphone (Deepgram Nova-3 STT). Hear responses and document summaries read aloud (Deepgram Aura-2 TTS — opt-in, never auto-played).
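The dual retrieval in step 6 could be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the `Clause` fields, the term-overlap keyword score, and the top-k merge are all assumptions, and only the severity multipliers (critical ×3, warning ×2) come from the description above. In the real system the cosine step would presumably run inside pgvector rather than in Python.

```python
import math
import re
from dataclasses import dataclass

# Severity multipliers from the write-up: critical x3, warning x2.
SEVERITY_BOOST = {"critical": 3.0, "warning": 2.0}


@dataclass
class Clause:
    # Hypothetical record shape, chosen for illustration only.
    clause_id: str
    text: str
    severity: str            # e.g. "critical", "warning", "info"
    embedding: list[float]   # precomputed clause vector


def tokenize(text: str) -> set[str]:
    """Lowercase and split into alphanumeric terms."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def keyword_score(question: str, clause: Clause) -> float:
    """Count terms shared with the question, then boost by severity."""
    overlap = len(tokenize(question) & tokenize(clause.text))
    return overlap * SEVERITY_BOOST.get(clause.severity, 1.0)


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(question: str, question_vec: list[float],
             clauses: list[Clause], k: int = 3) -> list[Clause]:
    """Take the top-k clauses from each strategy and merge, de-duplicating by id."""
    by_keyword = sorted(clauses, key=lambda c: keyword_score(question, c), reverse=True)[:k]
    by_vector = sorted(clauses, key=lambda c: cosine(question_vec, c.embedding), reverse=True)[:k]
    merged: dict[str, Clause] = {}
    for clause in by_keyword + by_vector:
        merged.setdefault(clause.clause_id, clause)
    return list(merged.values())
```

Merging by clause id means a clause found by both strategies is sent to the chat model only once, while a clause found by either one still makes it into the context.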
What ClearClause Catches in the Real World
| Document | What Gets Flagged |
|---|---|
| Health Insurance | Coverage exclusions in riders, pre-authorization traps, out-of-network penalties, waiting period fine print |
| Life Insurance | Contestability periods, benefit reduction milestones, suicide clause timelines, premium escalation triggers |
| Employment Contract | Overbroad non-competes, IP assignment on personal projects, forced arbitration, at-will with no severance |
| Rental Lease | Landlord 24-hr entry rights, auto rent increases, security deposit forfeiture conditions, liability waivers |
| B2B / Vendor Agreement | Unlimited indemnification, auto-renewal lock-ins, SLA exclusion carve-outs, unilateral amendment clauses |
How We Built It
Backend — Python 3.11 + FastAPI
The core is a fully async pipeline: Apryse OCR → PII Redaction → Gemini Analysis → Clause-Position Matching → Vector Embedding Indexing
- Each stage pushes real-time progress via SSE — users watch the pipeline run live rather than staring at a spinner
- A token-bucket rate limiter with exponential backoff coordinates the Gemini API quota across concurrent sessions without losing requests
- A session manager gives each document an isolated pipeline with independent progress state, a 30-minute TTL, and automatic cleanup. PDF bytes are stored in PostgreSQL alongside the session row and deleted on expiry; they are never written to the application's local filesystem
- Clause embeddings are computed with Gemini text-embedding-004 and indexed in pgvector for semantic chat retrieval, with keyword/severity scoring as a fallback if the vector DB is unavailable
- Per-IP token bucket (120 req/min) and concurrent analysis limits prevent abuse
- Chat responses are streamed via SSE for real-time token-by-token delivery to the frontend
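The token-bucket and backoff ideas in the list above can be illustrated with a short sketch. It is an assumption-laden example rather than the project's implementation: the class name, the injectable `clock`, and the backoff constants are hypothetical, and only the 120 req/min figure comes from the text.

```python
import time


class TokenBucket:
    """Allow `rate_per_min` requests per minute, with bursts up to `capacity`."""

    def __init__(self, rate_per_min: float, capacity: int, clock=time.monotonic):
        self.refill_per_sec = rate_per_min / 60.0
        self.capacity = capacity
        self.tokens = float(capacity)   # start with a full bucket
        self.clock = clock              # injectable so tests can control time
        self.updated = clock()

    def try_acquire(self) -> bool:
        """Refill based on elapsed time, then spend one token if available."""
        now = self.clock()
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Exponential backoff: base * 2^attempt seconds, capped at `cap`."""
    return min(cap, base * (2 ** attempt))


# Per-IP limiting: one bucket per client address (hypothetical helper).
buckets: dict[str, TokenBucket] = {}


def allow_request(client_ip: str) -> bool:
    bucket = buckets.setdefault(client_ip, TokenBucket(rate_per_min=120, capacity=120))
    return bucket.try_acquire()
```

A caller that is refused a token would wait `backoff_delay(attempt)` seconds and try again, so requests are delayed rather than dropped.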
Frontend — React 18 + Vite
A three-panel analysis view:
- Dashboard — severity-sorted clause cards, category breakdown bar, fairness comparison tab with "Your Doc vs. Industry Standard"
- PDF Viewer — Apryse WebViewer with programmatic highlight annotations per clause, click-to-jump navigation
- AI Chat Panel — text + voice input (MediaRecorder → Deepgram Nova-3), TTS playback toggle, suggested starter questions, chat history persisted for the session.
Plus dark/light theme, first-time onboarding overlay, offline detection banner, and AbortController on every in-flight request for clean cancellation.
Infrastructure
- Backend: Dockerized Python 3.11, deployed on Akamai LKE (Kubernetes) with liveness probes, readiness checks, rolling deploys, and Kubernetes Secrets for API keys
- Frontend: Deployed on Vercel with edge delivery
- Database: PostgreSQL + pgvector for session persistence, document storage, and semantic vector retrieval
- CI/CD: GitHub Actions builds the Docker image from scratch (no cache), pushes to Docker Hub, and deploys to LKE automatically on every push to main
Technologies Used
| Technology | Role |
|---|---|
| Python 3.11 + FastAPI | Async backend with SSE streaming |
| React 18 + Vite | Frontend SPA with context, hooks, routing |
| Gemini 3.1 Pro Preview | Clause classification and risk analysis |
| Gemini 3 Flash | Low-latency conversational document Q&A (SSE streamed) |
| Gemini text-embedding-004 | Clause embeddings for semantic retrieval |
| Apryse SDK | PDF OCR with word-level bounding boxes |
| Apryse WebViewer v10 | Annotated PDF viewer with programmatic highlights |
| Deepgram Nova-3 | Speech-to-text for voice questions |
| Deepgram Aura-2 | Text-to-speech for audio responses |
| Microsoft Presidio | Named entity recognition for PII detection |
| pgvector + PostgreSQL | Vector similarity search + session persistence |
| Docker + Kubernetes (Akamai LKE) | Container orchestration with health probes |
| GitHub Actions | CI/CD auto-deploy pipeline |
| Vercel | Frontend hosting and edge delivery |
Challenges We Ran Into
Rate Limit Storm. Our frontend polled for status every 2 seconds, which works out to 30 requests per minute: exactly Gemini's rate limit of 30 req/min. Once the bucket drained, every request returned 429, but our error handler tagged all API failures as "session not found," triggering immediate retries in an infinite loop. The fix required three changes at once: differentiate 429 from 404, replace setInterval with recursive setTimeout + exponential backoff, and increase our headroom to 120 rpm.
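The shape of that fix can be sketched as below. The original change was made in the frontend's JavaScript; this version is written in Python purely for illustration, and the `fetch` callable, the `done` field, and the delay constants are assumptions. The loop waits only after each response arrives, which mirrors recursive setTimeout rather than a fixed setInterval.

```python
import time
from typing import Callable


class SessionNotFound(Exception):
    """Raised on a genuine 404, which should stop polling rather than retry."""


def poll_status(fetch: Callable[[], tuple[int, dict]],
                sleep: Callable[[float], None] = time.sleep,
                base_delay: float = 2.0,
                max_delay: float = 60.0,
                max_attempts: int = 20) -> dict:
    """Poll until the payload reports completion.

    `fetch` is a hypothetical callable returning (http_status, json_payload).
    """
    delay = base_delay
    for _ in range(max_attempts):
        status, payload = fetch()
        if status == 404:
            # A missing session is final: do not retry.
            raise SessionNotFound("session expired or never existed")
        if status == 200:
            if payload.get("done"):
                return payload
            delay = base_delay                  # healthy response: reset the delay
        elif status == 429 or status >= 500:
            delay = min(max_delay, delay * 2)   # rate limited or transient: back off
        else:
            raise RuntimeError(f"unexpected status {status}")
        sleep(delay)                            # next poll is scheduled only now
    raise TimeoutError("gave up waiting for the analysis to finish")
```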
PII Leaking to the LLM. Insurance and medical documents contain SSNs, health conditions, and financial identifiers. We caught ourselves sending raw document text to Gemini early in development. We rebuilt PII handling as a mandatory pre-processing layer — Gemini operates on placeholders, the original text is restored for viewer output only, and the mapping is tracked across analysis, so highlights still align to the correct document position.
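A minimal sketch of the placeholder approach, using only the regex side of the detection, might look like this. The patterns, the placeholder format, and the function names are illustrative assumptions; the project's real layer also uses Microsoft Presidio and tracks document positions, neither of which is shown here.

```python
import re

# Simplified US-style patterns for illustration; real detection needs far more care.
PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "PHONE": re.compile(r"\b\d{3}[-. ]\d{3}[-. ]\d{4}\b"),
}


def redact(text: str) -> tuple[str, dict[str, str]]:
    """Replace each detected value with a stable placeholder such as [SSN_1].

    Returns the redacted text plus a placeholder -> original mapping.
    """
    mapping: dict[str, str] = {}            # placeholder -> original value
    seen: dict[tuple[str, str], str] = {}   # (label, value) -> placeholder

    def make_replacer(label: str):
        def replace(match: re.Match) -> str:
            value = match.group(0)
            key = (label, value)
            if key not in seen:
                count = sum(1 for (lbl, _) in seen if lbl == label) + 1
                seen[key] = f"[{label}_{count}]"
                mapping[seen[key]] = value
            return seen[key]                # repeated values reuse one placeholder
        return replace

    for label, pattern in PATTERNS.items():
        text = pattern.sub(make_replacer(label), text)
    return text, mapping


def restore(text: str, mapping: dict[str, str]) -> str:
    """Put the original values back, e.g. for viewer output only."""
    for placeholder, value in mapping.items():
        text = text.replace(placeholder, value)
    return text
```

Only the redacted string would be sent to the model; the mapping stays on the server so the original text can be restored for display.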
Deepgram SDK v6 Breaking Changes. The v3→v6 SDK upgrade removed PrerecordedOptions entirely and made transcribe_file() keyword-only. The published docs hadn't caught up, so we had to inspect the SDK source directly to discover the new calling convention.
Accomplishments We're Proud Of
Zero PII exposure. The LLM never sees raw personal data. For insurance documents containing health histories and SSNs, this is an ethical requirement, not a feature checkbox.
Quantifiable fairness. Every document gets a 0–100 score. For the first time, you can see whether your health insurance policy is above or below industry standard on every individual clause — not just told it's "complicated."
Dual-retrieval chat that actually works. Combining keyword/severity scoring with pgvector cosine similarity makes the chat system accurate on document-specific edge cases that pure semantic search misses and pure keyword search can't reason about.
End-to-end voice. Microphone → Deepgram Nova-3 → Gemini Flash → Deepgram Aura-2 → speaker. Ask a question about your lease with your voice, and hear the answer read back. No typing required.
What We Learned
Pre-processing PII redaction before LLM calls is both a technical discipline and an ethical responsibility. It can't be a last-minute addition — it has to be architected into the pipeline from the start.
Token-bucket rate limiting is non-negotiable with metered APIs. A naive polling implementation can burn an entire API quota in seconds and then enter a retry loop that makes it worse.
SSE streaming produces a qualitatively different user experience than polling for multi-stage pipelines. Watching each step complete live — OCR, redaction, analysis — turns a 30-second wait into an engaging process.
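For readers unfamiliar with the wire format, here is a small sketch of how per-stage progress could be emitted as Server-Sent Events. The event names, payload fields, and stage list are assumptions made for the example; each frame is an `event:` line, a `data:` line, and a blank line, which is the standard SSE framing.

```python
import json
from typing import Iterator

# Hypothetical stage names, loosely following the pipeline described above.
STAGES = ["ocr", "pii_redaction", "analysis"]


def sse_event(event: str, data: dict) -> str:
    """Format one SSE frame: event line, data line, terminating blank line."""
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"


def pipeline_events(stages: list[str] = STAGES) -> Iterator[str]:
    """Yield one progress frame per stage, then a final 'done' frame."""
    total = len(stages)
    for index, stage in enumerate(stages, start=1):
        # The real work for the stage would run here before reporting progress.
        yield sse_event("progress", {"stage": stage, "step": index, "total": total})
    yield sse_event("done", {"status": "complete"})
```

In a FastAPI app, a generator like this can be returned as `StreamingResponse(pipeline_events(), media_type="text/event-stream")`, and the browser can consume it with `EventSource`.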
What's Next for ClearClause
Clause negotiation templates: AI-generated alternative language you can propose back to your insurer, landlord, or employer. Not just "this is risky" but "here's a fairer version."
Insurance plan comparison: Upload three health insurance policies side by side and see which one actually covers what you need across directly comparable clauses.
Browser extension: Analyze terms of service inline before you click "I Agree." Flag the risk level before consent is given.
Multi-language support: Contracts in Spanish, French, German, and other languages. The people most likely to sign without understanding are also the least well served by English-first legal tools.
AI Disclosure
ClearClause uses the following AI models and external services:
| Model / Service | Usage |
|---|---|
| Google Gemini 3.1 Pro Preview | Document clause analysis and risk classification |
| Google Gemini 3 Flash | Conversational document Q&A (SSE streamed) |
| Google Gemini text-embedding-004 | Clause embeddings for vector retrieval |
| Deepgram Nova-3 | Speech-to-text transcription |
| Deepgram Aura-2 | Text-to-speech audio generation |
| Apryse SDK | PDF OCR and WebViewer rendering |
| Microsoft Presidio | Named entity recognition for PII detection |
Built With
- apryse-sdk-(ocr)
- apryse-webviewer
- deepgram-aura-2-(tts)
- deepgram-nova-3-(stt)
- docker
- fastapi
- gemini-3-flash
- gemini-3.1-pro
- gemini-text-embedding-004
- github-actions
- kubernetes-(akamai-lke)
- microsoft-presidio
- pgvector
- postgresql
- python
- react-18
- vercel
- vite