RedlineAI

RedlineAI ingests real-world contracts (PDF, DOCX, or scans), classifies clauses, extracts key terms, flags risks against your house policy, proposes redlines, and sends alerts by email, SMS, voice call, or calendar event.

It runs on FastAPI, TiDB Serverless (SQL + Vector + FTS), and OpenAI, with a LangGraph-based agentic ingest pipeline.


🌱 Inspiration

Legal review weeks are chaos: scattered PDFs, manual searches, forgotten renewal windows, and “where’s that indemnity clause?” at 11:58 PM.

We wanted a reviewer-first system that:

  • Understands contracts, not just text.
  • Explains risk with citations to policy.
  • Suggests concrete edits (redlines) you can paste into Word.
  • Remembers deadlines and pings the right humans automatically.

🧠 What it does

  • Ingest: OCR, parse, chunk by headings/clauses; create embeddings and a full-text (FTS) index.
  • Classify: Clause types (Auto-Renewal, Indemnity, DPA, SLA Uptime, etc.).
  • Extract: Dates, thresholds, renewal windows, liability caps, uptime %, notice periods.
  • Assess risk: Compare to policy; score and explain; link to rules.
  • Redline: Suggest strict/medium/soft rewrite alternatives.
  • Summarize: One-page exec brief; “what’s unusual, what’s due.”
  • Search/Q&A: Hybrid semantic + keyword (contract-scoped).
  • Alert: Email, SMS, or voice call when severity ≥ threshold or deadlines approach.
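
The alert rule in the last bullet can be sketched as a small predicate. This is a minimal illustration, not the project's actual code: the function name `should_alert`, the default threshold of 7, and the 30-day lead time are all assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def should_alert(
    severity: int,
    due_at: Optional[datetime],
    now: datetime,
    min_severity: int = 7,                        # assumed threshold
    lead_time: timedelta = timedelta(days=30),    # assumed reminder window
) -> bool:
    """Return True when a risk is severe enough or its deadline is near."""
    if severity >= min_severity:
        return True
    # A deadline inside the lead-time window (or already overdue) also triggers.
    if due_at is not None and due_at - now <= lead_time:
        return True
    return False


now = datetime(2025, 1, 1, tzinfo=timezone.utc)
print(should_alert(3, now + timedelta(days=10), now))
```

Passing `now` explicitly rather than calling `datetime.now()` inside the function keeps the rule easy to test.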

🧰 How we built it

Stack

  • API: FastAPI + Uvicorn
  • DB: TiDB Serverless (transactional SQL + Vector + Full-Text Search)
    • Tables: contracts, tidb_vector_langchain (embeddings), clauses, risks, alerts, audit_log, users
  • Embeddings: OpenAI via langchain-openai
  • Vector store: TiDBVectorStore (stores document, embedding, meta)
  • LLM: OpenAI (classification, extraction, risk rationale, redlines)
  • Storage: S3 (original files) with presigned GET
  • Notifications: SendGrid (email), Twilio (SMS + voice), Google Calendar (optional)
  • Agent runtime: LangGraph (ingest pipeline), lightweight services for processing/alerts

Endpoints (core)

  • POST /api/v1/ingest: parse + chunk + embed (plus S3 upload when the user is logged in); idempotent by file hash
  • POST /api/v1/contracts/{id}/process?use_llm=true: classify/extract/assess/write risks (idempotent; pass force=true to reprocess)
  • GET /api/v1/contracts/{id}/risks?min_severity=5&clause_type=Auto-Renewal
  • POST /api/v1/contracts/{id}/qa: MMR retrieval + answers with citations
  • GET /api/v1/contracts/{id}/summary
  • GET /api/v1/alerts/due
  • POST /api/v1/alerts/dispatch: agentic notifications
  • Users:
    • GET /api/v1/users/{user_id}/contracts
    • GET /api/v1/users/{user_id}/contracts/{contract_id}/presign
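
To illustrate what "idempotent by file hash" on the ingest endpoint might look like, here is a minimal in-memory sketch. The `IngestRegistry` class and its `ingest` method are hypothetical; the real service presumably looks up `contracts.sha256` in TiDB instead of a Python dict.

```python
import hashlib
from typing import Dict, Tuple


class IngestRegistry:
    """Hypothetical stand-in for a lookup on the contracts.sha256 column."""

    def __init__(self) -> None:
        self._by_hash: Dict[str, int] = {}
        self._next_id = 1

    def ingest(self, data: bytes) -> Tuple[int, bool]:
        """Return (contract_id, created). Re-uploading the same bytes is a no-op."""
        digest = hashlib.sha256(data).hexdigest()
        if digest in self._by_hash:
            return self._by_hash[digest], False
        contract_id = self._next_id
        self._next_id += 1
        self._by_hash[digest] = contract_id
        return contract_id, True


registry = IngestRegistry()
first = registry.ingest(b"%PDF-1.7 example bytes")
again = registry.ingest(b"%PDF-1.7 example bytes")
print(first, again)
```

Hashing the raw bytes means a renamed copy of the same file maps to the same contract, while any byte-level change produces a new one.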

Data model (simplified)

contracts(id, user_id, tenant, doc_type, original_filename, file_url, sha256, uploaded_at, …)

tidb_vector_langchain(id, embedding, document, meta JSON)
  ← meta.contract_id, meta.chunk_index, meta.page, …

clauses(id, contract_id, chunk_id, clause_type, confidence, extracted_json)

risks(id, contract_id, clause_id, severity, rule_id, rationale, suggested_fix)

alerts(id, contract_id, risk_id, kind, severity, message, channel_json, due_at, status, …)

# 😵‍💫 Challenges

- **PDFs are messy**: mixed fonts, headers/footers, TOCs — chunking by headings + layout metadata helped a lot.  
- **Latency**: embeddings + LLM can be slow — we parallelized where safe, cached embeddings, and streamed UI updates.  
- **Notifications**: to avoid spamming recipients, the alerts table carries `status`, `channel_json`, and `due_at`, plus unique keys to dedupe.  
- **Policy drift**: we version rules (`rule_id`) and log to `audit_log` for repeatability.  
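
The alert dedupe idea can be sketched with a unique key and an insert that ignores duplicates. This example uses SQLite from the standard library purely so it runs anywhere; it is not the project's TiDB schema, and the choice of `(contract_id, risk_id, kind)` as the unique key is an assumption. On a MySQL-compatible database such as TiDB, the rough equivalent would be `INSERT IGNORE` or `INSERT ... ON DUPLICATE KEY UPDATE`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE alerts (
        id          INTEGER PRIMARY KEY,
        contract_id INTEGER NOT NULL,
        risk_id     INTEGER NOT NULL,
        kind        TEXT    NOT NULL,
        due_at      TEXT,
        status      TEXT    NOT NULL DEFAULT 'pending',
        UNIQUE (contract_id, risk_id, kind)  -- assumed dedupe key
    )
    """
)


def queue_alert(conn, contract_id, risk_id, kind, due_at):
    """Insert an alert once; return False if the same alert already exists."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO alerts (contract_id, risk_id, kind, due_at) "
        "VALUES (?, ?, ?, ?)",
        (contract_id, risk_id, kind, due_at),
    )
    conn.commit()
    return cur.rowcount == 1


print(queue_alert(conn, 1, 10, "renewal", "2025-06-01"))
print(queue_alert(conn, 1, 10, "renewal", "2025-06-01"))
```

Letting the database enforce uniqueness means a retried dispatch job cannot queue the same alert twice, even if two workers race.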

---

# 🧪 What we learned

- **One database is cleaner**: keeping SQL, Vector, and FTS together in TiDB removed the glue code (and frustration) of syncing separate stores.  
- **Typed agent steps**: JSON schemas per step tame LLM variability and make retries sane.  
- **Contract-scoped RAG matters**: restrict retrieval by `meta.contract_id` to avoid cross-document leakage.  
- **Idempotency everywhere**: ingest keyed by `sha256`, processing that reruns only with `force=true`, and alert upserts all mean less production pain.  
- **People want evidence**: every risk points back to the clause, policy rule, and a suggested fix.  
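
As a rough illustration of contract-scoped retrieval, the sketch below filters chunks by `meta.contract_id` before ranking them by cosine similarity. It is an in-memory stand-in, not the TiDBVectorStore query the project uses; the `search_contract` helper and the toy two-dimensional embeddings are assumptions for the example.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search_contract(
    chunks: List[Dict], query_vec: Sequence[float], contract_id: int, k: int = 3
) -> List[Dict]:
    """Rank only the chunks that belong to the requested contract."""
    scoped = [c for c in chunks if c["meta"]["contract_id"] == contract_id]
    scoped.sort(key=lambda c: cosine(c["embedding"], query_vec), reverse=True)
    return scoped[:k]


chunks = [
    {"document": "Auto-renewal clause", "embedding": [1.0, 0.0],
     "meta": {"contract_id": 1, "chunk_index": 0}},
    {"document": "Indemnity clause", "embedding": [0.0, 1.0],
     "meta": {"contract_id": 1, "chunk_index": 1}},
    {"document": "Another contract's renewal clause", "embedding": [1.0, 0.0],
     "meta": {"contract_id": 2, "chunk_index": 0}},
]
hits = search_contract(chunks, [0.9, 0.1], contract_id=1, k=2)
print([h["document"] for h in hits])
```

Filtering before ranking, rather than after, guarantees that a highly similar chunk from another contract can never crowd out or leak into the results.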

---

# 🚀 What’s next for RedlineAI

- Smarter clause similarity detection.  
- More integrations (Slack, Teams, Jira).  
- Fine-tuned models for niche contract types.  
- Multi-tenant dashboards with audit + reporting.  
- Workflow APIs for plugging into enterprise CLM.  
