Inspiration

Dario Amodei's essay Machines of Loving Grace opens with a question: what does a world look like where AI eliminates not just inefficiency, but injustice? One of the starkest injustices I see every day here in Kigali is the gap between the law as it is written and the law as it is understood.

Rwanda has some of the most progressive legislation on the continent — a Constitution that guarantees gender equality, a Labour Code that protects workers, a Land Law that secures tenure rights. But these documents are written in dense legal language, published in trilingual government gazettes, and effectively inaccessible to the citizens they are meant to protect. A worker who doesn't know their severance rights cannot claim them. A tenant who doesn't know eviction law cannot contest it.

The barrier isn't the law. The barrier is comprehension. That felt like exactly the kind of problem AI should solve.


What I Built

LexRwanda is a Retrieval-Augmented Generation (RAG) legal assistant grounded entirely in official Rwandan legal documents. A citizen types a question in plain English — "Can my employer fire me without notice?" — and receives a clear, accurate answer with citations to the exact article it came from.

It indexes five core Rwandan legal codes:

  • Constitution of Rwanda (Revised 2015)
  • Labour Code — Law No. 66/2018
  • Rwanda Land Law (2021)
  • Law Governing the Office of Notary (2023)
  • Notary Amendment Law

Every answer shows a confidence level (HIGH / MEDIUM / LOW), expandable source cards with verbatim excerpts, and a persistent disclaimer that this is legal information, not legal advice. When the evidence is weak, the system says so and recommends consulting a lawyer.


How I Built It

The architecture has three layers:

1. Ingestion pipeline: PDFs are parsed with pdfplumber, chunked into 500-token article-aware segments with 100-token overlap, embedded using fastembed (BAAI/bge-small-en-v1.5, run locally via ONNX with no API dependency), and stored in ChromaDB with cosine similarity indexing.

2. Retrieval: At query time, the question is embedded and the top-5 most semantically similar chunks are retrieved (minimum 0.55 cosine similarity). Confidence is computed from the average similarity score of retrieved chunks.

3. Generation: Claude (claude-sonnet-4-6) receives only the retrieved legal text, never the full corpus, and generates a grounded, plain-language answer citing specific articles. Streaming is handled via Server-Sent Events so the answer appears token by token.
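
The chunking and confidence steps in the first two layers can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual code: it treats whitespace-separated words as "tokens", leaves out the article-aware splitting (the window would be applied within each article), and the HIGH and MEDIUM cutoffs are hypothetical values chosen only for the example. Only the 500/100 window, the top-5 limit, and the 0.55 threshold come from the description above.

```python
from statistics import mean

MIN_SIMILARITY = 0.55   # threshold stated in the write-up
TOP_K = 5               # top-k stated in the write-up
HIGH_CUTOFF = 0.75      # assumption: illustrative cutoff, not from the project
MEDIUM_CUTOFF = 0.65    # assumption: illustrative cutoff, not from the project


def window_tokens(tokens, size=500, overlap=100):
    """Split a token list into windows of `size` that share `overlap` tokens."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    step = size - overlap
    windows = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the text
    return windows


def select_chunks(scored, k=TOP_K, min_similarity=MIN_SIMILARITY):
    """Keep at most k (chunk, similarity) pairs at or above the threshold."""
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    return [(chunk, sim) for chunk, sim in ranked[:k] if sim >= min_similarity]


def confidence_label(selected):
    """Map the average similarity of the selected chunks to a label."""
    if not selected:
        return "LOW"  # nothing retrieved: weakest possible evidence
    average = mean(sim for _, sim in selected)
    if average >= HIGH_CUTOFF:
        return "HIGH"
    if average >= MEDIUM_CUTOFF:
        return "MEDIUM"
    return "LOW"
```

One detail worth checking in a real setup: a vector store configured for cosine space may return a distance rather than a similarity, in which case the score would need converting (for example, similarity = 1 - distance) before it is compared against the threshold.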

The frontend is Next.js + Tailwind CSS. The backend is FastAPI. Both are deployed on Render.
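
For the generation layer, the two mechanical pieces are assembling a prompt from the retrieved excerpts only and framing the streamed tokens as Server-Sent Events. The sketch below shows one way this might look. The function names, the prompt wording, the JSON payload shape, and the closing "done" event are all assumptions made for illustration; the model call itself is left out and replaced by any iterable of text tokens.

```python
import json


def build_prompt(question, chunks):
    """Build a prompt that contains only the retrieved excerpts.

    `chunks` is a list of (source_label, text) pairs; the label format
    is an assumption made for this example.
    """
    sources = "\n\n".join(f"[{label}]\n{text}" for label, text in chunks)
    return (
        "Answer the question using only the legal excerpts below. "
        "Cite the article you rely on. "
        "If the excerpts do not answer the question, say so.\n\n"
        f"Excerpts:\n{sources}\n\n"
        f"Question: {question}"
    )


def sse_events(tokens):
    """Wrap each streamed token in a Server-Sent Events frame.

    JSON-encoding the payload keeps newlines inside a token from
    breaking the "data: ...\\n\\n" framing.
    """
    for token in tokens:
        payload = json.dumps({"token": token})
        yield f"data: {payload}\n\n"
    # Hypothetical end-of-stream marker so the client knows when to stop.
    yield "event: done\ndata: {}\n\n"
```

In a FastAPI backend, a generator like `sse_events` could be handed to `StreamingResponse` with `media_type="text/event-stream"`, and the frontend would append each token to the answer as it arrives.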


Challenges

The PDF problem was the hardest technical challenge. Rwanda's official statutes are published as A4 landscape gazette PDFs with three columns: Kinyarwanda | English | French. When I first ran ingestion, pdfplumber extracted all three columns simultaneously, producing garbled text like "umuntu person personne uburenganzira rights droits" in the same sentence. Retrieval was completely broken.

The fix was a column detection function: if a page is landscape (width > height), crop to the band between 33% and 64% of the page width, which isolates the middle (English) column. Portrait pages are extracted normally. That single function transformed retrieval quality.
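
A minimal sketch of that idea follows, assuming a pdfplumber-style page object that exposes `width`, `height`, `crop(bbox)`, and `extract_text()`. The function names are hypothetical, and the 33% and 64% bounds are the ones quoted above; a different gazette layout would need different fractions.

```python
def english_column_bbox(width, height, left=0.33, right=0.64):
    """Return the crop box for the middle column, or None for portrait pages.

    The box is (x0, top, x1, bottom) in page coordinates.
    """
    if width <= height:
        return None  # portrait page: single column, no cropping needed
    return (width * left, 0, width * right, height)


def extract_english_text(page):
    """Extract text from the English column of a landscape gazette page.

    `page` is assumed to behave like a pdfplumber page.
    """
    bbox = english_column_bbox(page.width, page.height)
    target = page if bbox is None else page.crop(bbox)
    # Guard against a page with no extractable text.
    return target.extract_text() or ""
```

With pdfplumber this could be used as `with pdfplumber.open(path) as pdf: texts = [extract_english_text(p) for p in pdf.pages]`. Keeping the box calculation separate from the extraction makes it possible to test the geometry without opening a PDF.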

The embeddings problem came next. My original plan was sentence-transformers, but it requires torch, which had no wheel for Python 3.13 at the time. I switched to fastembed, an ONNX-based library with no torch dependency, and it worked perfectly. Faster too.

The type mismatch took two hours to debug. fastembed returns numpy.float32 vectors. ChromaDB rejects them with a ValueError raised deep in its validation code. The fix was a single explicit cast: [[float(x) for x in v] for v in embeddings]. Two hours for one line.
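
Placed in context, the cast might sit in a small helper that runs just before the vectors are stored. This is a sketch under assumptions: `add_chunks` and `to_plain_floats` are hypothetical names, and `collection` is assumed to be a ChromaDB collection whose `add` method accepts `ids`, `documents`, and `embeddings` keyword arguments.

```python
def to_plain_floats(embeddings):
    """Convert each vector to a list of built-in Python floats.

    Works for any values that support float(), such as numpy.float32.
    """
    return [[float(x) for x in vector] for vector in embeddings]


def add_chunks(collection, ids, texts, embeddings):
    """Store chunks, casting the embedding values before they are validated."""
    collection.add(
        ids=ids,
        documents=texts,
        embeddings=to_plain_floats(embeddings),
    )
```

Doing the conversion at the single point where data enters the store means the rest of the pipeline can keep working with whatever numeric type the embedding library returns.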

Deployment on Render's free tier meant dealing with cold starts (services sleep after 15 min), a Python 3.14 default that broke voyageai, and the need to commit the pre-built ChromaDB (8.6 MB) into the repository so Render doesn't need to re-run ingestion on every deploy.


What I Learned

  • RAG quality lives or dies on ingestion quality. A clean, well-structured chunk is worth more than any retrieval trick. Garbage in, garbage out.
  • Streaming UX matters enormously. An answer that starts appearing immediately (even while it is still being written) feels trustworthy. A 10-second blank screen feels broken.
  • Grounding AI in primary sources (real law, real text) is not just a technical choice. It is an ethical one. Because LexRwanda answers only from the text it can retrieve, it is far less likely to hallucinate a law that doesn't exist.
  • The communities who most need AI tools are often the hardest to build for: low connectivity, multiple languages, and documents published in formats optimized for print, not machines.

What's Next

  • Kinyarwanda language support — the majority of Rwandan citizens are most comfortable in Kinyarwanda, not English
  • Expanded corpus — tax law, business registration, family law, criminal procedure
  • Voice input — for citizens with limited literacy
  • SMS/WhatsApp interface — for users without reliable internet access to a full browser

The architecture is designed to be forked. The same pipeline — ingest → embed → retrieve → generate — can serve any African jurisdiction with public legal documents.
