Inspiration

Every researcher knows the pain — hundreds of PDFs, dozens of tabs, and hours lost just finding information you already read somewhere. I watched PhD students spend entire weekends on literature reviews that should take hours. The real problem wasn't a lack of AI tools — it was that existing tools hallucinated, gave uncited answers, and weren't built around academic workflows. NotebookLM was the closest thing, but it lacked hybrid search, knowledge visualization, and self-hosting for privacy-sensitive research. I wanted to build something that treated citations as a first-class feature, not an afterthought.


What it does

MemexLLM is a production-ready document intelligence platform that lets you chat with your entire research library and get fully cited, source-grounded answers in seconds.

  • Upload 50+ formats — PDFs, YouTube videos, audio lectures, DOCX, PPTX and more
  • Ask questions naturally and get answers with exact page numbers and chunk-level citations
  • Visualize a knowledge graph showing how concepts connect across all your sources
  • Generate AI-powered multi-speaker podcasts from dense academic papers
  • Auto-create flashcards, quizzes, and mind maps for retention and exam prep
  • Self-host completely free under the MIT license for full data privacy

How we built it

The stack was chosen deliberately for each layer of the problem:

Frontend: Next.js 16 with App Router and React Server Components for performance. Optimistic UI updates make chat feel instantaneous even during streaming responses.

Backend: FastAPI for its native async support — critical for handling concurrent long-running LLM requests. Business logic lives in a clean service layer (ChatService, IngestionService) with PostgreSQL-backed background workers via Procrastinate for heavy document processing.
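Illustrative only: a stdlib sketch of the defer-then-process shape those background workers follow. All names here are hypothetical, and the real queue is PostgreSQL-backed via Procrastinate, not in-memory — the point is that the request handler only defers, and heavy work runs off the request path.

```python
import asyncio
from dataclasses import dataclass, field

@dataclass
class JobQueue:
    """Toy stand-in for a PostgreSQL-backed task queue."""
    jobs: asyncio.Queue = field(default_factory=asyncio.Queue)

    async def defer(self, name, payload):
        # The API handler calls this and returns immediately.
        await self.jobs.put((name, payload))

    async def worker(self, handlers, results):
        # A separate worker process drains the queue.
        while not self.jobs.empty():
            name, payload = await self.jobs.get()
            results.append(await handlers[name](payload))

async def ingest_document(payload):
    # Heavy parsing/chunking/embedding would happen here, off the request path.
    return f"processed:{payload['doc_id']}"

async def main():
    q = JobQueue()
    await q.defer("ingest_document", {"doc_id": "abc123"})
    results = []
    await q.worker({"ingest_document": ingest_document}, results)
    return results

print(asyncio.run(main()))  # ['processed:abc123']
```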

RAG Pipeline: Built on LlamaIndex with a multi-stage retrieval system — parallel dense (semantic) and sparse (BM25) search, HyDE for complex queries, Query Fusion to merge results, and Cohere reranking for a final precision pass. Google Gemini 2.5 Flash handles generation for its large context window and multimodal capabilities.
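One common way a fusion step merges the dense and sparse result lists is reciprocal rank fusion: each document scores 1/(k + rank) per list, summed across lists, so documents that rank well in both retrievers rise to the top. A from-scratch sketch with toy chunk IDs (not the actual LlamaIndex internals):

```python
def reciprocal_rank_fusion(result_lists, k=60):
    """Merge ranked lists from dense (semantic) and sparse (BM25) retrievers.

    Each list is ordered best-first; a document scores 1/(k + rank)
    per list it appears in, and scores are summed across lists.
    """
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense = ["chunk_7", "chunk_2", "chunk_9"]  # semantic hits, best first
sparse = ["chunk_2", "chunk_4"]            # BM25 hits, best first
print(reciprocal_rank_fusion([dense, sparse]))
# ['chunk_2', 'chunk_7', 'chunk_4', 'chunk_9']
```

Note how `chunk_2`, present in both lists, outranks `chunk_7` even though `chunk_7` was the top semantic hit — agreement between retrievers is rewarded.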

Storage: Polyglot persistence — PostgreSQL (Supabase) for structured data, Qdrant for vector similarity search, and Supabase Storage for raw files. Row-level security ensures complete data isolation between users.

Audio: Kokoro TTS powers the multi-speaker podcast generation pipeline.


Challenges we ran into

Hallucination prevention was the hardest problem. We implemented a multi-layer policy system with confidence score thresholds — if retrieved context scores below 0.5, the AI returns nothing rather than guessing. Getting this balance right without making the system overly restrictive took significant tuning.
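A minimal sketch of that policy layer — the 0.5 threshold comes from the text above, while the function and field names are made up for illustration:

```python
MIN_SCORE = 0.5  # below this, refuse rather than guess

def apply_grounding_policy(chunks, threshold=MIN_SCORE):
    """Keep only retrieved chunks confident enough to cite.

    chunks: list of (text, retrieval_score) pairs, scores in [0, 1].
    Returns the context to hand to the LLM, or None to signal refusal.
    """
    grounded = [text for text, score in chunks if score >= threshold]
    return grounded or None

hits = [("RAG combines retrieval with generation.", 0.82),
        ("Unrelated tangent about pricing.", 0.31)]
print(apply_grounding_policy(hits))      # ['RAG combines retrieval with generation.']
print(apply_grounding_policy([("weak match", 0.2)]))  # None -> refuse to answer
```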

Streaming + database consistency was a nasty race condition. Token streaming and async DB writes would conflict, risking message loss. We solved it by opening a fresh async session once the stream completes and persisting the message only then, with citations extracted and stored separately.
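The fix can be sketched like this: buffer tokens as they stream to the client, then persist in a session opened only after the generator is exhausted. Names are hypothetical, and a plain list stands in for the database.

```python
import asyncio

async def fake_llm_stream():
    # Stand-in for the token generator from the LLM.
    for tok in ["Grounded ", "answer ", "[1]"]:
        yield tok

async def persist_message(db, text, citations):
    # In the real system this opens a NEW async session here, so
    # streaming and DB writes never share a transaction.
    db.append({"text": text, "citations": citations})

async def stream_chat(db):
    tokens = []
    async for tok in fake_llm_stream():
        tokens.append(tok)  # each token is also forwarded to the client here
    text = "".join(tokens)
    citations = [t for t in tokens if t.startswith("[")]
    await persist_message(db, text, citations)  # only after the stream ends
    return text

db = []
print(asyncio.run(stream_chat(db)))  # Grounded answer [1]
```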

Memory-safe large file processing — naive document ingestion would OOM on anything over ~20MB. We rebuilt the upload pipeline with streaming chunked processing that handles 100MB+ PDFs without breaking a sweat.
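The core idea is simple: never hold the whole upload in memory. A stdlib sketch of chunked processing (the real pipeline parses and embeds per chunk; a running hash stands in for that work here):

```python
import hashlib
import io

CHUNK_SIZE = 1 << 20  # 1 MiB: memory stays bounded regardless of file size

def ingest_stream(fileobj, chunk_size=CHUNK_SIZE):
    """Process an upload in fixed-size chunks instead of reading it whole."""
    digest = hashlib.sha256()
    total = 0
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            break
        digest.update(chunk)  # stand-in for per-chunk parse/embed work
        total += len(chunk)
    return total, digest.hexdigest()

# A ~3 MiB payload is handled in four reads, never materialized twice.
size, checksum = ingest_stream(io.BytesIO(b"x" * (3 * CHUNK_SIZE + 5)))
print(size)  # 3145733
```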

Hybrid search fusion — balancing semantic and keyword retrieval results without one drowning out the other required careful weight tuning and iterative testing across diverse query types.


Accomplishments that we're proud of

  • Grounded answers — every response is backed by source material with verifiable citations
  • Sub-500ms retrieval — hybrid search delivers results faster than most single-method systems
  • 100MB+ file support — memory-safe streaming handles documents most platforms choke on
  • Full self-hosting — researchers at institutions with strict data policies can run it entirely on their own infrastructure
  • Adopted at 150+ universities — real researchers using it for real work within weeks of launch
  • Built solo end-to-end — frontend, backend, RAG pipeline, auth, DevOps, and design

What we learned

  • RAG is an engineering problem, not just a prompt problem. Hybrid retrieval, reranking, and policy layers matter far more than prompt wording.
  • Streaming state is deceptively complex. Managing consistency across async boundaries taught us a lot about distributed systems thinking at a small scale.
  • Citations aren't a feature — they're the foundation. Building attribution in from day one changed every architecture decision downstream.
  • Vector database schema design is critical early. Retrofitting chunk metadata into Qdrant payloads after the fact was painful — design for retrieval from the start.
  • Users trust AI less than you expect, and rightly so. The most positive feedback was always about citations and transparency, not raw capability.
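The Qdrant lesson above can be made concrete: decide up front which fields ride along in every chunk's payload, because those fields are what make chunk-level citations cheap at answer time. Field names here are illustrative, not the actual schema.

```python
def chunk_payload(doc_id, page, chunk_index, text, source_title):
    """Payload stored alongside each vector so retrieval can cite precisely.

    Designing these fields before ingesting anything (rather than
    retrofitting them later) lets the answer layer emit page-level
    citations without extra database lookups.
    """
    return {
        "doc_id": doc_id,
        "source_title": source_title,
        "page": page,              # exact page for the citation
        "chunk_index": chunk_index,
        "text": text,              # raw chunk text returned with the hit
    }

def format_citation(payload):
    return f"[{payload['source_title']}, p. {payload['page']}]"

p = chunk_payload("doc-42", 7, 3, "Attention is computed as ...", "Some Paper")
print(format_citation(p))  # [Some Paper, p. 7]
```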

What's next for MemexLLM

  • Real-time collaboration — multi-user notebooks with shared libraries and live co-editing
  • Custom embedding models — support for domain-specific embeddings (biomedical, legal, financial)
  • Advanced analytics — usage insights, retrieval quality metrics, and query performance dashboards
  • Mobile app — native iOS and Android for on-the-go research with audio mode front and center
  • Offline mode — local processing for air-gapped or highly sensitive research environments
  • Plugin system — extensible architecture so teams can build custom content generators on top
  • Integration APIs — webhooks and REST endpoints for connecting with Zotero, Notion, and Obsidian
