**Inspiration**
Every researcher knows the pain — hundreds of PDFs, dozens of tabs, and hours lost just finding information you already read somewhere. I watched PhD students spend entire weekends on literature reviews that should take hours. The real problem wasn't a lack of AI tools — it was that existing tools hallucinated, gave uncited answers, and weren't built around academic workflows. NotebookLM was the closest thing, but it lacked hybrid search, knowledge visualization, and self-hosting for privacy-sensitive research. I wanted to build something that treated citations as a first-class feature, not an afterthought.
**What it does**
MemexLLM is a production-ready document intelligence platform that lets you chat with your entire research library and get fully cited, hallucination-free answers in seconds.
- Upload 50+ formats — PDFs, YouTube videos, audio lectures, DOCX, PPTX, and more
- Ask questions naturally and get answers with exact page numbers and chunk-level citations
- Visualize a knowledge graph showing how concepts connect across all your sources
- Generate AI-powered multi-speaker podcasts from dense academic papers
- Auto-create flashcards, quizzes, and mind maps for retention and exam prep
- Self-host completely free under the MIT license for full data privacy
**How we built it**
The stack was chosen deliberately for each layer of the problem:
Frontend: Next.js 16 with App Router and React Server Components for performance. Optimistic UI updates make chat feel instantaneous even during streaming responses.
Backend: FastAPI for its native async support — critical for handling concurrent long-running LLM requests. Business logic lives in a clean service layer (ChatService, IngestionService) with PostgreSQL-backed background workers via Procrastinate for heavy document processing.
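The service-layer split described above can be sketched in plain Python. This is a minimal illustration, not the shipped code: `JobQueue` stands in for Procrastinate's PostgreSQL-backed queue, and the method and task names are assumptions for the sketch.

```python
from dataclasses import dataclass, field

@dataclass
class JobQueue:
    """Stand-in for the PostgreSQL-backed worker queue (Procrastinate):
    the web request only enqueues; a separate worker process executes."""
    pending: list[tuple[str, dict]] = field(default_factory=list)

    def defer(self, task: str, **kwargs) -> None:
        self.pending.append((task, kwargs))

class IngestionService:
    """Business logic lives in the service layer, not in route handlers."""
    def __init__(self, queue: JobQueue) -> None:
        self.queue = queue

    def ingest_document(self, user_id: str, file_path: str) -> str:
        # Validate cheaply in-request; defer the slow parse/embed work.
        if not file_path.endswith((".pdf", ".docx", ".pptx")):
            raise ValueError(f"unsupported format: {file_path}")
        self.queue.defer("process_document", user_id=user_id, path=file_path)
        return "queued"

queue = JobQueue()
service = IngestionService(queue)
print(service.ingest_document("u1", "paper.pdf"))  # queued
```

The point of the pattern: the HTTP request returns immediately after enqueueing, and heavy document processing runs in a worker with its own database connection.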
RAG Pipeline: Built on LlamaIndex with a multi-stage retrieval system — parallel dense (semantic) and sparse (BM25) search, HyDE for complex queries, Query Fusion to merge results, and Cohere reranking for a final precision pass. Google Gemini 2.5 Flash handles generation for its large context window and multimodal capabilities.
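The fusion step that merges dense and sparse rankings can be illustrated with reciprocal rank fusion, a common choice for this kind of merge (whether it is the exact variant used here is an assumption; the function name and sample documents are hypothetical):

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge multiple ranked result lists into one, rewarding documents
    that appear near the top of any individual ranking."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Dense (semantic) and sparse (BM25) retrievers often disagree;
# fusion surfaces documents that both consider relevant.
dense = ["doc_a", "doc_b", "doc_c"]
sparse = ["doc_b", "doc_d", "doc_a"]
print(reciprocal_rank_fusion([dense, sparse]))
# → ['doc_b', 'doc_a', 'doc_d', 'doc_c']
```

Note how `doc_b` wins despite never being ranked first by either retriever: appearing high in both lists beats appearing first in only one.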
Storage: Polyglot persistence — PostgreSQL (Supabase) for structured data, Qdrant for vector similarity search, and Supabase Storage for raw files. Row-level security ensures complete data isolation between users.
Audio: Kokoro TTS powers the multi-speaker podcast generation pipeline.
**Challenges we ran into**
Hallucination prevention was the hardest problem. We implemented a multi-layer policy system with confidence score thresholds — if retrieved context scores below 0.5, the AI returns nothing rather than guessing. Getting this balance right without making the system overly restrictive took significant tuning.
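The abstention policy reduces to a small gate in front of the LLM. A minimal sketch, assuming retrieved chunks arrive as (text, score) pairs; the function name is hypothetical, the 0.5 cutoff mirrors the write-up:

```python
def answer_or_abstain(chunks: list[tuple[str, float]], threshold: float = 0.5):
    """Keep only chunks whose retrieval score clears the threshold;
    abstain entirely rather than let the LLM guess from weak context."""
    grounded = [(text, score) for text, score in chunks if score >= threshold]
    if not grounded:
        return None  # caller replies "not found in your sources" instead of guessing
    return grounded
```

Tuning `threshold` is the whole game: too high and the system refuses valid questions, too low and weakly related context leaks into generation.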
Streaming + database consistency was a nasty race condition. Token streaming and async DB writes would conflict, risking message loss. We solved it by creating a new async session post-stream to persist messages only after the stream completes, with citations extracted and stored separately.
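The fix can be sketched with asyncio: buffer tokens as they stream, and only open the persistence step once the stream has fully drained. This is an illustrative reduction (the list `db` and the function names stand in for a real `AsyncSession` and commit):

```python
import asyncio

async def stream_tokens():
    """Stand-in for the LLM token stream."""
    for tok in ["Cited", " answer", "."]:
        await asyncio.sleep(0)  # simulate token latency
        yield tok

async def chat_turn(db: list) -> str:
    # 1. Stream tokens to the client while buffering them locally.
    buffer = []
    async for tok in stream_tokens():
        buffer.append(tok)  # in the real app: also forward to the client
    message = "".join(buffer)
    # 2. Only after the stream completes, open a *fresh* session and
    #    persist the full message, so the write never races the stream.
    db.append(message)  # stand-in for: new async session + commit
    return message

db: list[str] = []
print(asyncio.run(chat_turn(db)))  # Cited answer.
```

Keeping the write strictly after the stream means a dropped connection mid-stream can never leave a half-written message in the database.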
Memory-safe large file processing — naive document ingestion would OOM on anything over ~20MB. We rebuilt the upload pipeline with streaming chunked processing that handles 100MB+ PDFs without breaking a sweat.
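The core of chunked processing is reading a bounded window at a time instead of calling `read()` on the whole file. A minimal sketch, with hashing standing in for the real parse-and-embed step (the function name and chunk size are illustrative):

```python
import hashlib
import io

CHUNK_SIZE = 1 << 20  # 1 MiB: memory stays bounded regardless of file size

def ingest_stream(stream: io.BufferedIOBase) -> tuple[str, int]:
    """Process an upload chunk-by-chunk so a 100 MB PDF never
    materialises fully in memory."""
    digest = hashlib.sha256()
    total = 0
    while chunk := stream.read(CHUNK_SIZE):
        digest.update(chunk)  # real pipeline: parse/embed this chunk
        total += len(chunk)
    return digest.hexdigest(), total

checksum, size = ingest_stream(io.BytesIO(b"x" * (3 * CHUNK_SIZE + 5)))
print(size)  # 3145733 — processed, but never more than 1 MiB resident
```

The same loop shape applies whether the source is an HTTP upload, object storage, or a local file handle.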
Hybrid search fusion — balancing semantic and keyword retrieval results without one drowning out the other required careful weight tuning and iterative testing across diverse query types.
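One standard way to keep either signal from drowning out the other is to min-max normalise each retriever's scores before blending them with a weight. A hypothetical sketch (the `alpha` value is a tuning knob, not the shipped setting):

```python
def fuse_scores(dense: dict[str, float], sparse: dict[str, float],
                alpha: float = 0.6) -> dict[str, float]:
    """Normalise each retriever's scores to [0, 1], then blend with
    weight alpha so raw score scales (cosine vs. BM25) can't dominate."""
    def normalise(scores: dict[str, float]) -> dict[str, float]:
        if not scores:
            return {}
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0  # avoid divide-by-zero on uniform scores
        return {doc: (s - lo) / span for doc, s in scores.items()}

    d, s = normalise(dense), normalise(sparse)
    return {doc: alpha * d.get(doc, 0.0) + (1 - alpha) * s.get(doc, 0.0)
            for doc in d.keys() | s.keys()}
```

Without the normalisation step, BM25 scores (often in the tens) would swamp cosine similarities (bounded by 1) no matter how `alpha` is set, which is exactly the imbalance the tuning had to fight.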
**Accomplishments that we're proud of**
- Zero hallucinations — every response is grounded in source material with verifiable citations
- Sub-500ms retrieval — hybrid search delivers results faster than most single-method systems
- 100MB+ file support — memory-safe streaming handles documents most platforms choke on
- Full self-hosting — researchers at institutions with strict data policies can run it entirely on their own infrastructure
- Adopted at 150+ universities — real researchers using it for real work within weeks of launch
- Built solo end-to-end — frontend, backend, RAG pipeline, auth, DevOps, and design
**What we learned**
- RAG is an engineering problem, not just a prompt problem. Hybrid retrieval, reranking, and policy layers matter far more than prompt wording.
- Streaming state is deceptively complex. Managing consistency across async boundaries taught us a lot about distributed systems thinking at a small scale.
- Citations aren't a feature — they're the foundation. Building attribution in from day one changed every architecture decision downstream.
- Vector database schema design is critical early. Retrofitting chunk metadata into Qdrant payloads after the fact was painful — design for retrieval from the start.
- Users trust AI less than you expect, and rightly so. The most positive feedback was always about citations and transparency, not raw capability.
**What's next for MemexLLM**
- Real-time collaboration — multi-user notebooks with shared libraries and live co-editing
- Custom embedding models — support for domain-specific embeddings (biomedical, legal, financial)
- Advanced analytics — usage insights, retrieval quality metrics, and query performance dashboards
- Mobile app — native iOS and Android for on-the-go research with audio mode front and center
- Offline mode — local processing for air-gapped or highly sensitive research environments
- Plugin system — extensible architecture so teams can build custom content generators on top
- Integration APIs — webhooks and REST endpoints for connecting with Zotero, Notion, and Obsidian