Inspiration

The inspiration came from the Technicia Hackathon: advanced AI should be usable by everyone, not just people who fit a narrow range of abilities or interaction styles. While building an agentic-memory multimodal assistant, we realized that most tools don’t consider users with dyslexia, color-vision differences, or mobility challenges. We wanted to change that. Our goal was to create an intelligent system that doesn’t just respond but adapts: a system with high-contrast themes for color-blind users and dyslexia-friendly typography. This project was inspired by the belief that accessibility isn’t an add-on; it’s the foundation of truly human-centered technology.

What it does

We are building a multimodal, memory‑augmented agent system on top of the ROMA framework to deliver fast, reliable answers for both text and image+text queries. ROMA orchestrates multiple LLM roles in a recursive pipeline: the Atomizer structures the problem and inputs (including document images), the Planner decides whether a query is simple enough to answer directly or, if complex, decomposes it into a plan of sub‑tasks, the Executor carries out each sub‑task by invoking tools and LLMs, the Aggregator fuses intermediate results into coherent outputs, and the Verifier checks and refines the final answer. For simple tasks, ROMA short‑circuits to a direct response; for complex tasks, it performs deep recursive search, expanding, executing, and validating plans until the goal is met or a depth limit is reached, giving us both speed and robustness.

A dedicated memory agent persists salient facts and conversation history so the system can reuse prior context, refine new answers with what it has learned, and update memory after each run. We store this memory and domain knowledge in Qdrant as vector embeddings, organized into collections (e.g., Capital One public transaction data from the Nessie API, as well as exercise and wellness_recipes datasets), enabling efficient retrieval‑augmented generation that grounds reasoning in relevant past interactions and domain data. During execution, the agent can leverage external tools such as a calculator and web search when helpful.

Altogether, this architecture combines efficient handling of simple queries, rigorous recursive planning for complex ones, and persistent, retrieval‑backed memory to improve accuracy, continuity, and performance over time.
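The recursive pipeline above can be sketched as a short control loop. This is an illustrative stand-in, not the real ROMA API: the function names (`atomize`, `plan`, `execute`, `aggregate`, `verify`) and the toy implementations are ours, chosen only to show how atomic tasks short-circuit while complex tasks recurse up to a depth limit.

```python
# Illustrative sketch of a ROMA-style recursive loop (not the real ROMA API).
# The toy role implementations below just split a goal on " and ".

MAX_DEPTH = 3  # recursion guard, mirroring ROMA's depth limit

def atomize(goal):   return goal.strip()            # structure the problem
def is_atomic(task): return " and " not in task     # simple enough to answer?
def plan(task):      return task.split(" and ")     # decompose into sub-tasks
def execute(task):   return f"answer({task})"       # run one atomic task
def aggregate(rs):   return "; ".join(rs)           # fuse intermediate results
def verify(task, a): return a                       # check / refine the answer

def solve(goal, depth=0):
    task = atomize(goal)
    if depth >= MAX_DEPTH or is_atomic(task):
        return execute(task)                        # short-circuit simple tasks
    results = [solve(sub, depth + 1) for sub in plan(task)]
    return verify(task, aggregate(results))

print(solve("summarize receipt and total the charges"))
```

In the real system each of these roles is an LLM call (or tool invocation) rather than a string operation, but the control flow is the same: atomize, branch on complexity, recurse, aggregate, verify.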

How we built it

We built ROMA-VLM by extending the ROMA (Recursive Open Meta-Agents) framework with Vision Language Model capabilities, creating a multimodal AI system that processes both text and images through a recursive hierarchical structure. Our implementation leverages DSPy for language model interactions, LiteLLM for unified API access to multiple VLM providers via OpenRouter, and ROMA's BaseModule architecture to create five specialized multimodal agents: MultimodalAtomizer (decides if tasks need decomposition), MultimodalPlanner (breaks complex tasks into subtasks), MultimodalExecutor (processes atomic tasks with vision), MultimodalAggregator (synthesizes results from multiple subtasks), and MultimodalVerifier (validates outputs against image context).

The core innovation lies in our image processing pipeline, which automatically converts local image paths to base64-encoded data URIs with intelligent resizing (max 2048px dimensions) to prevent context window overflow, and our recursive image flow mechanism, which ensures images propagate through the entire task hierarchy—from atomization through execution to verification. We built custom multimodal signatures using DSPy's signature system, allowing each agent to accept both text goals and image inputs, and implemented configurable prediction strategies (Chain of Thought, ReAct, CodeAct) per agent for optimal performance across different use cases.

For enhanced capabilities, we integrated a vector-based memory system using Qdrant for persistent agentic memory, enabling the system to learn from previous interactions, and developed specialized domain agents (medical, travel, self-care, video analysis) with custom signature instructions and few-shot learning demos. Our benchmarking infrastructure uses the DocVQA dataset to evaluate both text-only and image+text query performance, measuring exact match and F1 scores.
The entire system is built as a zero-modification extension of ROMA—using it as a library without core changes—making it a drop-in replacement that maintains full compatibility with existing ROMA workflows while adding comprehensive vision capabilities.
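The memory side follows a standard embed-store-retrieve pattern. In our system Qdrant handles storage and similarity search; the class below is a deliberately tiny in-memory stand-in (our own `MemoryStore` name, plain cosine similarity) that shows the retrieval pattern without requiring a running Qdrant instance.

```python
# Toy stand-in for the Qdrant-backed agentic memory: store (embedding, fact)
# pairs, then retrieve the facts most similar to a query embedding so they
# can ground the next answer. Qdrant replaces this class in the real system.
import math

class MemoryStore:
    def __init__(self):
        self.items = []  # list of (embedding, text) pairs

    def add(self, embedding, text):
        self.items.append((embedding, text))

    def search(self, query, top_k=3):
        """Return the top_k stored texts by cosine similarity to the query."""
        def cosine(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            na = math.sqrt(sum(x * x for x in a))
            nb = math.sqrt(sum(y * y for y in b))
            return dot / (na * nb) if na and nb else 0.0
        ranked = sorted(self.items, key=lambda it: cosine(query, it[0]),
                        reverse=True)
        return [text for _, text in ranked[:top_k]]
```

In production the embeddings come from a real embedding model and each Qdrant collection (transactions, exercise, wellness_recipes, conversation memory) is queried the same way; the retrieved facts are injected into the agent's prompt before execution.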

Challenges we ran into

  • Building a reliable agentic memory system was harder than expected. Getting the assistant to store information accurately, retrieve it at the right time, and avoid “hallucinating memories” required a lot of iteration and debugging.
  • Coordinating multiple AI models (VLM, memory, reasoning) into one smooth pipeline was a major challenge. Each model had different requirements, speeds, and formats, so getting them to communicate without breaking the flow took time.
  • Training and sourcing the right datasets for our VLM and memory components was difficult. We had to balance quality, size, licensing, and relevance while keeping everything lightweight enough to run efficiently.
  • Implementing iterative and recursive calls with ROMA was complex: we ran into looping issues, token limits, and cases where the agent wouldn’t refine its answer correctly.
  • Designing with accessibility in mind required extra testing and thoughtful UI decisions, such as ensuring high-contrast visibility and dyslexia-friendly text.
  • Debugging multimodal interactions (especially mixing vision, text, memory) led to unexpected edge cases that took significant time to resolve.

Accomplishments that we're proud of

  • Built a working agentic memory system that can store user information, recall it contextually, and personalize future interactions, making the assistant feel genuinely intelligent and adaptive.
  • Implemented deep-research agents using ROMA, enabling iterative and recursive reasoning chains that break down complex problems and refine answers automatically.
  • Successfully combined multiple AI models, curated the right datasets, and connected everything into one cohesive multimodal pipeline.
  • Created our own vision-language model (VLM) and integrated it with the memory system so the assistant can understand and reason across text, images, and user history.
  • Focused on inclusivity from the start, adding features like high-contrast modes for color-blind users and dyslexia-friendly fonts.
  • Unified all components (memory, vision, and reasoning) into a single inclusive, AI-powered assistant, which is something we’re incredibly proud of achieving within the project timeline.

What we learned

Throughout this project, we learned a tremendous amount, both technically and conceptually. One of the biggest areas of learning was agentic memory: understanding how an AI system can store past interactions, recall them intelligently, and use that context to guide future actions. Building this taught us how to manage memory states, structure information, and design an assistant that truly remembers the user.

We also explored ROMA, deepening our understanding of how iterative and recursive calls can be used in a deep research workflow. This helped us build agents that can break down problems, refine answers step-by-step, and improve their outputs through self-guided reasoning.

On the machine learning side, we learned how to build and assemble different models, how to search for and curate specific datasets, and how to train or fine-tune components so they work together smoothly. We also gained experience in creating a VLM and using it alongside the memory system to make the assistant multimodal, adaptable, and context-aware.

Overall, this project pushed us to understand not just AI models, but how to connect them into one cohesive, intelligent, and inclusive system.

What's next for Inclusive

Moving forward, we want to take Inclusive to the next level by making it even smarter, more adaptive, and more accessible. Our first goal is to expand the agentic memory system so it can handle deeper reasoning, long-term personalization, and smarter context retention. We also plan to enhance our VLM with better image understanding, document parsing, and real-time multimodal interaction.

Accessibility will continue to guide every update. We want to add customizable accessibility presets and support for more neurodiverse-friendly UI options.

We’re also excited to integrate more advanced models, experiment with fine-tuned datasets, and explore on-device deployment for privacy-focused users. Eventually, we aim to open the platform to developers, allowing others to build new agents and features on top of Inclusive.

Our ultimate vision is to transform Inclusive into a truly universal AI assistant: one that adapts to every person, supports every ability, and grows smarter with every interaction.

Built With

  • alembic
  • amazon-web-services
  • anthropicclaude
  • asyncpg
  • binanceapi
  • black
  • coingeckoapi
  • defillamaapi
  • docker
  • dockercompose
  • dspy
  • e2b
  • exa
  • fastapi
  • fastmcp
  • googlegemini
  • gpt-4
  • httpx
  • huggingfacedatasets
  • ipython
  • jinja
  • jupyter
  • litellm
  • loguru
  • matplotlib
  • minio
  • mlflow
  • mypy
  • networkx
  • numpy
  • omegaconf
  • openai
  • openrouterapi
  • pandas
  • pillow
  • postgresql
  • pyarrow
  • pydantic
  • pytest
  • python-dotenv
  • python3.12
  • qdrant
  • requests
  • rich
  • roma
  • ruff
  • s3
  • serperapi
  • sqlalchemy
  • textual
  • typer
  • uvicorn
  • websocket