Inspiration

Over 2.5 billion people worldwide require assistive technology, yet many struggle with traditional web navigation. Millions of elderly users and individuals with mobility impairments face barriers when interacting with digital content. Our team was inspired to create AVA (Accessible Voice Assistant, pronounced a-va), a voice-powered assistant that enables seamless, hands-free browsing, ensuring that the internet remains accessible to everyone, regardless of physical limitations.

What it does

AVA listens for the wake phrase "Hey Ava," records the user's query, and executes a sequence of tasks in response. Whether it's sending an email or booking a flight from Point A to Point B, AVA enables hands-free interaction, making digital access more intuitive and accessible for everyone.

How we built it

AVA's frontend is built with PyGame, which serves as a glTF loader to render a 3D model while the app listens for the wake phrase "Hey Ava" using the speech_recognition library. Once triggered, the backend pipeline is set in motion.
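The wake-word stage can be sketched roughly as below. The regex helper is illustrative of the idea rather than our exact code, and the commented loop mirrors the real speech_recognition API (requires a microphone and `pip install SpeechRecognition`):

```python
import re

# Tolerate common transcription variants like "hey ava" or "Hey, Ava"
WAKE_WORD = re.compile(r"\bhey,?\s+ava\b", re.IGNORECASE)

def heard_wake_word(transcript: str) -> bool:
    """Return True when a transcribed phrase contains the wake phrase."""
    return bool(WAKE_WORD.search(transcript))

# Hedged sketch of the listening loop (function names match the real
# speech_recognition library; not runnable without audio hardware):
#
# import speech_recognition as sr
# recognizer = sr.Recognizer()
# with sr.Microphone() as source:
#     while True:
#         audio = recognizer.listen(source, phrase_time_limit=3)
#         try:
#             phrase = recognizer.recognize_google(audio)
#         except sr.UnknownValueError:
#             continue  # unintelligible audio; keep listening
#         if heard_wake_word(phrase):
#             break  # hand off to the recording/transcription pipeline
```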

For speech processing, AVA leverages Groq's transcription endpoint, which uses distil-whisper-large-v3-en to transcribe the recorded MP3 audio into text near-instantaneously. The transcript then passes through a sanitization phase, with Groq as our LLM provider, which restructures the task steps into well-defined JSON so that AVA's agentic workflow cannot misinterpret them.
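A minimal sketch of the sanitization side of that pipeline. The step schema (a JSON array of `{"action", "target"}` dicts) is an illustrative assumption, not AVA's exact format, and the commented transcription call mirrors the real Groq SDK shape:

```python
import json

REQUIRED_KEYS = {"action", "target"}  # hypothetical step schema

def sanitize_steps(raw: str) -> list[dict]:
    """Parse the LLM's JSON reply, keeping only well-formed steps."""
    steps = json.loads(raw)
    if not isinstance(steps, list):
        raise ValueError("expected a JSON array of steps")
    # Drop any step missing the required fields rather than letting a
    # malformed entry reach the agentic workflow
    return [s for s in steps if isinstance(s, dict) and REQUIRED_KEYS <= s.keys()]

# Transcription itself (hedged sketch; requires `pip install groq` and an
# API key in GROQ_API_KEY):
#
# from groq import Groq
# client = Groq()
# with open("query.mp3", "rb") as f:
#     text = client.audio.transcriptions.create(
#         file=f, model="distil-whisper-large-v3-en").text
```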

At its core, AVA is built on top of Browser Use, extending and enhancing its functionalities for more seamless agentic automation. Once the structured task steps are finalized, AVA executes each step methodically, providing real-time voice feedback to confirm actions and ensure a smooth hands-free experience.
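The execute-and-confirm loop above can be sketched as follows. Here `execute` and `speak` are injected stand-ins for the Browser Use agent and our voice-feedback layer (an illustrative simplification, not the real APIs):

```python
def run_steps(steps, execute, speak):
    """Run each structured step in order, announcing the outcome of each.

    steps   -- list of dicts produced by the sanitization phase
    execute -- callable(step) -> bool, standing in for the browser agent
    speak   -- callable(str), standing in for text-to-speech feedback
    """
    results = []
    for i, step in enumerate(steps, start=1):
        ok = execute(step)
        speak(f"Step {i} {'done' if ok else 'failed'}: {step['action']}")
        results.append(ok)
        if not ok:
            break  # stop on first failure rather than compounding errors
    return results
```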

To overcome the challenges posed by Groq's limited context window, we developed custom tooling that allows AVA to efficiently navigate websites and extract key information. This optimization enables AVA to process complex web pages and execute tasks effectively despite the context limitations.
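The idea behind that tooling can be sketched with only the standard library: strip a page down to its visible text and cap it at a prompt-sized budget before it ever reaches the LLM. The helper names and the budget size are illustrative, not our exact implementation:

```python
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collect visible text, skipping script and style blocks."""

    def __init__(self):
        super().__init__()
        self._skip = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.chunks.append(data.strip())

def page_digest(html: str, budget: int = 4000) -> str:
    """Reduce a page to a digest that fits the LLM's context budget."""
    parser = _TextExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)[:budget]
```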

Challenges we ran into

Initially, we planned to use Electron for the frontend but pivoted to PyGame to achieve a more lightweight and flexible implementation for rendering our 3D interactive model. This required us to restructure a significant portion of our UI and event-handling logic.

Additionally, during development, a merge conflict arose when a team member accidentally overwrote uncommitted changes after pulling the latest updates. In the process, our .env file containing sensitive API keys was pushed to the repository. To resolve the inconsistencies, we rebuilt the repository from scratch and made it private, then revoked the exposed keys and generated new ones. The incident reinforced the importance of strict version-control hygiene and better collaboration practices to prevent similar issues in the future.

A significant challenge was working within the constraints of Groq's context window. To address this, we developed custom tools that allow AVA to navigate websites more efficiently, focusing on extracting essential information without overwhelming the LLM's capacity.

Accomplishments that we're proud of

We're incredibly proud of the seamless real-time voice interaction we achieved with AVA, allowing users to control their browser completely hands-free. The integration of Groq's transcription model and LLM capabilities enabled blazing-fast speech-to-text conversion and efficient task processing, making AVA feel responsive and natural. Additionally, transitioning from Electron to PyGame was a major milestone, as it allowed us to create a lightweight yet visually engaging 3D model that actively listens and responds. Another accomplishment was optimizing our agentic workflow and developing custom tools to work around Groq's context limitations, ensuring that AVA understands and executes multi-step commands with minimal errors. Despite the challenges we faced with repository conflicts and security mishaps, we successfully rebuilt our system from the ground up and refined our architecture into a single, efficient pipeline.

What we learned

We deepened our understanding of API interactions, learning how to efficiently call and integrate services like Groq's endpoints while also developing our own FastAPI-based endpoints to streamline AVA's workflow. A major takeaway was working with Browser Use and Groq's LLM, as we had to extend functionality and develop custom tools to provide agents with more context, ensuring AVA could accurately understand and execute complex tasks within the constraints of the LLM's context window. Through these challenges, we refined our ability to build scalable, agent-driven systems that enhance user accessibility while optimizing for LLM performance.

What's next for AVA

We plan to enhance AVA's ability to recognize distinct voices, allowing it to apply user-specific context and preferences instantaneously. This would enable more personalized interactions tailored to each user's needs. Additionally, we aim to expand AVA's agentic capabilities, allowing it to handle more complex workflows seamlessly across multiple applications. We also intend to further refine our custom tooling for efficient web navigation and information extraction, potentially exploring ways to dynamically adjust our approach based on the complexity of different websites and tasks.
