ARIA

Inspiration

ARIA was born from the vision of making the digital world accessible to everyone. Inspired by the challenges faced by people with disabilities, we wanted to harness modern AI and cloud technologies to build a tool that lets users navigate any website using just their voice. The idea was to remove the barriers imposed by traditional interfaces and create a more inclusive, intuitive browsing experience.

What It Does

ARIA is a Chrome extension paired with a FastAPI backend that together enable voice-based navigation of any website. After clicking the "Start" button, users speak a command, which is recorded and processed. ARIA:

- Captures audio from the user and automatically uploads it to AWS S3.
- Utilizes AWS Transcribe to convert the audio into text.
- Calls the Gemini API to interpret the transcription and identify the desired action.
- Returns an actionable command (such as clicking a button or typing text), which is executed on the current webpage.

This seamless integration of voice input and automated web interactions empowers users to control web pages without traditional input devices.
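
As an illustration of the actionable command in the list above, here is a minimal sketch of how such a command might be represented and validated before it is executed on the page. The JSON field names (`action`, `selector`, `text`), the `ActionCommand` type, and the `parse_action` helper are assumptions made for this example; the writeup does not specify ARIA's actual response format.

```python
import json
from dataclasses import dataclass
from typing import Optional

# Hypothetical action set; the writeup mentions clicking and typing.
ALLOWED_ACTIONS = {"click", "type"}


@dataclass(frozen=True)
class ActionCommand:
    action: str                 # "click" or "type"
    selector: str               # element to act on (assumed to be a CSS selector)
    text: Optional[str] = None  # text to enter; only used for "type"


def parse_action(raw: str) -> ActionCommand:
    """Parse and validate a JSON command string of the assumed shape."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("command must be a JSON object")

    action = data.get("action")
    if action not in ALLOWED_ACTIONS:
        raise ValueError(f"unsupported action: {action!r}")

    selector = data.get("selector")
    if not isinstance(selector, str) or not selector:
        raise ValueError("missing selector")

    text = data.get("text")
    if action == "type" and not isinstance(text, str):
        raise ValueError("'type' action requires text")

    # Ignore any stray text on a click command.
    return ActionCommand(action, selector, text if action == "type" else None)
```

Validating the command before acting on it means a malformed or unexpected model reply is rejected rather than executed against the page.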

How We Built It

Chrome Extension:
- Developed a user-friendly extension with a simple UI that includes a “Start” button.
- The extension records user audio, extracts ARIA element information from the webpage, and sends both to the backend for processing.
- Utilized the Downloads API and Web Audio APIs to manage file downloads and real-time audio recording.
Backend with FastAPI:
- Built a FastAPI application that exposes a single endpoint to handle audio file submissions.
- The backend uploads the audio to an AWS S3 bucket, triggers an AWS Transcribe job to convert speech to text, and then sends the transcription along with ARIA data to Gemini to determine the correct user action.
- The response from Gemini is sent back to the Chrome extension, which automates the corresponding interaction (e.g., clicking a button or typing into a field).
Cloud Integration & AI:
- AWS S3 is used for scalable storage of audio recordings.
- AWS Transcribe efficiently converts speech to text.
- Google’s Gemini API powers the decision-making process by interpreting user commands and identifying the appropriate UI elements.
Security & Permissions:
- We carefully managed user permissions (microphone and downloads) and ensured secure interactions with cloud services.
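
The backend flow described above (store the audio, transcribe it, then ask the model for an action) can be sketched as a small orchestration function. This is an illustration only, not ARIA's actual code: the three steps are injected as plain callables so the sketch runs without AWS or Gemini credentials, and every name here (`handle_command`, `build_prompt`, the `fake_*` stubs, the prompt wording, the example bucket URI) is hypothetical.

```python
from typing import Callable, Dict, List


def build_prompt(transcript: str, aria_elements: List[Dict[str, str]]) -> str:
    """Combine the transcript and the page's ARIA data into one prompt (assumed format)."""
    lines = [f"User said: {transcript}", "Available elements:"]
    for el in aria_elements:
        lines.append(f"- role={el.get('role', '')} label={el.get('label', '')}")
    lines.append("Reply with the single best action.")
    return "\n".join(lines)


def handle_command(
    audio: bytes,
    aria_elements: List[Dict[str, str]],
    upload: Callable[[bytes], str],     # stores audio, returns its location
    transcribe: Callable[[str], str],   # location -> transcript text
    interpret: Callable[[str], str],    # prompt -> model reply
) -> str:
    """Run the three steps in order and return the model's reply."""
    if not audio:
        raise ValueError("empty audio payload")
    location = upload(audio)
    transcript = transcribe(location)
    return interpret(build_prompt(transcript, aria_elements))


# Stand-in implementations so the sketch is runnable offline.
def fake_upload(audio: bytes) -> str:
    return "s3://example-bucket/command.webm"


def fake_transcribe(location: str) -> str:
    return "click the search button"


def fake_interpret(prompt: str) -> str:
    if "label=Search" in prompt:
        return '{"action": "click", "label": "Search"}'
    return "{}"


result = handle_command(
    b"\x00\x01",
    [{"role": "button", "label": "Search"}],
    fake_upload,
    fake_transcribe,
    fake_interpret,
)
```

In a real deployment, `upload` and `transcribe` would wrap the AWS SDK and `interpret` would wrap the Gemini client. Passing them in as parameters keeps the flow easy to test in isolation.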

Challenges We Ran Into

Chrome Extension Contexts:
Adapting to Manifest V3 and managing execution contexts (popups vs. background scripts) required significant experimentation and creative problem-solving.
Asynchronous Workflows:
Coordinating the asynchronous processes of audio recording, S3 uploads, transcription, and API calls was complex, particularly in error handling and ensuring a smooth user experience.
User Permissions and Privacy:
Balancing the need for microphone access and file downloads with user privacy and security settings was challenging, as it required working within strict browser policies.
Cloud Service Integration:
Integrating multiple cloud services (S3, Transcribe, and Gemini) into a coherent and efficient pipeline posed technical hurdles, especially around managing API credentials and ensuring low-latency responses.
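
One recurring pattern behind the asynchronous coordination described above is waiting for a long-running job, such as a transcription, without blocking other work or waiting forever. The sketch below polls a status function with exponential backoff and an overall timeout, using only `asyncio`. The `wait_for_job` and `make_fake_status` helpers, the status strings, and the timing defaults are assumptions for illustration rather than ARIA's actual implementation.

```python
import asyncio
from typing import Awaitable, Callable, Iterable


async def wait_for_job(
    get_status: Callable[[], Awaitable[str]],
    timeout: float = 30.0,
    initial_delay: float = 0.5,
    max_delay: float = 5.0,
) -> str:
    """Poll until the job reports COMPLETED; raise on FAILED or on timeout."""

    async def poll() -> str:
        delay = initial_delay
        while True:
            status = await get_status()
            if status == "COMPLETED":
                return status
            if status == "FAILED":
                raise RuntimeError("job failed")
            await asyncio.sleep(delay)
            delay = min(delay * 2, max_delay)  # exponential backoff, capped

    # asyncio.wait_for cancels the polling loop if the deadline passes.
    return await asyncio.wait_for(poll(), timeout=timeout)


def make_fake_status(states: Iterable[str]) -> Callable[[], Awaitable[str]]:
    """Build a stand-in status function that yields the given states in order."""
    it = iter(states)

    async def get_status() -> str:
        return next(it)

    return get_status


status = asyncio.run(
    wait_for_job(
        make_fake_status(["IN_PROGRESS", "IN_PROGRESS", "COMPLETED"]),
        timeout=2.0,
        initial_delay=0.01,
    )
)
```

Separating the failure case (`RuntimeError`) from the timeout case (`asyncio.TimeoutError`) lets the caller show the user a different message for each.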

Accomplishments That We're Proud Of

Accessibility Innovation:
ARIA uses voice commands to make web navigation more accessible for people with disabilities, without requiring a keyboard or mouse.
Seamless Integration:
We successfully integrated modern web technologies and cloud services into a single, user-friendly system that processes voice commands in near real-time.
Robust Backend Architecture:
Our FastAPI backend effectively orchestrates multiple AWS services and the Gemini API, demonstrating a scalable approach to real-time audio processing and action automation.
User-Centered Design:
Despite the technical challenges, we kept the user experience at the forefront, resulting in an intuitive interface that simplifies complex tasks into a single click.

What We Learned

Technical Proficiency:
We deepened our understanding of Chrome Extension development, cloud service integration, and AI-driven natural language processing.
Project Management:
Coordinating multiple technologies and managing asynchronous workflows taught us the importance of careful planning and iterative development.
User-Centric Design:
Working on accessibility forced us to consider real-world user needs and design with empathy, ensuring our solution is both innovative and practical.
Resilience:
Every challenge, from handling browser permission quirks to integrating disparate cloud APIs, was an opportunity to learn, adapt, and improve our approach.

What's Next for ARIA

Seamless Voice Interaction:
We intend to implement a trigger-word feature, similar to those in Siri, Alexa, and Google Assistant, that automatically starts listening for user commands. This will create a more natural, hands-free experience.
Reduced Latency & Faster Automation:
We plan to optimize the entire pipeline, from audio capture and processing to action execution, by streamlining backend processing and exploring low-latency communication protocols. The goal is near-instantaneous automation.
Enhanced Command Interpretation:
We plan to refine the Gemini API integration to handle a broader range of voice commands and complex interactions, making ARIA even more versatile.
User Customization:
Future updates may allow users to customize command mappings and tailor ARIA’s behavior to their individual needs.
Expanded Browser Support:
We aim to extend ARIA’s capabilities beyond Chrome, potentially supporting other browsers to reach a wider audience.
Continuous Accessibility Improvements:
By incorporating feedback from users with disabilities, we will continue to evolve ARIA, ensuring it remains at the forefront of accessible technology.