Voxa AI

Voxa AI — Voice Meets Control

What if someone with no hands could control a computer — precisely, independently, and fully — using only their voice and AI?
Voxa AI was built to make that question obsolete.

Voxa AI is an AI-powered voice agent that gives people with limited hand mobility full control over their computers: not just browsers or apps, but the entire OS.
It is already working, and it is not a prototype. This is fully functional, production-ready software that can be used every day.

What It Does

Voxa AI is a voice-first desktop agent, designed for hands-free precision control.
The current version lets users:

Click anywhere on screen using a custom two-step visual recognition grid
Speak naturally and have their intent understood
Execute macros and custom actions, fully by voice
Leverage Gemini for both reasoning and UI recognition
Switch to a non-smart mode for manual control without AI
Change theme and adjust flexible settings to fit personal preferences
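
As a rough illustration of how voice macros and custom actions like those listed above could be wired up, here is a minimal sketch. The `MacroRegistry` class and its methods are hypothetical names, not Voxa AI's actual code, and the actions return a description string instead of driving the OS so the sketch runs anywhere.

```python
from typing import Callable, Dict, Optional


class MacroRegistry:
    """Maps spoken phrases to actions (hypothetical design, not Voxa AI's API)."""

    def __init__(self) -> None:
        self._macros: Dict[str, Callable[[], str]] = {}

    @staticmethod
    def _normalize(phrase: str) -> str:
        # Lowercase and collapse whitespace so "Copy  That" matches "copy that".
        return " ".join(phrase.lower().split())

    def register(self, phrase: str, action: Callable[[], str]) -> None:
        """Bind a spoken phrase to an action."""
        self._macros[self._normalize(phrase)] = action

    def dispatch(self, transcript: str) -> Optional[str]:
        """Run the macro whose phrase matches the transcript; None if no match."""
        action = self._macros.get(self._normalize(transcript))
        return action() if action is not None else None


registry = MacroRegistry()
# A real action might call pyautogui.hotkey("ctrl", "c") instead of returning text.
registry.register("Copy That", lambda: "hotkey ctrl+c")
print(registry.dispatch("copy that"))
```

Exact phrase matching keeps the sketch short. A transcript that matches no macro returns `None`, which is the point where a smarter mode could hand the utterance to the language model.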

Installation is simple — just run the installer and you’re ready to go.

How It Works

The system uses:

Python for backend logic and control flow
Gemini 2.5 Flash for language understanding and visual UI analysis
PyAutoGUI for screen control and input
Google Speech API for real-time speech recognition
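
The components above could fit together roughly as in the sketch below. It assumes the third-party SpeechRecognition package (`recognize_google`) as a stand-in for the Google Speech API and uses `pyautogui.click` for input. The function names and the injected `interpret` callable, which stands in for the Gemini call, are hypothetical. The third-party imports sit inside the functions so the control flow can be exercised without a microphone or display.

```python
from typing import Callable, Optional, Tuple

# interpret: transcript -> (x, y) screen target, or None if nothing to click.
Interpreter = Callable[[str], Optional[Tuple[int, int]]]
Actor = Callable[[int, int], None]


def listen_once() -> str:
    """Capture one utterance from the default microphone and transcribe it."""
    import speech_recognition as sr  # assumption: SpeechRecognition package

    recognizer = sr.Recognizer()
    with sr.Microphone() as source:
        recognizer.adjust_for_ambient_noise(source, duration=0.5)
        audio = recognizer.listen(source)
    try:
        return recognizer.recognize_google(audio)
    except (sr.UnknownValueError, sr.RequestError):
        return ""  # unintelligible speech or service unavailable


def click_at(x: int, y: int) -> None:
    """Move the pointer to (x, y) and click."""
    import pyautogui

    pyautogui.click(x=x, y=y)


def handle_utterance(transcript: str, interpret: Interpreter, act: Actor) -> bool:
    """Run one listen -> understand -> act step; return True if an action ran."""
    if not transcript.strip():
        return False
    target = interpret(transcript)
    if target is None:
        return False
    act(*target)
    return True


# Intended wiring (my_interpreter is a hypothetical Gemini-backed function):
# handle_utterance(listen_once(), my_interpreter, click_at)
```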

At the core is a custom-built dual-grid vision targeting system:

The screen is divided into a coarse grid
Gemini identifies the target cell
A finer grid is drawn within that cell
The system clicks with pixel-level accuracy

Because targeting works from screenshots alone, this architecture lets the AI interact with any UI element on any screen, without app-specific training data or model fine-tuning.
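
The geometry behind the coarse-then-fine targeting can be sketched as follows. This is a minimal illustration, not the project's implementation: the 8x8 grid size, the `Rect` class, and `two_step_target` are assumptions, and the cell choices that the model would supply are passed in as plain (row, column) pairs.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Rect:
    """An axis-aligned screen region in pixels."""
    left: float
    top: float
    width: float
    height: float

    def cell(self, row: int, col: int, rows: int, cols: int) -> "Rect":
        """Return the sub-rectangle at (row, col) of a rows x cols grid."""
        if not (0 <= row < rows and 0 <= col < cols):
            raise ValueError("cell index outside grid")
        w, h = self.width / cols, self.height / rows
        return Rect(self.left + col * w, self.top + row * h, w, h)

    def center(self) -> Tuple[int, int]:
        """Pixel coordinates of the rectangle's center, truncated to ints."""
        return (int(self.left + self.width / 2), int(self.top + self.height / 2))


def two_step_target(
    screen: Rect,
    coarse: Tuple[int, int],
    fine: Tuple[int, int],
    rows: int = 8,  # assumed grid size; the real value is not stated
    cols: int = 8,
) -> Tuple[int, int]:
    """Pick a coarse cell, subdivide it, and return the fine cell's center."""
    coarse_cell = screen.cell(coarse[0], coarse[1], rows, cols)
    fine_cell = coarse_cell.cell(fine[0], fine[1], rows, cols)
    return fine_cell.center()


screen = Rect(0, 0, 1920, 1080)
print(two_step_target(screen, coarse=(2, 3), fine=(4, 5)))  # (885, 345)
```

With these assumed sizes, each fine cell on a 1920x1080 screen is 30 pixels wide and about 17 pixels tall, so clicking its center lands within roughly 15 pixels of any point in the cell. A denser grid at either level narrows that further.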

Why It Matters

We built a tool that works now — for real people, in real scenarios.
Voxa AI listens, understands, and acts — not just in chat, but on your desktop.
It doesn’t simulate assistance — it delivers it, reliably, in a form you can install and start using immediately.

What We Learned

You don’t need brain implants to make machines listen — just the right architecture
AI’s real power isn’t in talking — it’s in doing
Users with disabilities don’t want pity or gimmicks — they need power on their terms
Gemini can understand UIs from screenshots — if you prompt it precisely enough
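
To make the last point concrete, here is one way a precise prompt and a strict reply parser might look. The prompt wording, the letter-plus-number cell labels, and both function names are hypothetical; the actual prompts are not shown here, and the model call itself is left out.

```python
import re
from typing import Optional, Tuple


def build_cell_prompt(target: str, rows: int, cols: int) -> str:
    """Build a prompt that constrains the model to a single cell label."""
    last_col = chr(ord("A") + cols - 1)
    return (
        f"The screenshot is overlaid with a grid of {rows} rows and {cols} columns. "
        f"Columns are labeled A to {last_col} and rows are labeled 1 to {rows}. "
        f"Reply with only the label of the cell that contains: {target}. "
        "Use the form <column letter><row number>, for example B3. "
        "If the target is not visible, reply NONE."
    )


def parse_cell_label(reply: str, rows: int, cols: int) -> Optional[Tuple[int, int]]:
    """Convert a reply such as 'C4' to zero-based (row, col); None if invalid."""
    match = re.fullmatch(r"\s*([A-Za-z])(\d{1,2})\s*", reply)
    if match is None:
        return None
    col = ord(match.group(1).upper()) - ord("A")
    row = int(match.group(2)) - 1
    if 0 <= row < rows and 0 <= col < cols:
        return (row, col)
    return None


print(build_cell_prompt("the Save button", rows=8, cols=8))
print(parse_cell_label("C4", rows=8, cols=8))  # (3, 2)
```

Rejecting anything that is not exactly one valid label means a vague or chatty reply becomes a retry rather than a misplaced click.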