Inspiration
The inspiration for Ultron came from a simple question: Why are our AI assistants trapped in a tab? We realized that while LLMs are incredibly smart, they often feel disconnected from our actual browsing habits. We wanted to build an assistant that doesn't just "talk" but actually executes. Our goal was to create a sleek, "HUD-like" interface that allows you to control your web experience through natural language—making the browser feel like a living, breathing extension of your intent.
What it does
Ultron is a multimodal AI command center built to streamline your digital life.
- Natural Intent Execution: Say "open the place where I watch videos" and Ultron instantly launches YouTube.
- Custom Keywords: Define personal shortcuts like `hamburger → instagram.com` for instant, non-AI navigation.
- Multimodal Vision: Click the camera icon to let Ultron "see" your world and analyze snapshots in real time.
- Voice Intelligence: Full Speech-to-Text (STT) and Text-to-Speech (TTS) support for a hands-free, interactive experience.
- Persistent Private Data: Secure, per-user storage for chat history and personal keywords.
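The custom-keyword path can be as simple as a map lookup that runs before any AI call ever fires. A minimal sketch of that fast path (names here are illustrative, not Ultron's actual implementation):

```javascript
// Sketch of the non-AI keyword fast path (illustrative names, not
// Ultron's actual code). An exact match short-circuits the AI entirely.
const keywords = new Map([
  ["hamburger", "https://instagram.com"],
]);

// Returns a URL for an exact keyword match, or null to fall through to the AI.
function resolveKeyword(input) {
  const key = input.trim().toLowerCase();
  return keywords.get(key) ?? null;
}
```

Because the lookup is checked first, keyword navigation stays instant and free even when the AI backend is slow or unavailable.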
How we built it
We built Ultron with a focus on speed and security:
- AI Engine: Powered by Google Gemini 2.0 Flash via the `@google/generative-ai` SDK for low-latency multimodal reasoning.
- Frontend: A custom-built, glassmorphic UI using Vanilla JavaScript (ES6) and modern CSS3. We opted for a framework-less approach to ensure near-instant load times.
- Backend: A secure Node.js/Express server that handles API interactions, file processing with `Multer`, and persistence logic.
- Deployment: The entire system is containerized with Docker and deployed on Google Cloud Run for serverless scalability.
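Containerizing a Node/Express server for Cloud Run typically needs only a small Dockerfile. A generic sketch (the `server.js` entry point is an assumption, not Ultron's confirmed file layout):

```dockerfile
# Generic sketch of a Cloud Run-ready Node.js container (illustrative,
# not Ultron's exact Dockerfile).
FROM node:20-slim
WORKDIR /app
COPY package*.json ./
RUN npm ci --omit=dev
COPY . .
# Cloud Run injects the listening port via the PORT env variable (default 8080).
ENV PORT=8080
EXPOSE 8080
CMD ["node", "server.js"]
```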
Challenges we ran into
One major challenge was intent mapping: ensuring the AI could distinguish a general question from a command to open a specific website required precise system prompt engineering. We also faced hurdles with asynchronous media handling, synchronizing voice transcription, image snapshots, and streaming AI responses without blocking the main UI thread. Finally, we had to balance Cloud Run's stateless model against the need for user persistence, ultimately building a lean, file-based data layer that keeps user history intact between deployments.
Accomplishments that we're proud of
- Fluid Performance: Achieving a "human-like" typing experience through streaming responses that make the AI feel alive.
- Multimodal Cohesion: Successfully merging vision and voice into a single interface that doesn't feel cluttered.
- Command Accuracy: Building a system that accurately understands "fuzzy" requests and maps them to the correct web destinations.
- Zero-Dependency UI: Creating a premium design from scratch without relying on heavy external UI kits.
What we learned
We discovered that Gemini 2.0 Flash is exceptionally good at "zero-shot" decision-making when given a list of available tools (commands). We also learned that modern Web APIs (like Web Speech and MediaDevices) are powerful enough to build sophisticated multimodal apps without needing proprietary desktop libraries. Build-wise, we reaffirmed that Vanilla JS often results in a better, snappier user experience than heavy frameworks when performance is priority #1.
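Handing the model a list of available tools can be as simple as interpolating command names into the system prompt. A hedged sketch of that pattern (the commands and wording are made up for illustration, not Ultron's real prompt):

```javascript
// Sketch of zero-shot command routing via the system prompt (illustrative
// commands and wording, not Ultron's actual prompt).
const commands = [
  { name: 'open_url', description: 'Open a website in a new browser tab' },
  { name: 'answer', description: 'Answer a general question in chat' },
];

function buildSystemPrompt(cmds) {
  const list = cmds
    .map((c) => `- ${c.name}: ${c.description}`)
    .join('\n');
  return (
    'You are a browser assistant. Choose exactly one command and reply ' +
    'with JSON of the form {"command": "...", "argument": "..."}.\n' +
    'Available commands:\n' + list
  );
}
```

Constraining the reply to a fixed JSON shape makes the model's decision trivial to parse and keeps "open YouTube" and "explain YouTube's history" on cleanly separate code paths.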
What's next for Ultron
We see Ultron evolving into a "Proactive Agent."
- Action Layers: Moving beyond just opening URLs to performing specific actions within sites (e.g., "Post this code snippet to GitHub").
- External API Sync: Integrating with Google Calendar and Gmail to provide a truly unified assistant experience.
- Contextual Learning: Automatically suggesting custom keywords based on a user's most frequent actions.
- Mobile Companion: Bringing the Ultron experience to mobile browsers via a specialized PWA or Extension.
Built With
- css3
- docker
- dotenv
- express.js
- github
- google-ai-sdk
- google-cloud
- google-cloud-run
- google-gemini-2.0-flash
- html5
- javascript
- json
- media-devices-api
- multer
- node.js
- vanilla-javascript
- web-api
- web-speech-api