Inspiration
Every developer knows the cycle: open ChatGPT, type a prompt, copy the code, paste it into your editor, fix the errors, repeat. We spend more time managing AI output than actually building things.
Then I thought about developers who physically can't type for hours — those dealing with RSI, carpal tunnel, or accessibility challenges. The keyboard has been the only interface to programming for 50 years. Why?
I asked myself: what if coding felt as natural as having a conversation? What if you could speak your ideas, approve plans with a gesture, and even sketch components in the air?
That question became Vibe Architect.
What it does
Vibe Architect is a multimodal AI coding assistant with four input modes:
- Voice — Say "Hey Vibe, create a login form" and the AI agent generates a full execution plan
- Gestures — Thumbs up to approve, thumbs down to reject, peace sign to modify — all detected in real time through your webcam
- Air-Drawing — Point your finger and draw shapes in the air: circles become buttons, rectangles become cards, triangles become alerts
- Text — Traditional CLI input for when you prefer typing
The AI doesn't just generate code blindly. It follows a human-in-the-loop workflow:
Voice Command → AI Plans → Human Reviews → Gesture Approval → Code Generated
Everything runs with full privacy — voice recognition is 100% local using Whisper, and gesture detection runs on-device through MediaPipe.
How I built it
The system is built in two layers: an agent core and a multimodal input layer.
Agent Core
The AI agent runs on LangGraph with a stateful workflow: Discovery → Planning → Human Review → Execution → Validation. PostgreSQL stores session history, and Redis handles agent-state checkpointing.
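The staged, human-in-the-loop flow can be illustrated in plain Python — this is a stand-in for the actual LangGraph graph, and the stage handler interface is an assumption, not the project's real code:

```python
# Minimal sketch of the agent's staged workflow (a plain-Python stand-in for
# the LangGraph StateGraph; stage names mirror the pipeline described above).

STAGES = ["discovery", "planning", "human_review", "execution", "validation"]

def run_workflow(state: dict, handlers: dict) -> dict:
    """Run each stage in order; a rejected human review loops back to planning."""
    i = 0
    while i < len(STAGES):
        stage = STAGES[i]
        state = handlers[stage](state)
        if stage == "human_review" and not state.get("approved"):
            i = STAGES.index("planning")   # plan rejected: re-plan and re-review
            continue
        i += 1
    return state
```

In the real system, checkpointing the state between stages (Redis, per the description) is what lets the workflow pause at Human Review and resume after a gesture arrives.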
Multimodal Input Layer
- Voice: Faster-Whisper runs locally for speech-to-text with custom wake word detection ("Hey Vibe")
- Gestures: MediaPipe hand tracking classifies 7 gestures (thumbs up, thumbs down, victory, pinch, fist, open palm, point) using landmark analysis
- Air-Drawing: OpenCV tracks fingertip movement, a shape detector classifies drawn paths into geometric shapes, which map to UI components
- All three inputs feed into a unified WebSocket pipeline
What I learned
- Multimodal interaction design is fundamentally different from single-input design — you have to handle conflicts (what if voice says "approve" while gesture says "reject"?)
- Local ML inference is surprisingly fast — Whisper and MediaPipe both run in real time on a CPU, without a GPU
- Human-in-the-loop AI is not just a safety feature, it's a better UX — developers trust the output more when they explicitly approve the plan
- Dependency management matters — a single version mismatch between MediaPipe, protobuf, and Python can silently break an entire system
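One way to handle the voice-says-approve-while-gesture-says-reject conflict mentioned above is an explicit resolution policy: group inputs that arrive close together, and re-confirm with the user when they disagree. The window length and tie-breaking rule here are my illustrative assumptions, not necessarily what Vibe Architect does:

```python
# Sketch of a conflict policy for simultaneous multimodal inputs: signals
# arriving within a short window are treated as one intent, and conflicting
# decisions fall back to asking the user instead of guessing.
from dataclasses import dataclass

WINDOW_S = 1.5  # signals closer together than this count as one intent

@dataclass
class Signal:
    source: str      # "voice" | "gesture"
    decision: str    # "approve" | "reject"
    timestamp: float

def resolve(signals: list[Signal]) -> str:
    latest = signals[-1].timestamp
    recent = [s for s in signals if latest - s.timestamp <= WINDOW_S]
    decisions = {s.decision for s in recent}
    if len(decisions) == 1:
        return decisions.pop()      # all recent modalities agree
    return "ask_user"               # conflicting intent: re-confirm explicitly
```

Never silently picking a winner keeps the human-in-the-loop guarantee intact: an ambiguous approval is treated as no approval.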
What's next
- VS Code extension for in-editor multimodal interaction
- Custom gesture training — let users define their own gestures
- Mobile companion app for remote gesture approval
Built With
- fastapi
- hitl
- langgraph
- openai
- rich