Inspiration
Every developer knows the cycle: open ChatGPT, type a prompt, copy the code, paste it into your editor, fix the errors, repeat. We spend more time managing AI output than actually building things.
Then I thought about developers who physically can't type for hours — those dealing with RSI, carpal tunnel, or accessibility challenges. The keyboard has been the only interface to programming for 50 years. Why?
I asked myself: what if coding felt as natural as having a conversation? What if you could speak your ideas, approve plans with a gesture, and even sketch components in the air?
That question became Vibe Architect.
What it does
Vibe Architect is a multimodal AI coding assistant with four input modes:
- Voice — Say "Hey Vibe, create a login form" and the AI agent generates a full execution plan
- Gestures — Thumbs up to approve, thumbs down to reject, peace sign to modify — all detected in real time through your webcam
- Air-Drawing — Point your finger and draw shapes in the air: circles become buttons, rectangles become cards, triangles become alerts
- Text — Traditional CLI input for when you prefer typing
The AI doesn't just generate code blindly. It follows a human-in-the-loop workflow:
Voice Command → AI Plans → Human Reviews → Gesture Approval → Code Generated
Everything runs with full privacy — voice recognition is 100% local using Whisper, and gesture detection runs on-device through MediaPipe.
How I built it
The system is built in two layers: an agent core and a multimodal input layer.
Agent Core
The AI agent runs on LangGraph with a stateful workflow: Discovery → Planning → Human Review → Execution → Validation. PostgreSQL stores session history, and Redis handles agent-state checkpointing.
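The staged, human-in-the-loop flow can be illustrated in plain Python — this is a stand-in for the actual LangGraph graph, and the stage handler interface is an assumption, not the project's real code:

```python
# Minimal sketch of the agent's staged workflow (a plain-Python stand-in for
# the LangGraph StateGraph; stage names mirror the pipeline described above).

STAGES = ["discovery", "planning", "human_review", "execution", "validation"]

def run_workflow(state: dict, handlers: dict) -> dict:
    """Run each stage in order; a rejected human review loops back to planning."""
    i = 0
    while i < len(STAGES):
        stage = STAGES[i]
        state = handlers[stage](state)
        if stage == "human_review" and not state.get("approved"):
            i = STAGES.index("planning")   # plan rejected: re-plan and re-review
            continue
        i += 1
    return state
```

In the real system, checkpointing the state between stages (Redis, per the description) is what lets the workflow pause at Human Review and resume after a gesture arrives.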
Multimodal Input Layer
- Voice: Faster-Whisper runs locally for speech-to-text with custom wake word detection ("Hey Vibe")
- Gestures: MediaPipe hand tracking classifies 7 gestures (thumbs up, thumbs down, victory, pinch, fist, open palm, point) using landmark analysis
- Air-Drawing: OpenCV tracks fingertip movement, a shape detector classifies drawn paths into geometric shapes, which map to UI components
- All three inputs feed into a unified WebSocket pipeline
What I learned
- Multimodal interaction design is fundamentally different from single-input design — you have to handle conflicts (what if voice says "approve" while gesture says "reject"?)
- Local ML inference is surprisingly fast — Whisper and MediaPipe both run in real time on a CPU, without a GPU
- Human-in-the-loop AI is not just a safety feature, it's a better UX — developers trust the output more when they explicitly approve the plan
- Dependency management matters — a single version mismatch between MediaPipe, protobuf, and Python can silently break an entire system
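One way to handle the voice-says-approve-while-gesture-says-reject conflict mentioned above is an explicit resolution policy: group inputs that arrive close together, and re-confirm with the user when they disagree. The window length and tie-breaking rule here are my illustrative assumptions, not necessarily what Vibe Architect does:

```python
# Sketch of a conflict policy for simultaneous multimodal inputs: signals
# arriving within a short window are treated as one intent, and conflicting
# decisions fall back to asking the user instead of guessing.
from dataclasses import dataclass

WINDOW_S = 1.5  # signals closer together than this count as one intent

@dataclass
class Signal:
    source: str      # "voice" | "gesture"
    decision: str    # "approve" | "reject"
    timestamp: float

def resolve(signals: list[Signal]) -> str:
    latest = signals[-1].timestamp
    recent = [s for s in signals if latest - s.timestamp <= WINDOW_S]
    decisions = {s.decision for s in recent}
    if len(decisions) == 1:
        return decisions.pop()      # all recent modalities agree
    return "ask_user"               # conflicting intent: re-confirm explicitly
```

Never silently picking a winner keeps the human-in-the-loop guarantee intact: an ambiguous approval is treated as no approval.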
What's next
- VS Code extension for in-editor multimodal interaction
- Custom gesture training — let users define their own gestures
- Mobile companion app for remote gesture approval
Built With
- fastapi
- hitl
- langgraph
- openai
- rich