Inspiration

Every developer knows the cycle: open ChatGPT, type a prompt, copy the code, paste it into your editor, fix the errors, repeat. We spend more time managing AI output than actually building things.

Then I thought about developers who physically can't type for hours — those dealing with RSI, carpal tunnel, or accessibility challenges. The keyboard has been the only interface to programming for 50 years. Why?

I asked myself: what if coding felt as natural as having a conversation? What if you could speak your ideas, approve plans with a gesture, and even sketch components in the air?

That question became Vibe Architect.

What it does

Vibe Architect is a multimodal AI coding assistant with four input modes:

  • Voice — Say "Hey Vibe, create a login form" and the AI agent generates a full execution plan
  • Gestures — Thumbs up to approve, thumbs down to reject, peace sign to modify — all detected in real time through your webcam
  • Air-Drawing — Point your finger and draw shapes in the air: circles become buttons, rectangles become cards, triangles become alerts
  • Text — Traditional CLI input for when you prefer typing

The AI doesn't just generate code blindly. It follows a human-in-the-loop workflow:

Voice Command → AI Plans → Human Reviews → Gesture Approval → Code Generated
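The approval gate in that flow can be sketched in a few lines. The function name and gesture labels here are illustrative, not the project's actual API:

```python
def review_plan(gesture: str) -> str:
    """Map a detected gesture to a workflow decision (illustrative names)."""
    decisions = {
        "thumbs_up": "approved",    # proceed to code generation
        "thumbs_down": "rejected",  # discard the plan
        "victory": "modify",        # ask the agent to revise the plan
    }
    return decisions.get(gesture, "pending")  # unknown gesture: keep waiting
```

The key property is that nothing gets generated until an explicit "approved" decision comes back from the human.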

Everything runs with full privacy — voice recognition is 100% local using Whisper, and gesture detection runs on-device through MediaPipe.

How I built it

The system is built in layers.

AI Agent Core

The AI agent runs on LangGraph with a stateful workflow: Discovery → Planning → Human Review → Execution → Validation. PostgreSQL stores session history, and Redis handles agent-state checkpointing.
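Stripped of the LangGraph machinery, that control flow is roughly the state machine below — a plain-Python sketch, not the real graph (which carries agent state and Redis checkpoints between nodes):

```python
# Plain-Python sketch of the five-phase agent workflow. In the actual
# system these are LangGraph nodes; the control flow is the same idea.

PHASES = ["discovery", "planning", "human_review", "execution", "validation"]

def next_phase(current: str, approved: bool = True) -> str:
    """Advance the workflow; a rejection at human_review loops back to planning."""
    if current == "human_review" and not approved:
        return "planning"
    i = PHASES.index(current)
    return PHASES[i + 1] if i + 1 < len(PHASES) else "done"
```

The loop back from review to planning is what makes the workflow human-in-the-loop rather than fire-and-forget.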

Multimodal Input Layer

  • Voice: Faster-Whisper runs locally for speech-to-text with custom wake word detection ("Hey Vibe")
  • Gestures: MediaPipe hand tracking classifies 7 gestures (thumbs up, thumbs down, victory, pinch, fist, open palm, point) using landmark analysis
  • Air-Drawing: OpenCV tracks fingertip movement; a shape detector classifies the drawn path into a geometric shape, which maps to a UI component
  • All three inputs feed into a unified WebSocket pipeline
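To make the landmark-analysis step concrete, here is one possible heuristic for thumbs-up detection. MediaPipe returns 21 normalized (x, y) hand landmarks with y increasing downward; index 4 is the thumb tip, 8/12/16/20 are the other fingertips, and 5/9/13/17 their knuckles. The thresholding logic is an illustrative sketch, not the project's exact classifier:

```python
def is_thumbs_up(landmarks: list[tuple[float, float]]) -> bool:
    """Heuristic thumbs-up check over MediaPipe's 21 (x, y) hand landmarks."""
    thumb_tip_y = landmarks[4][1]
    # Thumb extended upward: its tip sits above every finger knuckle.
    thumb_up = all(thumb_tip_y < landmarks[k][1] for k in (5, 9, 13, 17))
    # Other fingers curled: each fingertip sits below its own knuckle.
    fingers_curled = all(landmarks[t][1] > landmarks[t - 3][1]
                         for t in (8, 12, 16, 20))
    return thumb_up and fingers_curled
```

The other six gestures fall out of similar geometric tests on the same landmark set.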

What I learned

  • Multimodal interaction design is fundamentally different from single-input design — you have to handle conflicts (what if voice says "approve" while gesture says "reject"?)
  • Local ML inference is surprisingly fast — Whisper and MediaPipe both run in real time on CPU, no GPU required
  • Human-in-the-loop AI is not just a safety feature, it's a better UX — developers trust the output more when they explicitly approve the plan
  • Dependency management matters — a single version mismatch between MediaPipe, protobuf, and Python can break an entire system silently

What's next

  • VS Code extension for in-editor multimodal interaction
  • Custom gesture training — let users define their own gestures
  • Mobile companion app for remote gesture approval

Built With

  • fastapi
  • hitl
  • langgraph
  • openai
  • rich
