Inspiration

Every shift on a construction site starts the same way: an inspector walks a machine with a clipboard, checks boxes, writes notes by hand, and files a report that can take 20 minutes per machine. Damage gets missed, orders get delayed, and when heavy equipment goes down unexpectedly, the cost isn't just the repair; it's every idle hour on a job site waiting for a part that should have been ordered three days ago.

We wanted to fix this entire process: not by digitizing the form, but by eliminating it entirely.


What It Does

Symbiote is a voice agent orchestration platform for heavy equipment inspection. An inspector walks up to a machine, speaks what they see, and points their phone camera at it. The rest is automatic.

  • Voice input triggers the pipeline: no tapping, no typing, no menu navigation
  • An AI routing layer identifies the component from the description and maps it to the correct inspection knowledge base
  • Qwen2-VL-7B analyzes the image against reference good-condition images and equipment blueprints, detecting anomalies and assigning a severity classification: Monitor, Normal, Pass, or Fail
  • Matching parts are pulled from inventory automatically based on the findings, such as missing components
  • Gemini Live reads the findings back in plain language and asks, "Do you want me to place the order?"
  • On confirmation, a PDF report is generated and stored with a full audit trail of equipment ID, inspector ID, timestamp, findings, and order status

The entire loop with image capture, AI analysis, inventory check, order confirmation, and report upload happens inside a single voice conversation. No screen, no hands, no help required.
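The loop described above can be sketched as a linear pipeline. This is an illustrative sketch with stubbed stages and hypothetical names, not our production code:

```python
from dataclasses import dataclass, field

@dataclass
class InspectionResult:
    component: str
    severity: str                 # "Monitor" | "Normal" | "Pass" | "Fail"
    matched_parts: list = field(default_factory=list)

# Stubbed stages so the sketch is self-contained and runnable:
def route_component(transcript):
    """AI routing layer (stub): map the spoken description to a component."""
    return "access_ladder" if "ladder" in transcript else "unknown"

def analyze_image(component, image):
    """Vision model (stub): classify severity from the camera frame."""
    return "Fail" if image else "Pass"

def match_inventory(component, severity):
    """Inventory lookup (stub): suggest parts only when something failed."""
    return ["ladder_rung_kit"] if severity == "Fail" else []

def run_inspection(transcript, image):
    """One pass of the voice-triggered loop: route, analyze, match."""
    component = route_component(transcript)
    severity = analyze_image(component, image)
    parts = match_inventory(component, severity)
    return InspectionResult(component, severity, parts)
```

Each stage maps to one bullet above; the real versions call Gemini Live, the Modal endpoint, and Supabase respectively.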


How We Built It

Voice Orchestration Layer with Gemini Live API

Gemini 2.5 Flash runs the conversation. It listens to the inspector, decides when to trigger an inspection, calls our backend endpoints as tools mid-conversation, and speaks the results back. Function calling in the Live API is what makes the hands-free loop possible: Gemini doesn't just answer questions, it takes actions along the way.
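A minimal sketch of how our tools might be declared and dispatched. The declaration dicts follow the Live API's function-declaration shape, but the tool names, parameters, and handlers here are hypothetical stand-ins for our endpoints:

```python
# Hypothetical tool schema; the real names and parameters are our own.
INSPECTION_TOOLS = [{
    "function_declarations": [
        {
            "name": "run_inspection",
            "description": "Analyze the current camera frame for the named component.",
            "parameters": {
                "type": "object",
                "properties": {"component_hint": {"type": "string"}},
            },
        },
        {
            "name": "place_order",
            "description": "Commit the previewed parts order after verbal confirmation.",
            "parameters": {
                "type": "object",
                "properties": {"cart_id": {"type": "string"}},
                "required": ["cart_id"],
            },
        },
    ]
}]

def dispatch(tool_call):
    """Map a model tool call to a backend handler (stubs here)."""
    handlers = {
        "run_inspection": lambda args: {"severity": "Monitor"},
        "place_order": lambda args: {"status": "ordered", "cart": args["cart_id"]},
    }
    return handlers[tool_call["name"]](tool_call.get("args", {}))
```

The orchestration server receives tool calls like these mid-conversation, runs the handler, and returns the result for Gemini to speak.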

AI Inspection Core - Modal.com + Qwen2-VL

We run Qwen2-VL-7B-Instruct on Modal's A10G GPU. Each inspection passes up to three images to the model: a reference good-condition image, a component blueprint, and the live inspection photo. Comparing against known-good references gives the model grounding and dramatically reduces false positives. A custom routing layer identifies the correct component before analysis runs - loading the wrong knowledge base was the core failure mode we had to solve early.

Data Layer - Supabase

PostgreSQL handles inspections, fleet registry, task templates, inventory, and order cart. We built a Todo → Task branching system: when an inspection starts, it automatically generates the correct checklist for that machine type. Reports are stored as PDFs in Supabase Storage with public URLs returned in the inspection response.
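The Todo → Task branching can be sketched as a template expansion. In our stack the templates are rows in Supabase; here they are inline (and hypothetical) so the sketch runs on its own:

```python
# Hypothetical task templates keyed by machine type; the real rows
# live in the Supabase task-template table.
TASK_TEMPLATES = {
    "excavator": ["tracks", "hydraulics", "bucket", "access_ladder"],
    "wheel_loader": ["tires", "hydraulics", "bucket"],
}

def branch_checklist(machine_type, todo_id):
    """Expand one inspection Todo into per-component Tasks."""
    return [
        {"todo_id": todo_id, "component": comp, "status": "pending"}
        for comp in TASK_TEMPLATES.get(machine_type, [])
    ]
```

When an inspection starts, one insert into the Todo table fans out into the machine-specific checklist this way.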

Mobile - Swift iOS

The Swift app streams microphone audio over WebSocket to our Python orchestration server, which bridges to Gemini Live API. When an inspection fires, the app sends the camera frame and receives structured events back - cart preview, order confirmation, report URL - to update the UI in real time.

Orchestration Server - Python + WebSockets

A lightweight Python server sits between Swift and Gemini. It handles the image handoff, calls the Modal inspection endpoints, trims results for speech, and emits UI events back to the app. This keeps all intelligence server-side and the mobile client thin.
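Two of the server's jobs, trimming results for speech and emitting UI events, might look like this. The field names and event types are illustrative assumptions, not our exact wire format:

```python
def to_speech(findings):
    """Collapse structured findings into one spoken sentence for Gemini."""
    parts = ", ".join(p["name"] for p in findings.get("matched_parts", []))
    if parts:
        return (f"{findings['component']} marked {findings['severity']}. "
                f"Suggested parts: {parts}. Do you want me to place the order?")
    return f"{findings['component']} marked {findings['severity']}."

def to_ui_event(findings):
    """Structured event the Swift client renders (cart preview, result, etc.)."""
    event_type = "cart_preview" if findings.get("matched_parts") else "result"
    return {"type": event_type, "payload": findings}
```

The same findings object fans out twice: once compressed for the voice channel, once structured for the app's UI.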

Admin Dashboard - React + JavaScript

We built a secure, role-restricted admin dashboard for managers and fleet operators, developed in React and connected directly to our Supabase data layer. While inspectors operate hands-free in the field, the dashboard serves as the operational control center, providing a fleet-wide overview, health and severity trend analysis, inventory and part stock visibility, machine specifications management, order cart review, and a searchable archive of generated inspection reports. This layer gives leadership full oversight and control while keeping the inspection experience entirely voice-driven.


Challenges We Ran Into

Routing accuracy was our first real failure. Early tests showed the model misidentifying a damaged access ladder as a tire component because it was given the wrong prompt. We built a priority-based routing system with explicit hint matching, damage description inference, LLM classification, and misclassification overrides. Explicit user input always wins over model classification.
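The priority order can be sketched as a small cascade. The keyword and override tables below are hypothetical examples of the kind of entries we maintain:

```python
def route(user_hint, damage_text, llm_guess):
    """Priority routing: explicit hint > damage-keyword inference >
    LLM classification with misclassification overrides."""
    # 1. Explicit user input always wins.
    if user_hint:
        return user_hint
    # 2. Infer the component from damage-description keywords
    #    (hypothetical table).
    KEYWORDS = {"ladder": "access_ladder", "rung": "access_ladder",
                "tread": "tires", "sidewall": "tires"}
    for word, component in KEYWORDS.items():
        if word in damage_text.lower():
            return component
    # 3. Fall back to the LLM guess, correcting confusions we observed
    #    in testing (hypothetical mapping).
    OVERRIDES = {"tire_component": "tires"}
    return OVERRIDES.get(llm_guess, llm_guess)
```

The ladder-vs-tire failure is caught at stage 2: the word "rung" routes to the ladder knowledge base before the LLM's guess is ever consulted.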

Token budget management was critical for vision models. Full subsection knowledge prompts crowded out the actual image analysis. We trimmed prompts to extract only RED/YELLOW/GREEN detection rules and capped them at 2,000 characters before passing to the model.
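The trimming step is simple to show. This sketch assumes the knowledge base marks its detection rules with RED/YELLOW/GREEN line prefixes, as described above:

```python
import re

MAX_PROMPT_CHARS = 2000  # cap from our token-budget tuning

def extract_detection_rules(knowledge):
    """Keep only RED/YELLOW/GREEN rule lines, then enforce the cap,
    leaving the rest of the token budget for the images themselves."""
    rules = [line for line in knowledge.splitlines()
             if re.match(r"\s*(RED|YELLOW|GREEN):", line)]
    return "\n".join(rules)[:MAX_PROMPT_CHARS]
```

Everything else in the subsection knowledge (history, part descriptions, prose) is dropped before the prompt reaches the vision model.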

The order confirmation loop had to be split into two stages - preview and commit - so that orders are never placed without the inspector explicitly saying yes. A dropped connection or a closed app should never result in phantom orders in the system.
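The preview/commit split can be captured in a small state holder. A minimal sketch, with method and field names of our choosing:

```python
class OrderCart:
    """Two-stage order flow: nothing is ordered until commit() is
    called with the inspector's explicit confirmation."""
    def __init__(self):
        self.items = []
        self.committed = False

    def preview(self, parts):
        """Stage 1: stage the parts and describe them back to the inspector."""
        self.items = parts
        return {"stage": "preview", "items": parts}

    def commit(self, confirmed):
        """Stage 2: only an explicit 'yes' places the order.
        A dropped connection never reaches here with confirmed=True."""
        if not confirmed or not self.items:
            return {"stage": "preview", "ordered": False}
        self.committed = True
        return {"stage": "committed", "ordered": True, "items": self.items}
```

Because the commit path requires both staged items and an affirmative signal, a closed app or lost WebSocket simply leaves the cart in preview.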

Multi-image prompting required careful ordering. Inspection image last, reference images first. The model's attention focuses on the most recent content - getting this order wrong produced generic answers instead of meaningful comparisons.
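The ordering constraint shows up directly in how the chat message is assembled. This builder uses the Qwen2-VL-style content-list format; the label strings are our own:

```python
def build_messages(reference_url, blueprint_url, inspection_url, rules):
    """Assemble a multi-image prompt: reference images first, the live
    inspection photo last so it gets the model's most recent attention."""
    content = [
        {"type": "text", "text": "Reference good-condition image:"},
        {"type": "image", "image": reference_url},
        {"type": "text", "text": "Component blueprint:"},
        {"type": "image", "image": blueprint_url},
        {"type": "text", "text": rules},
        {"type": "text", "text": "Inspection photo to analyze:"},
        {"type": "image", "image": inspection_url},  # always last
    ]
    return [{"role": "user", "content": content}]
```

Swapping the inspection photo to the front was exactly the mistake that produced generic answers in our early tests.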


Accomplishments That We're Proud Of

  • A fully hands-free inspection loop that works end to end in under 60 seconds
  • Reference image comparison that gives the model visual grounding instead of relying on pure LLM knowledge
  • A two-stage order confirmation system that respects inspector authority - the AI recommends, the human decides
  • A routing layer that solved the specific failure modes present in the provided test data
  • Voice-to-report in a single conversation with no UI interaction required

What We Learned

  • Routing matters more than model quality. The best vision model gives wrong answers with the wrong prompt. Grounding the LLM in component-specific knowledge before the image is analyzed was the difference between a demo and something that actually works.

  • Voice interfaces change the design constraints entirely. You can't show a dropdown. You can't ask the user to scroll. Every interaction has to be completable in one sentence. That forces clarity in ways that screen-based design doesn't.

  • Split your pipeline at the confirmation boundary. Anything that writes to a database or places an order should require an explicit human signal. Don't assume intent from context.


What's Next for Symbiote

Predictive failure scoring - pulling inspection history per machine and projecting when a component will reach critical based on the rate of degradation. Move from reactive diagnosis to proactive prevention.
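One plausible first cut is a linear fit over a numeric severity scale. Both the scale and the threshold below are assumptions for illustration, not a shipped model:

```python
# Hypothetical numeric mapping of our severity labels for trend fitting.
SEVERITY_SCORE = {"Pass": 0, "Normal": 1, "Monitor": 2, "Fail": 3}
CRITICAL = 3  # the "Fail" threshold

def days_until_critical(history):
    """Least-squares line through (day, score) points; project how many
    days after the last inspection the component crosses CRITICAL.
    Returns None if the trend is flat or improving."""
    xs = [day for day, _ in history]
    ys = [SEVERITY_SCORE[label] for _, label in history]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    denom = sum((x - mean_x) ** 2 for x in xs)
    if denom == 0:
        return None
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / denom
    if slope <= 0:
        return None  # not degrading
    intercept = mean_y - slope * mean_x
    return (CRITICAL - intercept) / slope - xs[-1]
```

A component drifting from Pass toward Monitor over successive inspections gets a concrete "order the part by" horizon instead of waiting for a Fail.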

Expanded equipment support - the routing and prompt architecture is equipment-agnostic. Any machine with a parts catalog and a set of inspection categories can be onboarded.

Direct dealer integration - connecting the order cart to CAT dealer parts APIs so confirmed orders flow directly into procurement without any manual re-entry.
