CATCare

Inspiration

While many applications of AI seek to replace human decision-making, we believe the most useful systems are the ones that amplify human expertise with structure, traceability, and less paperwork. On heavy equipment inspections, the bottleneck is rarely “knowing what’s wrong”; it’s capturing it consistently: which part, what severity, what evidence (photo), and what follow-up.

What it does

CATCare is a voice-first inspection guide that can run on a helmet/suit-mounted phone or smart glasses (e.g., Meta Ray-Bans). As a technician inspects a Caterpillar 950 Wheel Loader, CATCare listens to natural speech (“engine block is leaking”) and turns it into a structured, industry-style inspection checklist.

Under the hood, the Inspector (Generator) agent guides the technician through inspection zones (Ground, Engine, Cab Exterior, Cab Interior), asks targeted clarifying questions, and maps conversational findings into typed JSON fields with standardized status codes (GREEN / YELLOW / RED) plus notes and timestamps. Smart localization, built on feature-matching models, estimates the technician's position relative to the Wheel Loader, so CATCare can guide the inspector through the process without missing any critical information. Additionally, we incorporate real-time audio-based anomaly detection, using MFCC features to distinguish "good-condition" engine noise from anomalous engine noise and better understand the status of the engine. When a defect is reported, CATCare automatically captures and attaches images and produces a complete report that can be exported as a photo-rich PDF summary.
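As a rough illustration of the audio comparison described above, the sketch below scores MFCC frames against a "good-condition" baseline. It assumes the MFCC vectors have already been extracted by an audio library; the function names, the z-score rule, and the threshold of 3.0 are illustrative assumptions, not CATCare's actual detector.

```python
import math
from statistics import mean, pstdev


def fit_baseline(good_frames):
    """Per-coefficient mean and std over MFCC frames from a healthy engine."""
    columns = list(zip(*good_frames))
    means = [mean(c) for c in columns]
    # Floor the std so a constant coefficient cannot cause division by zero.
    stds = [max(pstdev(c), 1e-6) for c in columns]
    return means, stds


def anomaly_score(frame, baseline):
    """Root-mean-square z-score of one MFCC frame against the baseline."""
    means, stds = baseline
    squared = [((x - m) / s) ** 2 for x, m, s in zip(frame, means, stds)]
    return math.sqrt(sum(squared) / len(squared))


def is_anomalous(frames, baseline, threshold=3.0):
    """Flag a clip when its average frame score exceeds the threshold.

    The threshold is an assumed value and would need tuning on real recordings.
    """
    return mean(anomaly_score(f, baseline) for f in frames) > threshold
```

A baseline would be fitted once per engine model from known-good recordings, then each new clip would be scored during the Engine zone of the inspection.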

For managers, the Reviewer agent supports natural-language review over historical inspections: query by serial number, perform CRUD updates on past reports, surface degrading trends over time, and generate executive summaries. The inspection outputs also create labeled, structured data that can later be used to train automated defect detection models.
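One simple way to surface a degrading trend is to rank the status codes and look for parts whose severity only moves upward over time. The sketch below assumes each historical record is a dict with hypothetical `part`, `status`, and ISO-formatted `timestamp` keys for a single machine; the Reviewer agent's real logic may differ.

```python
from collections import defaultdict

# Higher number means worse condition.
SEVERITY = {"GREEN": 0, "YELLOW": 1, "RED": 2}


def degrading_parts(records):
    """Return parts whose status never improved and ended worse than it started."""
    history = defaultdict(list)
    # ISO 8601 timestamps in the same format sort chronologically as strings.
    for rec in sorted(records, key=lambda r: r["timestamp"]):
        history[rec["part"]].append(SEVERITY[rec["status"]])

    flagged = []
    for part, levels in history.items():
        never_improves = all(a <= b for a, b in zip(levels, levels[1:]))
        if never_improves and levels[-1] > levels[0]:
            flagged.append(part)
    return sorted(flagged)
```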

How we built it

Google GenAI + Google ADK (Gemini 2.5 Flash) - multi-agent system with two personas (Inspector/Generator + Manager/Reviewer), structured JSON tool outputs, clarification loops, report generation.

PyTorch, SuperPoint, LightGlue - computer vision for zone localization + frame-by-frame matching to align findings with machine regions.

Flutter + WebSockets + Silero VAD + Gemini STT - low-latency audio streaming, pause detection for natural turn-taking, iterative speech-to-text on raw audio bytes.

FastAPI + Uvicorn (ASGI) - backend sessions, image uploads, real-time voice channels, report persistence.

Firebase - persistent storage for photos, reports, and historical inspection records.

Flutter Mobile App - field UI for live voice interaction, instant feedback, on-the-fly inspection logging.
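To illustrate the pause detection used for turn-taking in the audio pipeline above, here is a minimal energy-based stand-in. It is not Silero VAD, which is a neural model; this sketch only shows the surrounding logic of ending a turn after speech is followed by a run of quiet frames. The threshold and frame counts are assumed values.

```python
import math


def rms(frame):
    """Root-mean-square amplitude of one non-empty frame of PCM samples."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def end_of_turn_index(frames, energy_threshold=500.0, min_silent_frames=3):
    """Return the index of the frame where the turn is considered over, or None.

    A turn ends once speech has been heard and is then followed by
    `min_silent_frames` consecutive frames below the energy threshold.
    """
    heard_speech = False
    silent_run = 0
    for i, frame in enumerate(frames):
        if rms(frame) >= energy_threshold:
            heard_speech = True
            silent_run = 0  # any speech resets the pause counter
        elif heard_speech:
            silent_run += 1
            if silent_run >= min_silent_frames:
                return i
    return None
```

Requiring speech before counting silence keeps the detector from closing a turn while the technician has not yet said anything.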

Challenges we ran into

Processing multi-modal streams concurrently (audio + frame capture + agent reasoning) without adding latency.

Keeping the agentic workflow reliable end-to-end (voice turns -> JSON fields -> photo linkage -> final report) while avoiding hallucinated fields.

First time hacking!!! :D Learning how to leverage Gemini to learn fast and deploy fast.
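On the hallucinated-fields challenge above, one common guard is to validate every agent-produced finding against a fixed allow-list before it reaches the report. The sketch below uses the zones and status codes named earlier; the field names (`zone`, `part`, `status`, `notes`, `timestamp`, `photo_ids`) are hypothetical and stand in for whatever schema the agents actually emit.

```python
from datetime import datetime

ZONES = {"Ground", "Engine", "Cab Exterior", "Cab Interior"}
STATUSES = {"GREEN", "YELLOW", "RED"}
REQUIRED = {"zone", "part", "status", "notes", "timestamp"}  # assumed schema
OPTIONAL = {"photo_ids"}  # assumed schema


def validate_finding(raw):
    """Return a list of problems; an empty list means the finding is accepted."""
    if not isinstance(raw, dict):
        return ["finding must be a JSON object"]

    problems = []
    unknown = set(raw) - REQUIRED - OPTIONAL
    missing = REQUIRED - set(raw)
    if unknown:
        problems.append(f"unknown fields: {sorted(unknown)}")
    if missing:
        problems.append(f"missing fields: {sorted(missing)}")

    zone = raw.get("zone")
    if "zone" in raw and not (isinstance(zone, str) and zone in ZONES):
        problems.append(f"zone must be one of {sorted(ZONES)}")

    status = raw.get("status")
    if "status" in raw and not (isinstance(status, str) and status in STATUSES):
        problems.append(f"status must be one of {sorted(STATUSES)}")

    if "timestamp" in raw:
        try:
            datetime.fromisoformat(raw["timestamp"])
        except (TypeError, ValueError):
            problems.append("timestamp is not ISO 8601")
    return problems
```

A non-empty result could be fed back to the agent as a clarification prompt instead of being written to the report.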

What's next for CATCare

Expand support across more of the Caterpillar fleet and inspection templates.

Use the self-labeled inspection data (speech + images + part/status labels) to train and deploy full-stack automated inspection/detection models, with humans in the loop for verification.

Integrate with more hardware, such as Meta Ray-Ban smart glasses and other custom camera setups.