Caterpillar Logo

CAugmenT: Omni Inspect

Multimodal AI for Equipment Inspection, Operator Guidance, and Aftermarket Integration

HackIL 2026 Caterpillar Challenge | Raghav Tirumale, Ram Reddy, Neil Deo, Ved Vyas

CAugmenT App Overview

Critical Highlights

  • Offline-First On-Device Architecture: We built a local agent (agent_local.py) running Moondream, Ollama (lfm2.5-thinking), and Faster-Whisper. Combined with our app's robust local state persistence, inspections continue flawlessly even in zero-connectivity environments.
  • Production-Ready Mobile Client: A fully functioning React Native app for iOS/Android. Live camera, narration splicing, acoustic capture, and automated AI adjudication are fully integrated end-to-end.
  • Multimodal Depth: Beyond vision, we fuse acoustics and speech. A trained ConstructionExpert model supplements foundation model reasoning, and contrastive audio embeddings detect engine anomalies.

What is CAugmenT?

CAugmenT ensures machine uptime with an operator-first solution that uses multimodal AI to close the loop between the operator, the technician, and the CAT parts warehouse.


Engineering Reality: Answering the Hard Questions

We know the realities of a Caterpillar job site. We built an architecture that respects the physics, network constraints, and conditions of remote deployment.

1. How is it "Offline-First" if you use GPT-4o?
We don't rely on cloud models in a tunnel. Our repository includes a fully local inference node (agent_local.py) designed for ruggedized edge hardware (e.g., Toughbook in a service truck). We use a local Moondream Station for vision, Ollama for adjudication, and Faster-Whisper (CTranslate2) running in local int8 mode for fast transcription. Even if the edge node disconnects, our mobile client uses a local Zustand store to cache state. The operator captures frames and splices audio completely on-device, and the app syncs the moment connectivity returns.
2. How does Acoustic Analysis work over deafening background noise?
We don't compress audio. We lock recording to 48kHz, 16-bit LINEAR PCM, avoiding lossy MP3 compression that destroys key frequency bands. We window the signal and embed it with CLAP, comparing it against a machine-specific tensor (e.g., baseline_3066t.npz for a specific idling engine). Because the contrastive embedding maps audio into a semantic acoustic space, steady-state ambient noise (like rock crushers) is normalized out against the healthy baseline, isolating the novelty of an internal engine knock.
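Stripped to its skeleton, the pipeline above is: window the signal, embed each window, measure Euclidean distance to the healthy-baseline embedding, and flag windows past a threshold. A toy Python sketch, with a crude stand-in `embed()` in place of the real CLAP encoder and made-up window sizes and thresholds:

```python
import math

def windows(signal, size, hop):
    """Slice a 1-D sample list into fixed-size overlapping windows."""
    return [signal[i:i + size] for i in range(0, len(signal) - size + 1, hop)]

def embed(window):
    # Stand-in for the CLAP audio encoder: a crude 2-D "embedding" of
    # (mean absolute level, mean absolute first difference).
    level = sum(abs(x) for x in window) / len(window)
    rough = sum(abs(b - a) for a, b in zip(window, window[1:])) / (len(window) - 1)
    return (level, rough)

def novelty(window_emb, baseline_emb):
    """Euclidean distance from the healthy-baseline embedding."""
    return math.dist(window_emb, baseline_emb)

def flag_anomalies(signal, baseline_emb, size=4, hop=2, threshold=0.5):
    """Indices of windows whose embedding drifts too far from the baseline."""
    return [i for i, w in enumerate(windows(signal, size, hop))
            if novelty(embed(w), baseline_emb) > threshold]
```

A steady signal matching the baseline produces no flags; a transient spike (the toy analogue of an engine knock) pushes its windows away from the baseline embedding and gets flagged.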
3. How does the Parts Picker avoid blind guessing against millions of SKUs?
We don't do blind similarity searches. The app knows the exact Machine Family from the inspection setup state. When an image is captured, GPT-4o-mini extracts a structured physical profile: geometry, material, interface type, and crucially, subsystem context. This massively constrains the search space. We run vector search against cat_parts_characteristics.jsonl to grab the top candidates, then an LLM ranker strictly evaluates fitment to return the top 3 high-confidence matches.
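The picker above is, in essence, a two-stage funnel: a hard filter on what the app already knows (machine family, subsystem), then a similarity ranking over the survivors. A minimal Python sketch with a toy catalog and a dot product standing in for the real vector store and LLM ranker (all SKUs, fields, and vectors here are invented):

```python
def similarity(a, b):
    """Dot product as a stand-in for the vector store's similarity metric."""
    return sum(x * y for x, y in zip(a, b))

def pick_parts(catalog, machine_family, subsystem, profile_vec, top_k=3):
    # Stage 1: hard filter -- the app knows the machine family from inspection
    # setup state, so we never search outside it.
    candidates = [p for p in catalog
                  if p["family"] == machine_family and p["subsystem"] == subsystem]
    # Stage 2: rank survivors by similarity to the extracted physical profile.
    ranked = sorted(candidates,
                    key=lambda p: similarity(p["vec"], profile_vec),
                    reverse=True)
    return ranked[:top_k]
```

The hard filter is what keeps the search tractable: similarity only ever runs over parts that could physically fit the machine in front of the operator.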
4. How do you prevent operators from poisoning the "Data Moat"?
Operators won't blindly click "Accept All" because CAugmenT forces active engagement. If a 'Critical' severity failure is detected, the app pauses audio recording, triggers a haptic warning, and uses Text-to-Speech to physically speak the lockout/tagout procedure into their headset. Furthermore, our ConstructionExpert retraining pipeline explicitly mandates that operator-labeled data is only promoted to production if it improves the F1 score and localization precision on a held-out, master-mechanic-reviewed validation set.
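The retraining gate in point 4 reduces to a strict comparison against the incumbent model on the held-out validation set. A minimal sketch of that promotion check (field names and the exact metrics are illustrative of the policy, not the pipeline's actual schema):

```python
def f1(tp, fp, fn):
    """F1 score from confusion counts on the held-out validation set."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def should_promote(candidate, incumbent):
    """Promote only if the candidate strictly improves both gated metrics."""
    return (f1(*candidate["counts"]) > f1(*incumbent["counts"])
            and candidate["loc_precision"] > incumbent["loc_precision"])
```

Requiring strict improvement on both metrics means a batch of sloppy operator labels that degrades either one simply never reaches production.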
5. Will this app melt a 4-year-old Android device in the Texas heat?
No. We use a strictly event-driven architecture. We don't run continuous 60fps video inference on the device. We only pull a high-res still frame when the operator taps to capture. Audio is recorded continuously, but on capture the client splices out the exact segment for that frame and restarts the recorder. The phone acts as a lightweight capture terminal, offloading the thermally heavy inference to the edge node or cloud.
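The splice-on-capture behavior is just bookkeeping over timestamps: record continuously, and on each capture emit the audio span since the previous capture (or since recording started). A sketch of that bookkeeping (timestamps in seconds; the real client does this with expo-av, not Python):

```python
def splice_segments(recording_start, capture_times):
    """Pair each captured frame with the audio span since the previous capture.

    Returns a (start, end) tuple per frame, so frame i's narration is exactly
    the audio recorded between the previous capture and capture i.
    """
    segments = []
    prev = recording_start
    for t in capture_times:
        segments.append((prev, t))  # narration that belongs to this frame
        prev = t
    return segments
```

Because each segment ends exactly where the next begins, no narration is ever lost or duplicated across frames.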

The Workflows

1. Multimodal Inspection

The operator narrates while capturing frames. The client splices audio so each frame pairs exactly with its narration segment. The backend transcribes, computes a pass/fail statistical hint via a lightweight CNN, and evaluates the frame/transcript against machine-specific checklist context. If a safety issue is detected, the app pauses and speaks lockout/tagout guidance. The final session compiles into a Cat Inspect-compatible PDF.
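The per-frame adjudication above fuses the CNN's statistical hint with the narration and checklist context. A toy sketch of that decision step (the severity rules, term list, and field names are invented for illustration; the real system adjudicates with an LLM, not keyword matching):

```python
# Hypothetical terms that should always escalate, regardless of the CNN hint.
CRITICAL_TERMS = {"hydraulic leak", "cracked boom", "exposed wiring"}

def adjudicate(cnn_fail_prob, transcript, checklist_item):
    """Combine the CNN pass/fail hint with narration and checklist context."""
    text = transcript.lower()
    if any(term in text for term in CRITICAL_TERMS):
        # Critical findings trigger the pause + spoken lockout/tagout flow.
        return {"status": "fail", "severity": "critical", "item": checklist_item}
    if cnn_fail_prob > 0.5:
        return {"status": "fail", "severity": "moderate", "item": checklist_item}
    return {"status": "pass", "severity": None, "item": checklist_item}
```

The point of the structure is precedence: an operator explicitly narrating a critical hazard overrides a benign-looking frame, while the CNN hint catches what the operator doesn't say.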

Multimodal Inspection: Live camera capture with AI adjudication and real-time inspection status.

2. Visual Parts Picker

The operator captures a damaged component. A physical profile (geometry, material, color, condition, interface type) is extracted. This queries a vector store of Caterpillar parts, retrieving up to 240 candidates. An LLM ranker surfaces the top matches with fitment rationale and direct dealer links -- moving from detection to ordering in one tap.

Parts Picker: Capture a damaged component and receive ranked Cat part matches with dealer links.

3. Acoustic Engine Analysis

The operator records a high-fidelity WAV clip of a running engine. The backend windows the signal, embeds it with CLAP, and calculates Euclidean distance against a healthy baseline tensor. Anomalies (e.g., metallic knocking) are flagged with semantic text prompts, catching degradation that visual inspections miss.

Engine Analyzer: Record engine audio and receive CLAP-based anomaly detection results.

Business Outcomes

  • Inspection Time: Removes manual paperwork; produces Cat Inspect-compatible PDFs automatically.
  • Aftermarket Revenue: Turns detected faults into direct part procurement links, supporting Cat's $28B services target.
  • Safety Incidents: Real-time spoken guidance for lockout/tagout on critical findings.
  • Workforce Scalability: Encodes veteran mechanic expertise, raising the diagnostic quality of junior operators.

Architecture & Stack

Client (React Native / Expo)

  • Industrial UX: 54pt touch targets, haptic feedback, dark mode for high-glare environments.
  • Local State: Zustand persists inspection findings. Network drops don't interrupt workflow.
  • Media Splicing: expo-av efficiently splices high-fidelity audio chunks per frame.

Backend (gptagent.py / Local Node agent_local.py)

  • Transcription: Faster Whisper (CTranslate2)
  • Vision/Reasoning: GPT-4o Vision (Cloud) OR Moondream + Ollama (Edge)
  • Acoustic Inference: CLAP (laion/clap) embeddings and novelty detection
  • Specialized Vision: Custom ConstructionExpert transformer trained on construction imagery (channel/spatial attention stages)
  • Reporting: Programmatic safety PDF form generation (pdfmaker.py)

Quick Start

Prerequisites

  • Node.js 18+
  • Python 3.10+
  • Expo Go on iOS or Android

Mobile App

npm install
chmod +x scripts/download-fonts.sh && ./scripts/download-fonts.sh
npm start

Scan the QR code with Expo Go.

Backend Server (Cloud Model Integration)

python gptagent.py --host 0.0.0.0 --port 8081

Local / Offline Agent (Edge Deployment)

python agent_local.py --host 0.0.0.0 --port 8080 --ollama-model lfm2.5-thinking

For detailed endpoint documentation, refer to app/GPTAGENT_API_REFERENCE.md.


Built for the HackIL 2026 Caterpillar Challenge.
