Multimodal AI for Equipment Inspection, Operator Guidance, and Aftermarket Integration
HackIL 2026 Caterpillar Challenge | Raghav Tirumale, Ram Reddy, Neil Deo, Ved Vyas
- Offline-First On-Device Architecture: We built a local agent (`agent_local.py`) running Moondream, Ollama (`lfm2.5-thinking`), and Faster-Whisper. Combined with the app's robust local state persistence, inspections continue uninterrupted even in zero-connectivity environments.
- Production-Ready Mobile Client: A fully functioning React Native app for iOS and Android. Live camera, narration splicing, acoustic capture, and automated AI adjudication are fully integrated end-to-end.
- Multimodal Depth: Beyond vision, we fuse acoustics and speech. A trained `ConstructionExpert` model supplements foundation model reasoning, and contrastive audio embeddings detect engine anomalies.
Ensuring machine uptime by providing an operator-first solution that uses multimodal AI to close the loop between the operator, the technician, and the CAT parts warehouse.
We know the realities of a Caterpillar job site. We built an architecture that respects the physics, network constraints, and conditions of remote deployment.
1. How is it "Offline-First" if you use GPT-4o?
We don't rely on cloud models in a tunnel. Our repository includes a fully local inference node (`agent_local.py`) designed for ruggedized edge hardware (e.g., a Toughbook in a service truck). We use a local Moondream Station for vision, Ollama for adjudication, and Faster-Whisper (CTranslate2) running in int8 mode for fast local transcription. Even if the edge node disconnects, our mobile client uses a local Zustand store to cache state. The operator captures frames and splices audio completely on-device, and the app syncs the moment connectivity returns.
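The client-side cache is a Zustand store in the React Native app, but the offline pattern itself is simple: persist every finding to a local queue, then replay the queue when connectivity returns. A minimal Python sketch of that pattern (file name and function names are our own, not the app's):

```python
import json
from pathlib import Path

# Illustrative on-disk queue; the real client persists state via Zustand.
QUEUE_PATH = Path("pending_findings.json")

def enqueue_finding(finding: dict) -> None:
    """Append a finding to the on-disk queue so nothing is lost offline."""
    queue = json.loads(QUEUE_PATH.read_text()) if QUEUE_PATH.exists() else []
    queue.append(finding)
    QUEUE_PATH.write_text(json.dumps(queue))

def flush_queue(upload) -> int:
    """Replay queued findings through `upload` once connectivity returns.

    The queue is rewritten after every successful upload, so a failure
    mid-flush leaves the remaining items safely on disk.
    """
    if not QUEUE_PATH.exists():
        return 0
    queue = json.loads(QUEUE_PATH.read_text())
    sent = 0
    while queue:
        upload(queue[0])                       # raises on failure; queue keeps the rest
        queue.pop(0)
        QUEUE_PATH.write_text(json.dumps(queue))
        sent += 1
    QUEUE_PATH.unlink()                        # all delivered; clear the queue file
    return sent
```

Rewriting the file after each successful upload is what makes the flush crash-safe: at worst, one finding is re-uploaded, never dropped.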
2. How does Acoustic Analysis work over deafening background noise?
We don't compress audio. We lock recording to 48 kHz, 16-bit linear PCM, avoiding lossy MP3 compression that destroys key frequency bands. We window the signal and embed it with CLAP, comparing it against a machine-specific baseline tensor (e.g., `baseline_3066t.npz` for a specific idling engine). Because the contrastive embedding maps to a semantic acoustic space, steady-state ambient noise (like rock crushers) is normalized out against the healthy baseline, isolating the novelty of an internal engine knock.
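The distance check above can be sketched in a few lines. We take the CLAP embeddings as given (computing them requires the `laion/clap` model) and score each audio window against the healthy baseline; the centroid comparison and the mean-plus-k-sigma threshold are our illustrative choices, not necessarily the repo's exact rule:

```python
import numpy as np

def novelty_scores(window_embeds: np.ndarray, baseline: np.ndarray) -> np.ndarray:
    """Euclidean distance from each windowed embedding to the healthy-baseline centroid."""
    centroid = baseline.mean(axis=0)
    return np.linalg.norm(window_embeds - centroid, axis=1)

def flag_anomalies(window_embeds: np.ndarray, baseline: np.ndarray, k: float = 3.0) -> np.ndarray:
    """Flag windows whose distance exceeds the baseline's own spread (mean + k*std).

    Steady-state ambient noise present in the baseline recordings sits inside
    that spread; only genuinely novel acoustics (e.g., a knock) land outside it.
    """
    centroid = baseline.mean(axis=0)
    base_dist = np.linalg.norm(baseline - centroid, axis=1)
    threshold = base_dist.mean() + k * base_dist.std()
    return novelty_scores(window_embeds, baseline) > threshold
```

Deriving the threshold from the baseline's own variance is what makes the check machine-specific: a noisy quarry baseline simply yields a wider tolerance band.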
3. How does the Parts Picker avoid blind guessing against millions of SKUs?
We don't do blind similarity searches. The app knows the exact Machine Family from the inspection setup state. When an image is captured, GPT-4o-mini extracts a structured physical profile: geometry, material, interface type, and, crucially, subsystem context. This massively constrains the search space. We run vector search against `cat_parts_characteristics.jsonl` to grab the top candidates, then an LLM ranker strictly evaluates fitment to return the top 3 high-confidence matches.
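The first retrieval stage can be sketched as: filter the catalog by the metadata the app already knows (machine family, subsystem), then rank the survivors by vector similarity. The toy catalog, field names, and vectors below are ours; the real index is loaded from `cat_parts_characteristics.jsonl` and the final ranking is done by an LLM rather than cosine score alone:

```python
import numpy as np

# Toy catalog for illustration; real entries come from cat_parts_characteristics.jsonl.
CATALOG = [
    {"sku": "1R-0750", "family": "3066T", "subsystem": "fuel",    "vec": [0.9, 0.1, 0.0]},
    {"sku": "4N-5823", "family": "3066T", "subsystem": "cooling", "vec": [0.1, 0.9, 0.0]},
    {"sku": "7E-1234", "family": "D6",    "subsystem": "fuel",    "vec": [0.8, 0.2, 0.1]},
]

def cosine(a, b) -> float:
    a, b = np.asarray(a, float), np.asarray(b, float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def candidate_parts(profile_vec, family: str, subsystem: str, top_k: int = 240):
    """Stage 1: hard-constrain by family + subsystem, then rank by similarity.

    The metadata filter is what keeps the search from guessing across
    millions of SKUs before any vector math happens.
    """
    pool = [p for p in CATALOG if p["family"] == family and p["subsystem"] == subsystem]
    pool.sort(key=lambda p: cosine(profile_vec, p["vec"]), reverse=True)
    return pool[:top_k]
```

The hard filter runs before the similarity ranking, so a fuel-system part from the wrong machine family can never reach the LLM ranker no matter how visually similar it is.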
4. How do you prevent operators from poisoning the "Data Moat"?
Operators won't blindly click "Accept All" because CAugmenT forces active engagement. If a 'Critical' severity failure is detected, the app pauses audio recording, triggers a haptic warning, and uses text-to-speech to speak the lockout/tagout procedure into the operator's headset. Furthermore, our `ConstructionExpert` retraining pipeline explicitly mandates that operator-labeled data is only promoted to production if it improves the F1 score and localization precision on a held-out, master-mechanic-reviewed validation set.
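That promotion gate boils down to: evaluate both the production model and the retrained candidate on the held-out set, and promote only on strict improvement of both metrics. A minimal sketch with our own function names (the real pipeline's interface may differ):

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 from raw detection counts; 0.0 when the model finds nothing."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def should_promote(prod_counts, cand_counts, prod_loc_precision, cand_loc_precision) -> bool:
    """Promote an operator-labeled retrain only if BOTH held-out F1 and
    localization precision strictly improve over the production model."""
    return (f1_score(*cand_counts) > f1_score(*prod_counts)
            and cand_loc_precision > prod_loc_precision)
```

Requiring both metrics to improve is the poisoning defense: sloppy or adversarial labels that inflate recall at the cost of localization quality fail the gate.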
5. Will this app melt a 4-year-old Android device in the Texas heat?
No. We use a strictly event-driven architecture. We don't run continuous 60fps video inference on the device; we only pull a high-res still frame when the operator taps capture. Audio is recorded continuously, but on each capture the app splices out the exact narration segment for that frame and restarts the buffer. The phone acts as a lightweight capture terminal, offloading the thermally heavy processing to the edge node or cloud.
The operator narrates while capturing frames. The client splices audio so each frame pairs exactly with its narration segment. The backend transcribes, computes a pass/fail statistical hint via a lightweight CNN, and evaluates the frame/transcript against machine-specific checklist context. If a safety issue is detected, the app pauses and speaks lockout/tagout guidance. The final session compiles into a Cat Inspect-compatible PDF.
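The splice step amounts to mapping each frame's capture timestamp to a sample range in the continuous recording, so frame *i* is paired with the narration spoken since capture *i-1*. A minimal sketch under that timestamp convention (the app itself does this with `expo-av`; the function below is our own illustration):

```python
def splice_segments(capture_times_s, total_dur_s, sample_rate=48_000):
    """Map frame capture timestamps to (start_sample, end_sample) ranges.

    Frame i's narration segment is the audio recorded between capture i-1
    and capture i, at the recording's 48 kHz sample rate.
    """
    segments = []
    prev = 0.0
    for t in capture_times_s:
        t = min(t, total_dur_s)                # clamp to the recording's end
        segments.append((int(prev * sample_rate), int(t * sample_rate)))
        prev = t
    return segments
```

Because segments are derived from timestamps rather than re-encoded audio, the pairing stays sample-exact and the PCM stream is never recompressed.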
The operator captures a damaged component. A physical profile (geometry, material, color, condition, interface type) is extracted. This queries a vector store of Caterpillar parts, retrieving up to 240 candidates. An LLM ranker surfaces the top matches with fitment rationale and direct dealer links -- moving from detection to ordering in one tap.
The operator records a high-fidelity WAV clip of a running engine. The backend windows the signal, embeds it with CLAP, and calculates Euclidean distance against a healthy baseline tensor. Anomalies (e.g., metallic knocking) are flagged with semantic text prompts, catching degradation that visual inspections miss.
| Metric | How CAugmenT Delivers |
|---|---|
| Inspection Time | Removes manual paperwork; produces Cat Inspect-compatible PDFs automatically. |
| Aftermarket Revenue | Turns detected faults into direct part procurement links, supporting Cat's $28B services target. |
| Safety Incidents | Real-time spoken guidance for lockout/tagout on critical findings. |
| Workforce Scalability | Encodes veteran mechanic expertise, raising the diagnostic quality of junior operators. |
- Industrial UX: 54pt touch targets, haptic feedback, dark mode for high-glare environments.
- Local State: `Zustand` persists inspection findings; network drops don't interrupt workflow.
- Media Splicing: `expo-av` efficiently splices high-fidelity audio chunks per frame.
- Transcription: `Faster-Whisper` (CTranslate2)
- Vision/Reasoning: `GPT-4o Vision` (cloud) or `Moondream` + `Ollama` (edge)
- Acoustic Inference: `CLAP` (`laion/clap`) embeddings and novelty detection
- Specialized Vision: Custom `ConstructionExpert` transformer trained on construction imagery (channel/spatial attention stages)
- Reporting: Programmatic safety PDF form generation (`pdfmaker.py`)
- Node.js 18+
- Python 3.10+
- Expo Go on iOS or Android
npm install
chmod +x scripts/download-fonts.sh && ./scripts/download-fonts.sh
npm start

Scan the QR code with Expo Go.
python gptagent.py --host 0.0.0.0 --port 8081
python agent_local.py --host 0.0.0.0 --port 8080 --ollama-model lfm2.5-thinking

For detailed endpoint documentation, refer to app/GPTAGENT_API_REFERENCE.md.
Built for the HackIL 2026 Caterpillar Challenge.



