Multimodal AI for Equipment Inspection, Operator Guidance, and Aftermarket Integration
HackIL 2026 Caterpillar Challenge | Raghav Tirumale, Ram Reddy, Neil Deo, Ved Vyas
- Offline-First On-Device Architecture: We built a local agent (`agent_local.py`) running Moondream, Ollama (`lfm2.5-thinking`), and Faster-Whisper. Combined with the app's robust local state persistence, inspections continue uninterrupted even in zero-connectivity environments.
- Production-Ready Mobile Client: A fully functioning React Native app for iOS and Android. Live camera, narration splicing, acoustic capture, and automated AI adjudication are fully integrated end-to-end.
- Multimodal Depth: Beyond vision, we fuse acoustics and speech. A trained `ConstructionExpert` model supplements foundation model reasoning, and contrastive audio embeddings detect engine anomalies.
Ensuring machine uptime by providing an operator-first solution that uses multimodal AI to close the loop between the operator, the technician, and the CAT parts warehouse.
We know the realities of a Caterpillar job site. We built an architecture that respects the physics, network constraints, and conditions of remote deployment.
1. How is it "Offline-First" if you use GPT-4o?
We don't rely on cloud models in a tunnel. Our repository includes a fully local inference node (`agent_local.py`) designed for ruggedized edge hardware (e.g., a Toughbook in a service truck). We use a local Moondream Station for vision, Ollama for adjudication, and Faster-Whisper (CTranslate2) running in int8 mode for fast local transcription. Even if the edge node disconnects, our mobile client uses a local Zustand store to cache state. The operator captures frames and splices audio completely on-device, and the app syncs the moment connectivity returns.
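The client-side cache is a Zustand store in the React Native app, but the offline pattern itself is simple: persist every finding to a local queue, then replay the queue when connectivity returns. A minimal Python sketch of that pattern (file name and function names are our own, not the app's):

```python
import json
from pathlib import Path

# Illustrative on-disk queue; the real client persists state via Zustand.
QUEUE_PATH = Path("pending_findings.json")

def enqueue_finding(finding: dict) -> None:
    """Append a finding to the on-disk queue so nothing is lost offline."""
    queue = json.loads(QUEUE_PATH.read_text()) if QUEUE_PATH.exists() else []
    queue.append(finding)
    QUEUE_PATH.write_text(json.dumps(queue))

def flush_queue(upload) -> int:
    """Replay queued findings through `upload` once connectivity returns.

    The queue is rewritten after every successful upload, so a failure
    mid-flush leaves the remaining items safely on disk.
    """
    if not QUEUE_PATH.exists():
        return 0
    queue = json.loads(QUEUE_PATH.read_text())
    sent = 0
    while queue:
        upload(queue[0])                       # raises on failure; queue keeps the rest
        queue.pop(0)
        QUEUE_PATH.write_text(json.dumps(queue))
        sent += 1
    QUEUE_PATH.unlink()                        # all delivered; clear the queue file
    return sent
```

Rewriting the file after each successful upload is what makes the flush crash-safe: at worst, one finding is re-uploaded, never dropped.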
2. How does Acoustic Analysis work over deafening background noise?
We don't compress audio. We lock recording to 48 kHz, 16-bit linear PCM, avoiding lossy MP3 compression that destroys key frequency bands. We window the signal and embed it with CLAP, comparing it against a machine-specific baseline tensor (e.g., `baseline_3066t.npz` for a specific idling engine). Because the contrastive embedding maps to a semantic acoustic space, steady-state ambient noise (like rock crushers) is normalized out against the healthy baseline, isolating the novelty of an internal engine knock.
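The distance check above can be sketched in a few lines. We take the CLAP embeddings as given (computing them requires the `laion/clap` model) and score each audio window against the healthy baseline; the centroid comparison and the mean-plus-k-sigma threshold are our illustrative choices, not necessarily the repo's exact rule:

```python
import numpy as np

def novelty_scores(window_embeds: np.ndarray, baseline: np.ndarray) -> np.ndarray:
    """Euclidean distance from each windowed embedding to the healthy-baseline centroid."""
    centroid = baseline.mean(axis=0)
    return np.linalg.norm(window_embeds - centroid, axis=1)

def flag_anomalies(window_embeds: np.ndarray, baseline: np.ndarray, k: float = 3.0) -> np.ndarray:
    """Flag windows whose distance exceeds the baseline's own spread (mean + k*std).

    Steady-state ambient noise present in the baseline recordings sits inside
    that spread; only genuinely novel acoustics (e.g., a knock) land outside it.
    """
    centroid = baseline.mean(axis=0)
    base_dist = np.linalg.norm(baseline - centroid, axis=1)
    threshold = base_dist.mean() + k * base_dist.std()
    return novelty_scores(window_embeds, baseline) > threshold
```

Deriving the threshold from the baseline's own variance is what makes the check machine-specific: a noisy quarry baseline simply yields a wider tolerance band.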
3. How does the Parts Picker avoid blind guessing against millions of SKUs?
We don't do blind similarity searches. The app knows the exact Machine Family from the inspection setup state. When an image is captured, GPT-4o-mini extracts a structured physical profile: geometry, material, interface type, and, crucially, subsystem context. This massively constrains the search space. We run vector search against `cat_parts_characteristics.jsonl` to grab the top candidates, then an LLM ranker strictly evaluates fitment to return the top 3 high-confidence matches.
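The first retrieval stage can be sketched as: filter the catalog by the metadata the app already knows (machine family, subsystem), then rank the survivors by vector similarity. The toy catalog, field names, and vectors below are ours; the real index is loaded from `cat_parts_characteristics.jsonl` and the final ranking is done by an LLM rather than cosine score alone:

```python
import numpy as np

# Toy catalog for illustration; real entries come from cat_parts_characteristics.jsonl.
CATALOG = [
    {"sku": "1R-0750", "family": "3066T", "subsystem": "fuel",    "vec": [0.9, 0.1, 0.0]},
    {"sku": "4N-5823", "family": "3066T", "subsystem": "cooling", "vec": [0.1, 0.9, 0.0]},
    {"sku": "7E-1234", "family": "D6",    "subsystem": "fuel",    "vec": [0.8, 0.2, 0.1]},
]

def cosine(a, b) -> float:
    a, b = np.asarray(a, float), np.asarray(b, float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def candidate_parts(profile_vec, family: str, subsystem: str, top_k: int = 240):
    """Stage 1: hard-constrain by family + subsystem, then rank by similarity.

    The metadata filter is what keeps the search from guessing across
    millions of SKUs before any vector math happens.
    """
    pool = [p for p in CATALOG if p["family"] == family and p["subsystem"] == subsystem]
    pool.sort(key=lambda p: cosine(profile_vec, p["vec"]), reverse=True)
    return pool[:top_k]
```

The hard filter runs before the similarity ranking, so a fuel-system part from the wrong machine family can never reach the LLM ranker no matter how visually similar it is.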
4. How do you prevent operators from poisoning the "Data Moat"?
Operators won't blindly click "Accept All" because CAugmenT forces active engagement. If a 'Critical' severity failure is detected, the app pauses audio recording, triggers a haptic warning, and uses text-to-speech to speak the lockout/tagout procedure into the operator's headset. Furthermore, our `ConstructionExpert` retraining pipeline explicitly mandates that operator-labeled data is only promoted to production if it improves the F1 score and localization precision on a held-out, master-mechanic-reviewed validation set.
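That promotion gate boils down to: evaluate both the production model and the retrained candidate on the held-out set, and promote only on strict improvement of both metrics. A minimal sketch with our own function names (the real pipeline's interface may differ):

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 from raw detection counts; 0.0 when the model finds nothing."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def should_promote(prod_counts, cand_counts, prod_loc_precision, cand_loc_precision) -> bool:
    """Promote an operator-labeled retrain only if BOTH held-out F1 and
    localization precision strictly improve over the production model."""
    return (f1_score(*cand_counts) > f1_score(*prod_counts)
            and cand_loc_precision > prod_loc_precision)
```

Requiring both metrics to improve is the poisoning defense: sloppy or adversarial labels that inflate recall at the cost of localization quality fail the gate.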
5. Will this app melt a 4-year-old Android device in the Texas heat?
No. We use a strictly event-driven architecture. We don't run continuous 60fps video inference on the device; we only pull a high-res still frame when the operator taps capture. Audio is recorded continuously, but on each capture the app splices out the exact narration segment for that frame and restarts the buffer. The phone acts as a lightweight capture terminal, offloading the thermally heavy processing to the edge node or cloud.
The operator narrates while capturing frames. The client splices audio so each frame pairs exactly with its narration segment. The backend transcribes, computes a pass/fail statistical hint via a lightweight CNN, and evaluates the frame/transcript against machine-specific checklist context. If a safety issue is detected, the app pauses and speaks lockout/tagout guidance. The final session compiles into a Cat Inspect-compatible PDF.
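The splice step amounts to mapping each frame's capture timestamp to a sample range in the continuous recording, so frame *i* is paired with the narration spoken since capture *i-1*. A minimal sketch under that timestamp convention (the app itself does this with `expo-av`; the function below is our own illustration):

```python
def splice_segments(capture_times_s, total_dur_s, sample_rate=48_000):
    """Map frame capture timestamps to (start_sample, end_sample) ranges.

    Frame i's narration segment is the audio recorded between capture i-1
    and capture i, at the recording's 48 kHz sample rate.
    """
    segments = []
    prev = 0.0
    for t in capture_times_s:
        t = min(t, total_dur_s)                # clamp to the recording's end
        segments.append((int(prev * sample_rate), int(t * sample_rate)))
        prev = t
    return segments
```

Because segments are derived from timestamps rather than re-encoded audio, the pairing stays sample-exact and the PCM stream is never recompressed.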
The operator captures a damaged component. A physical profile (geometry, material, color, condition, interface type) is extracted. This queries a vector store of Caterpillar parts, retrieving up to 240 candidates. An LLM ranker surfaces the top matches with fitment rationale and direct dealer links -- moving from detection to ordering in one tap.
The operator records a high-fidelity WAV clip of a running engine. The backend windows the signal, embeds it with CLAP, and calculates Euclidean distance against a healthy baseline tensor. Anomalies (e.g., metallic knocking) are flagged with semantic text prompts, catching degradation that visual inspections miss.
| Metric | How CAugmenT Delivers |
|---|---|
| Inspection Time | Removes manual paperwork; produces Cat Inspect-compatible PDFs automatically. |
| Aftermarket Revenue | Turns detected faults into direct part procurement links, supporting Cat's $28B services target. |
| Safety Incidents | Real-time spoken guidance for lockout/tagout on critical findings. |
| Workforce Scalability | Encodes veteran mechanic expertise, raising the diagnostic quality of junior operators. |
- Industrial UX: 54pt touch targets, haptic feedback, dark mode for high-glare environments.
- Local State: `Zustand` persists inspection findings; network drops don't interrupt workflow.
- Media Splicing: `expo-av` efficiently splices high-fidelity audio chunks per frame.
- Transcription: `Faster-Whisper` (CTranslate2)
- Vision/Reasoning: `GPT-4o Vision` (cloud) or `Moondream` + `Ollama` (edge)
- Acoustic Inference: `CLAP` (`laion/clap`) embeddings and novelty detection
- Specialized Vision: Custom `ConstructionExpert` transformer trained on construction imagery (channel/spatial attention stages)
- Reporting: Programmatic safety PDF form generation (`pdfmaker.py`)
- Node.js 18+
- Python 3.10+
- Expo Go on iOS or Android
npm install
chmod +x scripts/download-fonts.sh && ./scripts/download-fonts.sh
npm start

Scan the QR code with Expo Go.
python gptagent.py --host 0.0.0.0 --port 8081
python agent_local.py --host 0.0.0.0 --port 8080 --ollama-model lfm2.5-thinking

For detailed endpoint documentation, refer to app/GPTAGENT_API_REFERENCE.md.
Built for the HackIL 2026 Caterpillar Challenge.



