
World Model Failure Classification for Autonomous Inspection

A machine learning framework for video-based failure detection and classification using world models with conformal prediction.

Overview

This project trains world models on video data to detect and classify failures during autonomous inspection tasks. It supports:

  • Anomaly detection (single-class) — detect when behavior deviates from normal
  • Uncertainty quantification via conformal prediction bands
  • Multiple OOD detection methods — Mahalanobis distance, reconstruction error, latent space metrics, and hybrid approaches
  • Real-time deployment via ROS integration
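For intuition, the Mahalanobis-distance method scores a latent feature vector by its distance from the distribution of features extracted from normal (calibration) videos. A minimal sketch with illustrative names and shapes, not the repo's actual API:

```python
import numpy as np

def mahalanobis_score(z, mean, cov_inv):
    """Distance of latent vector z from the calibration distribution."""
    diff = z - mean
    return float(np.sqrt(diff @ cov_inv @ diff))

# Fit mean/covariance on calibration latents (rows = samples), then score
rng = np.random.default_rng(0)
calib = rng.normal(size=(200, 8))  # stand-in for latent features
mean = calib.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(calib, rowvar=False))

score = mahalanobis_score(calib[0], mean, cov_inv)          # in-distribution: small
far_score = mahalanobis_score(mean + 10.0, mean, cov_inv)   # shifted input: large
```

Elevated scores relative to the calibration set indicate the model is seeing something unlike its training data.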

Directory Structure

GaugeFailClassification/
├── src/                            # Core library
│   ├── models/                     # Neural network models
│   │   ├── cosmos_world.py         # NVIDIA Cosmos-based world model
│   │   ├── cosmos_world_classifier.py  # Binary success/failure classifier
│   │   ├── latent_world.py         # Latent space world model
│   │   ├── svd_autoencoder.py      # SVD compression layer
│   │   ├── unet_autoencoder.py     # U-Net compression layer
│   │   ├── general.py              # Shared utilities and loss functions
│   │   └── legacy/                 # Latent normalizing flow (research)
│   ├── data/                       # Dataset loaders
│   │   ├── simple_datasets/        # Single/multi-video loaders (recommended)
│   │   └── complex_datasets/       # Multi-modal research datasets (JSON/parquet)
│   └── utils/                      # Config, logging, data processing helpers
│
├── scripts/                        # All runnable scripts, organized by purpose
│   ├── training/                   # Model training
│   ├── inference/                  # Classification, OOD scoring, conformal prediction
│   ├── evaluation/                 # Post-prediction analysis and metrics
│   ├── data_processing/            # Video conversion, frame extraction, trimming
│   ├── visualization/              # Histogram and distribution plots
│   └── ros/                        # ROS server/subscriber for real-time deployment
│
├── examples/                       # Example videos for testing (gitignored, see examples/README.md)
│   ├── success/                    # Normal/successful runs (train/ and test/)
│   ├── known_failure/              # Known failure cases (train/ and test/)
│   └── ood/                        # Out-of-distribution videos (testing only)
├── outputs/                        # Script outputs (gitignored)
├── pyproject.toml                  # Dependencies
└── uv.lock                        # Dependency lock file

Installation

Requires Python ~3.10, FFmpeg, and a CUDA-capable GPU (recommended).

# 1. Create and activate a virtual environment
python3.10 -m venv .venv
source .venv/bin/activate

# 2. Install project dependencies
uv sync        # or: pip install -e .

# 3. Install NVIDIA Cosmos Tokenizer (required for the world model)
#    Clone outside this repo, pull LFS weights, then install
git clone https://github.com/NVIDIA/Cosmos-Tokenizer.git /path/to/Cosmos-Tokenizer
cd /path/to/Cosmos-Tokenizer
# Install ffmpeg and git-lfs if not already installed
# macOS:  brew install ffmpeg git-lfs
# Linux:  sudo apt-get install -y ffmpeg git-lfs
git lfs pull
uv pip install -e .
cd -

Data Setup

Video Format

  • Format: .mp4 (preferred), .avi, .mov, or .mkv
  • Resolution: Any — frames are automatically resized to 384x208 by the data loaders
  • Length: Videos can be different lengths — the data loader handles each independently. Optionally, you can trim videos to keep only the last N frames, since the end of a video is typically where success or failure is determined:
# Optional: keep only the last 300 frames of each video (discards the beginning)
python scripts/data_processing/trim_videos.py --input_dir /path/to/raw/videos --output_dir /path/to/trimmed/videos --frames 300

Folder Structure

Organize your videos into the examples/ directory (video files are gitignored):

examples/
├── success/
│   ├── train/          # Success videos for training (auto-split into train/val)
│   └── test/           # Success videos for calibration (conformal prediction)
├── known_failure/
│   ├── train/          # Failure videos for training (binary classifier only)
│   └── test/           # Failure videos for evaluating detection
└── ood/                # Out-of-distribution videos (testing only, never trained on)
  • success/train/ — Videos of normal, successful behavior. The model learns what "normal" looks like from these.
  • success/test/ — Held-out success videos used to calibrate conformal prediction thresholds. Ideally these are separate from training videos so thresholds are computed on unseen data. If you don't have enough videos, you can reuse the training videos for calibration.
  • known_failure/ — Videos of known failure cases. Used for evaluating detection.
  • ood/ — Out-of-distribution videos (unusual environments, unseen failure modes). Testing only — never used for training.

See examples/README.md for more details.

Usage

Training

Train the world model on success videos (learns normal behavior):

python scripts/training/main_train.py \
    --train_data_dir examples/success/train \
    --val_data_dir examples/success/train \
    --batch_size 1 --epochs 10 --frame_skip 10

Loss function: The model minimizes a combined loss over frame pairs (f_t, f_{t+1}):

loss = reconstruction_error + reconstruction_error_delta + 0.5 * hybrid_anomaly_score
  • Reconstruction error — MSE + SSIM between the predicted next frame and the actual next frame. This is the primary training signal.
  • Reconstruction error delta — Difference between next-frame reconstruction error and cross-step error (predicting the current frame). Encourages the model to predict forward in time rather than copying the input.
  • Hybrid anomaly score — Latent prediction error + perceptual error + center error. During training on success videos this should be low; at inference time, elevated scores indicate anomalies.
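As a sketch of how these terms combine (the per-term helpers and the delta sign convention are assumptions for illustration; the repo's reconstruction term also includes an SSIM component omitted here):

```python
import numpy as np

def mse(a, b):
    return float(np.mean((a - b) ** 2))

def combined_loss(pred_next, frame_t, frame_t1, hybrid_anomaly_score):
    # Primary signal: predicted next frame vs. actual next frame
    reconstruction_error = mse(pred_next, frame_t1)
    # Delta term: penalize predictions that merely copy the current frame
    # (low error vs. frame_t), rewarding genuine forward prediction
    reconstruction_error_delta = reconstruction_error - mse(pred_next, frame_t)
    return reconstruction_error + reconstruction_error_delta + 0.5 * hybrid_anomaly_score
```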

Training arguments:

| Argument | Description |
|---|---|
| --train_data_dir | (Required) Directory of training videos |
| --val_data_dir | Validation video directory (defaults to --train_data_dir if not set) |
| --batch_size | Batch size (default: 32) |
| --epochs | Number of training epochs (default: 10) |
| --frame_skip | Sample every Nth frame (default: 1 = all frames). Higher values speed up training by skipping frames |
| --validation_stride | Use every Nth frame pair for validation (default: 3). Higher values speed up validation |
| --num_workers | DataLoader workers for training (default: 4) |
| --val_workers | DataLoader workers for validation (default: 1) |
| --weight_decay | Weight decay regularization (default: 1e-4) |
| --use_unet | Use U-Net compression layer |
| --use_svd | Use SVD compression layer with optional ratio (e.g., --use_svd 0.5) |
| --reconstruct_dir | Optional: reconstruct videos from this directory after training as a sanity check |
| --model_checkpoint | Resume training from an existing checkpoint |
| --wandb_entity / --wandb_project | Enable Weights & Biases logging |

Inference

Inference is a two-step process: (1) calibrate CP bands from success videos, then (2) score test videos against those bands.

Important: Use the same --frame_skip value for training, calibration, and classification. The model learns temporal patterns at a specific frame interval during training — using a different interval at inference will produce unreliable scores.

Step 1: Calibrate — Compute conformal prediction bands from calibration (success) videos:

python scripts/inference/calibrate.py \
    --model_checkpoint path/to/model.ckpt \
    --calibration_dir examples/success/test \
    --output_dir cp_bands \
    --frame_skip 10

This saves thresholds, calibration statistics, and per-metric scores to cp_bands/. Use --metrics to select specific metrics (default: all 7):

Available metrics: reconstruction_error, training_loss, mahalanobis, l2_to_mean, cosine_to_mean, latent_pred_error, latent_std

| Argument | Description |
|---|---|
| --model_checkpoint | (Required) Path to trained model checkpoint |
| --calibration_dir | (Required) Directory of calibration (success) videos |
| --output_dir | Output directory for CP bands (default: cp_bands) |
| --metrics | Comma-separated metric names, or all (default: all) |
| --alpha | Significance level; 0.05 = 95th percentile threshold (default: 0.05) |
| --frame_skip | Frame skip interval; should match training (default: 1) |
| --window_size | Sliding window for windowed metrics (default: 10) |
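Conceptually, calibration reduces to taking an empirical quantile of per-video scores on success data; test scores above that quantile are later flagged as OOD. A minimal sketch, assuming higher score means more anomalous (which matches metrics like reconstruction error):

```python
import numpy as np

def conformal_threshold(calibration_scores, alpha=0.05):
    """One-sided threshold: the (1 - alpha) quantile of calibration scores.
    With alpha = 0.05, roughly 5% of nominal videos fall above it."""
    return float(np.quantile(np.asarray(calibration_scores, float), 1.0 - alpha))

scores = [0.10, 0.11, 0.12, 0.15, 0.18, 0.20, 0.22, 0.25, 0.28, 0.30]
thr = conformal_threshold(scores, alpha=0.10)
is_ood = 0.50 > thr  # a test video scoring 0.50 would be flagged
```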

Step 2: Classify — Score test videos for OOD-ness using the saved CP bands:

python scripts/inference/classify.py \
    --model_checkpoint path/to/model.ckpt \
    --test_dir examples/known_failure/test \
    --bands_dir cp_bands \
    --output_dir ood_results

| Argument | Description |
|---|---|
| --model_checkpoint | (Required) Path to trained model checkpoint |
| --test_dir | (Required) Directory of test videos to score |
| --bands_dir | (Required) CP bands directory from calibrate.py |
| --output_dir | Output directory for results (default: ood_results) |
| --metrics | Score only specific metrics (default: all from bands_dir) |

Each metric produces a results.csv and a distribution.png histogram. In the histogram, scores to the left of the red threshold line are normal (not OOD), and scores to the right are flagged as OOD.

Optional: Recalculate threshold — Adjust the CP threshold at a different quantile without re-running calibration:

python scripts/inference/recalculate_threshold.py \
    cp_bands/reconstruction_error 0.90

Takes a metric's CP bands directory and a new quantile (e.g., 0.90 for 90th percentile), recomputes the threshold from saved calibration scores, and re-flags test videos.

Optional: Per-frame visualization — Plot per-frame scores for a single video against CP thresholds to see exactly when/where the threshold is exceeded:

python scripts/inference/classification_timeseries.py \
    --model_checkpoint path/to/model.ckpt \
    --video path/to/video.mp4 \
    --bands_dir cp_bands \
    --output_dir visualization_outputs

Produces a time-series plot (<video_name>_scores.png) with one subplot per metric showing scores over time, the threshold line, and shaded regions where the threshold is exceeded. Also saves per-frame scores as CSV. Use --metrics to plot only specific metrics.

Evaluation

Calculate detection rates — Compute detection rates from classify.py output:

python scripts/evaluation/calculate_detection_rate.py --results_dir ood_results -s

Scans the results directory for per-metric results.csv files and reports detection rates. Use -s for a summary table, -v for detailed output, or --input_file for a single file.
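Under the hood this amounts to counting flagged rows in each results.csv; a sketch with an assumed boolean column name (the script's actual CSV schema may differ):

```python
import csv
import io

def detection_rate(csv_text, flag_column="is_ood"):
    """Fraction of videos in a results CSV flagged as OOD."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if not rows:
        return 0.0
    flagged = sum(r[flag_column].strip().lower() in ("true", "1") for r in rows)
    return flagged / len(rows)

results = "video_name,is_ood\nrun_01.mp4,True\nrun_02.mp4,False\nrun_03.mp4,True\n"
rate = detection_rate(results)  # 2 of 3 videos flagged
```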

Compare detection overlap — Compare which videos are flagged as OOD across two different metrics:

python scripts/evaluation/compare_detection_overlap.py \
    ood_results/reconstruction_error/results.csv \
    ood_results/mahalanobis/results.csv \
    --key basename -v

Reports overlap statistics (Jaccard index, both/only-1/only-2/neither buckets). Use --list to print video names per bucket, --outdir to save bucket CSVs.
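The bucket and Jaccard logic can be sketched as set operations over flagged video names (function and variable names here are illustrative):

```python
def overlap_stats(flagged_1, flagged_2, all_videos):
    """Bucket videos by which metric flagged them; Jaccard = |both| / |union|."""
    a, b = set(flagged_1), set(flagged_2)
    union = a | b
    return {
        "both": a & b,
        "only_1": a - b,
        "only_2": b - a,
        "neither": set(all_videos) - union,
        "jaccard": len(a & b) / len(union) if union else 0.0,
    }

stats = overlap_stats({"v1", "v2"}, {"v2", "v3"}, {"v1", "v2", "v3", "v4"})
# "v2" is flagged by both metrics, "v4" by neither; Jaccard index is 1/3
```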

Compare detection timing — Measure how early or late the OOD detection is relative to human-labeled failure frames:

python scripts/evaluation/compare_detection_timing.py \
    --true_failures true_failures.csv \
    --detected_spikes ood_results/reconstruction_error/results.csv \
    --video_dir examples/known_failure/test \
    --output_dir timing_analysis \
    --frame_skip 10

Inputs:

  • --true_failures: CSV with columns video_name, true_failure_frames (human-labeled frame numbers)
  • --detected_spikes: Per-metric results.csv from classify.py (uses the exceeding_frames column)
  • --video_dir: Directory containing the videos (for FPS lookup)
  • --frame_skip: Frame skip rate used during inference (default: 1)
  • --use_first_detection: Optional flag to use the first detected spike instead of the closest

Outputs signed timing differences (negative = early detection, positive = late detection) with per-video and summary statistics.
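The signed difference can be sketched as follows (mapping subsampled detection indices back to original frame numbers via frame_skip is an assumption about how the script handles resampling):

```python
def timing_difference_seconds(true_frame, detected_frames, fps,
                              frame_skip=1, use_first_detection=False):
    """Signed detection delay in seconds: negative = early, positive = late."""
    # Scores are computed on the frame-skipped stream, so map detected
    # indices back to original frame numbers before comparing
    detected = [d * frame_skip for d in detected_frames]
    chosen = detected[0] if use_first_detection else min(
        detected, key=lambda d: abs(d - true_frame))
    return (chosen - true_frame) / fps

# Failure labeled at original frame 300; spikes at subsampled indices 25 and 40
diff = timing_difference_seconds(300, [25, 40], fps=30.0, frame_skip=10)
# closest mapped detection is frame 250, i.e. detected early (negative diff)
```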

Visualization

Plot score distributions — Compare calibration vs test score distributions with optional threshold line. Supports TikZ/pgf export for LaTeX papers:

python scripts/visualization/plot_score_distributions.py \
    --calibration_file cp_bands/reconstruction_error/calibration_scores.npy \
    --test1_csv ood_results/reconstruction_error/results.csv \
    --test2_csv ood_results_failure/reconstruction_error/results.csv \
    --threshold_file cp_bands/reconstruction_error/threshold.txt \
    --model_type success \
    --output_file score_distributions.png

Use --tikz for TikZ/pgf export, --model_type success|failure for automatic labeling (Nominal/OOD/Failure), and --bins, --alpha for plot customization.

Scripts Reference

| Folder | Purpose | Key Scripts |
|---|---|---|
| training/ | Model training | main_train.py |
| inference/ | OOD scoring and detection | calibrate.py, classify.py, recalculate_threshold.py, classification_timeseries.py |
| evaluation/ | Analyze results | calculate_detection_rate.py, compare_detection_overlap.py, compare_detection_timing.py |
| data_processing/ | Prepare video data | convert_rosbags.py, trim_videos.py, frames_to_video.py, convert_frames_to_seconds.py |
| visualization/ | Standalone plots | plot_score_distributions.py |
| ros/ | Real-time deployment | fail_server.py, fail_subscriber.py |

Models

  • CosmosWorld — World model built on NVIDIA Cosmos tokenizer for video prediction and anomaly detection
  • SVD/UNet Autoencoders — Optional compression layers for the latent space

About

Repository for ICRA 2026 paper "World Model Failure Classification and Anomaly Detection for Autonomous Inspection" by Ho et al.
