A machine learning framework for video-based failure detection and classification using world models with conformal prediction.
This project trains world models on video data to detect and classify failures during autonomous inspection tasks. It supports:
- Anomaly detection (single-class) — detect when behavior deviates from normal
- Uncertainty quantification via conformal prediction bands
- Multiple OOD detection methods — Mahalanobis distance, reconstruction error, latent space metrics, and hybrid approaches
- Real-time deployment via ROS integration
```
GaugeFailClassification/
├── src/                        # Core library
│   ├── models/                 # Neural network models
│   │   ├── cosmos_world.py            # NVIDIA Cosmos-based world model
│   │   ├── cosmos_world_classifier.py # Binary success/failure classifier
│   │   ├── latent_world.py            # Latent space world model
│   │   ├── svd_autoencoder.py         # SVD compression layer
│   │   ├── unet_autoencoder.py        # U-Net compression layer
│   │   ├── general.py                 # Shared utilities and loss functions
│   │   └── legacy/                    # Latent normalizing flow (research)
│   ├── data/                   # Dataset loaders
│   │   ├── simple_datasets/    # Single/multi-video loaders (recommended)
│   │   └── complex_datasets/   # Multi-modal research datasets (JSON/parquet)
│   └── utils/                  # Config, logging, data processing helpers
│
├── scripts/                    # All runnable scripts, organized by purpose
│   ├── training/               # Model training
│   ├── inference/              # Classification, OOD scoring, conformal prediction
│   ├── evaluation/             # Post-prediction analysis and metrics
│   ├── data_processing/        # Video conversion, frame extraction, trimming
│   ├── visualization/          # Histogram and distribution plots
│   └── ros/                    # ROS server/subscriber for real-time deployment
│
├── examples/                   # Example videos for testing (gitignored, see examples/README.md)
│   ├── success/                # Normal/successful runs (train/ and test/)
│   ├── known_failure/          # Known failure cases (train/ and test/)
│   └── ood/                    # Out-of-distribution videos (testing only)
├── outputs/                    # Script outputs (gitignored)
├── pyproject.toml              # Dependencies
└── uv.lock                     # Dependency lock file
```
Requires Python ~3.10, FFmpeg, and a CUDA-capable GPU (recommended).
```bash
# 1. Create and activate a virtual environment
python3.10 -m venv .venv
source .venv/bin/activate

# 2. Install project dependencies
uv sync  # or: pip install -e .

# 3. Install NVIDIA Cosmos Tokenizer (required for the world model)
#    Clone outside this repo, pull LFS weights, then install
git clone https://github.com/NVIDIA/Cosmos-Tokenizer.git /path/to/Cosmos-Tokenizer
cd /path/to/Cosmos-Tokenizer
# Install ffmpeg and git-lfs if not already installed
#   macOS: brew install ffmpeg git-lfs
#   Linux: sudo apt-get install -y ffmpeg git-lfs
git lfs pull
uv pip install -e .
cd -
```

- Format: `.mp4` (preferred), `.avi`, `.mov`, or `.mkv`
- Resolution: Any — frames are automatically resized to 384x208 by the data loaders
- Length: Videos can be different lengths — the data loader handles each independently. Optionally, you can trim videos to keep only the last N frames, since the end of a video is typically where success or failure is determined:
```bash
# Optional: keep only the last 300 frames of each video (discards the beginning)
python scripts/data_processing/trim_videos.py \
    --input_dir /path/to/raw/videos \
    --output_dir /path/to/trimmed/videos \
    --frames 300
```

Organize your videos into the examples/ directory (video files are gitignored):
```
examples/
├── success/
│   ├── train/          # Success videos for training (auto-split into train/val)
│   └── test/           # Success videos for calibration (conformal prediction)
├── known_failure/
│   ├── train/          # Failure videos for training (binary classifier only)
│   └── test/           # Failure videos for evaluating detection
└── ood/                # Out-of-distribution videos (testing only, never trained on)
```
- `success/train/` — Videos of normal, successful behavior. The model learns what "normal" looks like from these.
- `success/test/` — Held-out success videos used to calibrate conformal prediction thresholds. Ideally these are separate from the training videos so thresholds are computed on unseen data. If you don't have enough videos, you can reuse the training videos for calibration.
- `known_failure/` — Videos of known failure cases. Used for evaluating detection.
- `ood/` — Out-of-distribution videos (unusual environments, unseen failure modes). Testing only — never used for training.
See examples/README.md for more details.
Train the world model on success videos (learns normal behavior):
```bash
python scripts/training/main_train.py \
    --train_data_dir examples/success/train \
    --val_data_dir examples/success/train \
    --batch_size 1 --epochs 10 --frame_skip 10
```

Loss function: The model minimizes a combined loss over frame pairs (f_t, f_{t+1}):

```
loss = reconstruction_error + reconstruction_error_delta + 0.5 * hybrid_anomaly_score
```
- Reconstruction error — MSE + SSIM between the predicted next frame and the actual next frame. This is the primary training signal.
- Reconstruction error delta — Difference between next-frame reconstruction error and cross-step error (predicting the current frame). Encourages the model to predict forward in time rather than copying the input.
- Hybrid anomaly score — Latent prediction error + perceptual error + center error. During training on success videos this should be low; at inference time, elevated scores indicate anomalies.
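The combination above can be sketched in a few lines of numpy. This is illustrative only: the real reconstruction term uses MSE + SSIM and the hybrid score comes from the model's latent space; the function names and the sign convention of the delta term are assumptions of this sketch.

```python
import numpy as np

def mse(a: np.ndarray, b: np.ndarray) -> float:
    """Mean squared error between two frames."""
    return float(np.mean((a - b) ** 2))

def combined_loss(pred_next, actual_next, actual_current, hybrid_anomaly_score):
    # Primary signal: how well the predicted next frame matches the real one.
    recon = mse(pred_next, actual_next)
    # Cross-step error: how well the "next frame" prediction matches the
    # *current* frame. If the model just copies its input, cross is small
    # while recon is large, so the delta term penalizes copying.
    cross = mse(pred_next, actual_current)
    delta = recon - cross
    return recon + delta + 0.5 * hybrid_anomaly_score
```

On success videos all three terms should stay small; at inference time an elevated value of any term pushes the total up, which is what the OOD metrics pick up on.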
Training arguments:
| Argument | Description |
|---|---|
| `--train_data_dir` | (Required) Directory of training videos |
| `--val_data_dir` | Validation video directory (defaults to `--train_data_dir` if not set) |
| `--batch_size` | Batch size (default: 32) |
| `--epochs` | Number of training epochs (default: 10) |
| `--frame_skip` | Sample every Nth frame (default: 1 = all frames). Higher values speed up training by skipping frames |
| `--validation_stride` | Use every Nth frame pair for validation (default: 3). Higher values speed up validation |
| `--num_workers` | DataLoader workers for training (default: 4) |
| `--val_workers` | DataLoader workers for validation (default: 1) |
| `--weight_decay` | Weight decay regularization (default: 1e-4) |
| `--use_unet` | Use UNet compression layer |
| `--use_svd` | Use SVD compression layer with optional ratio (e.g., `--use_svd 0.5`) |
| `--reconstruct_dir` | Optional: reconstruct videos from this directory after training as a sanity check |
| `--model_checkpoint` | Resume training from an existing checkpoint |
| `--wandb_entity` / `--wandb_project` | Enable Weights & Biases logging |
Inference is a two-step process: (1) calibrate CP bands from success videos, then (2) score test videos against those bands.
Important: Use the same --frame_skip value for training, calibration, and classification. The model learns temporal patterns at a specific frame interval during training — using a different interval at inference will produce unreliable scores.
Step 1: Calibrate — Compute conformal prediction bands from calibration (success) videos:
```bash
python scripts/inference/calibrate.py \
    --model_checkpoint path/to/model.ckpt \
    --calibration_dir examples/success/test \
    --output_dir cp_bands \
    --frame_skip 10
```

This saves thresholds, calibration statistics, and per-metric scores to `cp_bands/`. Use `--metrics` to select specific metrics (default: all 7):
Available metrics: reconstruction_error, training_loss, mahalanobis, l2_to_mean, cosine_to_mean, latent_pred_error, latent_std
| Argument | Description |
|---|---|
| `--model_checkpoint` | (Required) Path to trained model checkpoint |
| `--calibration_dir` | (Required) Directory of calibration (success) videos |
| `--output_dir` | Output directory for CP bands (default: `cp_bands`) |
| `--metrics` | Comma-separated metric names, or `all` (default: `all`) |
| `--alpha` | Significance level; 0.05 = 95th percentile threshold (default: 0.05) |
| `--frame_skip` | Frame skip interval, should match training (default: 1) |
| `--window_size` | Sliding window for windowed metrics (default: 10) |
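Under the hood, a split-conformal threshold at significance level alpha is a conservative empirical quantile of the calibration scores. A minimal sketch of the idea (the function names here are hypothetical; calibrate.py's exact quantile convention may differ):

```python
import numpy as np

def conformal_threshold(calibration_scores, alpha=0.05):
    """Rank-based split-conformal quantile: with n calibration scores, a new
    in-distribution score exceeds the threshold with probability <= alpha."""
    scores = np.sort(np.asarray(calibration_scores, dtype=float))
    n = len(scores)
    rank = min(int(np.ceil((n + 1) * (1 - alpha))), n)  # 1-indexed rank
    return scores[rank - 1]

def is_ood(score, threshold):
    # A test score is flagged as OOD when it exceeds the calibrated band.
    return score > threshold
```

For example, with 100 calibration scores and `alpha=0.05`, the threshold lands on the 96th-smallest score, slightly above the plain 95th percentile because of the `(n + 1)` correction.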
Step 2: Classify — Score test videos for OOD-ness using the saved CP bands:
```bash
python scripts/inference/classify.py \
    --model_checkpoint path/to/model.ckpt \
    --test_dir examples/known_failure/test \
    --bands_dir cp_bands \
    --output_dir ood_results
```

| Argument | Description |
|---|---|
| `--model_checkpoint` | (Required) Path to trained model checkpoint |
| `--test_dir` | (Required) Directory of test videos to score |
| `--bands_dir` | (Required) CP bands directory from `calibrate.py` |
| `--output_dir` | Output directory for results (default: `ood_results`) |
| `--metrics` | Score only specific metrics (default: all from `bands_dir`) |
Each metric produces a results.csv and a distribution.png histogram. In the histogram, scores to the left of the red threshold line are normal (not OOD), and scores to the right are flagged as OOD.
Optional: Recalculate threshold — Adjust the CP threshold at a different quantile without re-running calibration:
```bash
python scripts/inference/recalculate_threshold.py \
    cp_bands/reconstruction_error 0.90
```

Takes a metric's CP bands directory and a new quantile (e.g., 0.90 for the 90th percentile), recomputes the threshold from saved calibration scores, and re-flags test videos.
Optional: Per-frame visualization — Plot per-frame scores for a single video against CP thresholds to see exactly when/where the threshold is exceeded:
```bash
python scripts/inference/classification_timeseries.py \
    --model_checkpoint path/to/model.ckpt \
    --video path/to/video.mp4 \
    --bands_dir cp_bands \
    --output_dir visualization_outputs
```

Produces a time-series plot (`<video_name>_scores.png`) with one subplot per metric showing scores over time, the threshold line, and shaded regions where the threshold is exceeded. Also saves per-frame scores as CSV. Use `--metrics` to plot only specific metrics.
Calculate detection rates — Compute detection rates from classify.py output:
```bash
python scripts/evaluation/calculate_detection_rate.py --results_dir ood_results -s
```

Scans the results directory for per-metric `results.csv` files and reports detection rates. Use `-s` for a summary table, `-v` for detailed output, or `--input_file` for a single file.
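At its core the detection rate is just the fraction of scored videos flagged as OOD. A minimal sketch (the `(video_name, flagged)` pair format is an assumption for illustration, not the script's actual CSV schema):

```python
def detection_rate(results):
    """Fraction of test videos flagged as OOD.

    `results` is a list of (video_name, flagged) pairs, mirroring one
    per-metric results file; returns 0.0 for an empty result set."""
    if not results:
        return 0.0
    flagged = sum(1 for _, is_flagged in results if is_flagged)
    return flagged / len(results)
```

On a known-failure test set this number is the true-positive rate; on held-out success videos it estimates the false-positive rate, which calibration aims to keep near `alpha`.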
Compare detection overlap — Compare which videos are flagged as OOD across two different metrics:
```bash
python scripts/evaluation/compare_detection_overlap.py \
    ood_results/reconstruction_error/results.csv \
    ood_results/mahalanobis/results.csv \
    --key basename -v
```

Reports overlap statistics (Jaccard index; both / only-1 / only-2 / neither buckets). Use `--list` to print video names per bucket, and `--outdir` to save bucket CSVs.
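The bucket and Jaccard computation reduces to set algebra over the two metrics' flagged-video lists; a sketch of the idea (function name hypothetical):

```python
def detection_overlap(flagged_1, flagged_2):
    """Bucket two metrics' flagged-video sets and compute their Jaccard
    index (|intersection| / |union|, defined as 1.0 when both are empty)."""
    s1, s2 = set(flagged_1), set(flagged_2)
    union = s1 | s2
    return {
        "both": s1 & s2,      # flagged by both metrics
        "only_1": s1 - s2,    # flagged by metric 1 only
        "only_2": s2 - s1,    # flagged by metric 2 only
        "jaccard": len(s1 & s2) / len(union) if union else 1.0,
    }
```

A low Jaccard index between two metrics suggests they catch different failure modes, which is exactly when a hybrid approach pays off.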
Compare detection timing — Measure how early or late the OOD detection is relative to human-labeled failure frames:
```bash
python scripts/evaluation/compare_detection_timing.py \
    --true_failures true_failures.csv \
    --detected_spikes ood_results/reconstruction_error/results.csv \
    --video_dir examples/known_failure/test \
    --output_dir timing_analysis \
    --frame_skip 10
```

Inputs:
- `--true_failures`: CSV with columns `video_name`, `true_failure_frames` (human-labeled frame numbers)
- `--detected_spikes`: Per-metric `results.csv` from `classify.py` (uses the `exceeding_frames` column)
- `--video_dir`: Directory containing the videos (for FPS lookup)
- `--frame_skip`: Frame skip rate used during inference (default: 1)
- `--use_first_detection`: Optional flag to use the first detected spike instead of the closest
Outputs signed timing differences (negative = early detection, positive = late detection) with per-video and summary statistics.
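The signed difference can be sketched as follows. This is a simplification with two assumptions: detected frame indices come from the skipped-frame sequence, and multiplying by `frame_skip` maps them back to original frame numbers.

```python
def timing_difference_seconds(true_failure_frame, detected_frame, fps, frame_skip=1):
    # Map the detection back to the original video's frame numbering
    # (assumption: detected_frame indexes the skipped-frame sequence),
    # then express the offset in seconds. Negative = detected early,
    # positive = detected late.
    detected_original = detected_frame * frame_skip
    return (detected_original - true_failure_frame) / fps
```

For example, with a labeled failure at frame 300 of a 30 fps video, a spike at skipped-frame index 28 under `--frame_skip 10` maps to original frame 280, i.e. detection about two thirds of a second early.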
Plot score distributions — Compare calibration vs test score distributions with optional threshold line. Supports TikZ/pgf export for LaTeX papers:
```bash
python scripts/visualization/plot_score_distributions.py \
    --calibration_file cp_bands/reconstruction_error/calibration_scores.npy \
    --test1_csv ood_results/reconstruction_error/results.csv \
    --test2_csv ood_results_failure/reconstruction_error/results.csv \
    --threshold_file cp_bands/reconstruction_error/threshold.txt \
    --model_type success \
    --output_file score_distributions.png
```

Use `--tikz` for TikZ/pgf export, `--model_type success|failure` for automatic labeling (Nominal/OOD/Failure), and `--bins`, `--alpha` for plot customization.
| Folder | Purpose | Key Scripts |
|---|---|---|
| `training/` | Model training | `main_train.py` |
| `inference/` | OOD scoring and detection | `calibrate.py`, `classify.py`, `recalculate_threshold.py`, `classification_timeseries.py` |
| `evaluation/` | Analyze results | `calculate_detection_rate.py`, `compare_detection_overlap.py`, `compare_detection_timing.py` |
| `data_processing/` | Prepare video data | `convert_rosbags.py`, `trim_videos.py`, `frames_to_video.py`, `convert_frames_to_seconds.py` |
| `visualization/` | Standalone plots | `plot_score_distributions.py` |
| `ros/` | Real-time deployment | `fail_server.py`, `fail_subscriber.py` |
- CosmosWorld — World model built on NVIDIA Cosmos tokenizer for video prediction and anomaly detection
- SVD/UNet Autoencoders — Optional compression layers for the latent space