A machine learning framework for video-based failure detection and classification using world models with conformal prediction.
This project trains world models on video data to detect and classify failures during autonomous inspection tasks. It supports:
- Anomaly detection (single-class) — detect when behavior deviates from normal
- Uncertainty quantification via conformal prediction bands
- Multiple OOD detection methods — Mahalanobis distance, reconstruction error, latent space metrics, and hybrid approaches
- Real-time deployment via ROS integration
```
GaugeFailClassification/
├── src/                        # Core library
│   ├── models/                 # Neural network models
│   │   ├── cosmos_world.py            # NVIDIA Cosmos-based world model
│   │   ├── cosmos_world_classifier.py # Binary success/failure classifier
│   │   ├── latent_world.py            # Latent space world model
│   │   ├── svd_autoencoder.py         # SVD compression layer
│   │   ├── unet_autoencoder.py        # U-Net compression layer
│   │   ├── general.py                 # Shared utilities and loss functions
│   │   └── legacy/                    # Latent normalizing flow (research)
│   ├── data/                   # Dataset loaders
│   │   ├── simple_datasets/    # Single/multi-video loaders (recommended)
│   │   └── complex_datasets/   # Multi-modal research datasets (JSON/parquet)
│   └── utils/                  # Config, logging, data processing helpers
│
├── scripts/                    # All runnable scripts, organized by purpose
│   ├── training/               # Model training
│   ├── inference/              # Classification, OOD scoring, conformal prediction
│   ├── evaluation/             # Post-prediction analysis and metrics
│   ├── data_processing/        # Video conversion, frame extraction, trimming
│   ├── visualization/          # Histogram and distribution plots
│   └── ros/                    # ROS server/subscriber for real-time deployment
│
├── examples/                   # Example videos for testing (gitignored, see examples/README.md)
│   ├── success/                # Normal/successful runs (train/ and test/)
│   ├── known_failure/          # Known failure cases (train/ and test/)
│   └── ood/                    # Out-of-distribution videos (testing only)
├── outputs/                    # Script outputs (gitignored)
├── pyproject.toml              # Dependencies
└── uv.lock                     # Dependency lock file
```
Requires Python ~3.10, FFmpeg, and a CUDA-capable GPU (recommended).
```bash
# 1. Create and activate a virtual environment
python3.10 -m venv .venv
source .venv/bin/activate

# 2. Install project dependencies
uv sync  # or: pip install -e .

# 3. Install NVIDIA Cosmos Tokenizer (required for the world model)
#    Clone outside this repo, pull LFS weights, then install
git clone https://github.com/NVIDIA/Cosmos-Tokenizer.git /path/to/Cosmos-Tokenizer
cd /path/to/Cosmos-Tokenizer
# Install ffmpeg and git-lfs if not already installed
#   macOS: brew install ffmpeg git-lfs
#   Linux: sudo apt-get install -y ffmpeg git-lfs
git lfs pull
uv pip install -e .
cd -
```

- Format: `.mp4` (preferred), `.avi`, `.mov`, or `.mkv`
- Resolution: Any — frames are automatically resized to 384x208 by the data loaders
- Length: Videos can be different lengths — the data loader handles each independently. Optionally, you can trim videos to keep only the last N frames, since the end of a video is typically where success or failure is determined:
```bash
# Optional: keep only the last 300 frames of each video (discards the beginning)
python scripts/data_processing/trim_videos.py \
    --input_dir /path/to/raw/videos \
    --output_dir /path/to/trimmed/videos \
    --frames 300
```

Organize your videos into the examples/ directory (video files are gitignored):
```
examples/
├── success/
│   ├── train/          # Success videos for training (auto-split into train/val)
│   └── test/           # Success videos for calibration (conformal prediction)
├── known_failure/
│   ├── train/          # Failure videos for training (binary classifier only)
│   └── test/           # Failure videos for evaluating detection
└── ood/                # Out-of-distribution videos (testing only, never trained on)
```
- `success/train/` — Videos of normal, successful behavior. The model learns what "normal" looks like from these.
- `success/test/` — Held-out success videos used to calibrate conformal prediction thresholds. Ideally these are separate from the training videos so thresholds are computed on unseen data. If you don't have enough videos, you can reuse the training videos for calibration.
- `known_failure/` — Videos of known failure cases. Used for evaluating detection.
- `ood/` — Out-of-distribution videos (unusual environments, unseen failure modes). Testing only — never used for training.
See examples/README.md for more details.
Train the world model on success videos (learns normal behavior):
```bash
python scripts/training/main_train.py \
    --train_data_dir examples/success/train \
    --val_data_dir examples/success/train \
    --batch_size 1 --epochs 10 --frame_skip 10
```

Loss function: The model minimizes a combined loss over frame pairs (f_t, f_{t+1}):

```
loss = reconstruction_error + reconstruction_error_delta + 0.5 * hybrid_anomaly_score
```
- Reconstruction error — MSE + SSIM between the predicted next frame and the actual next frame. This is the primary training signal.
- Reconstruction error delta — Difference between next-frame reconstruction error and cross-step error (predicting the current frame). Encourages the model to predict forward in time rather than copying the input.
- Hybrid anomaly score — Latent prediction error + perceptual error + center error. During training on success videos this should be low; at inference time, elevated scores indicate anomalies.
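The combination above can be sketched in a few lines of numpy. This is illustrative only: the real reconstruction term uses MSE + SSIM and the hybrid score comes from the model's latent space; the function names and the sign convention of the delta term are assumptions of this sketch.

```python
import numpy as np

def mse(a: np.ndarray, b: np.ndarray) -> float:
    """Mean squared error between two frames."""
    return float(np.mean((a - b) ** 2))

def combined_loss(pred_next, actual_next, actual_current, hybrid_anomaly_score):
    # Primary signal: how well the predicted next frame matches the real one.
    recon = mse(pred_next, actual_next)
    # Cross-step error: how well the "next frame" prediction matches the
    # *current* frame. If the model just copies its input, cross is small
    # while recon is large, so the delta term penalizes copying.
    cross = mse(pred_next, actual_current)
    delta = recon - cross
    return recon + delta + 0.5 * hybrid_anomaly_score
```

On success videos all three terms should stay small; at inference time an elevated value of any term pushes the total up, which is what the OOD metrics pick up on.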
Training arguments:
| Argument | Description |
|---|---|
| `--train_data_dir` | (Required) Directory of training videos |
| `--val_data_dir` | Validation video directory (defaults to `--train_data_dir` if not set) |
| `--batch_size` | Batch size (default: 32) |
| `--epochs` | Number of training epochs (default: 10) |
| `--frame_skip` | Sample every Nth frame (default: 1 = all frames). Higher values speed up training by skipping frames |
| `--validation_stride` | Use every Nth frame pair for validation (default: 3). Higher values speed up validation |
| `--num_workers` | DataLoader workers for training (default: 4) |
| `--val_workers` | DataLoader workers for validation (default: 1) |
| `--weight_decay` | Weight decay regularization (default: 1e-4) |
| `--use_unet` | Use UNet compression layer |
| `--use_svd` | Use SVD compression layer with optional ratio (e.g., `--use_svd 0.5`) |
| `--reconstruct_dir` | Optional: reconstruct videos from this directory after training as a sanity check |
| `--model_checkpoint` | Resume training from an existing checkpoint |
| `--wandb_entity` / `--wandb_project` | Enable Weights & Biases logging |
Inference is a two-step process: (1) calibrate CP bands from success videos, then (2) score test videos against those bands.
Important: Use the same --frame_skip value for training, calibration, and classification. The model learns temporal patterns at a specific frame interval during training — using a different interval at inference will produce unreliable scores.
Step 1: Calibrate — Compute conformal prediction bands from calibration (success) videos:
```bash
python scripts/inference/calibrate.py \
    --model_checkpoint path/to/model.ckpt \
    --calibration_dir examples/success/test \
    --output_dir cp_bands \
    --frame_skip 10
```

This saves thresholds, calibration statistics, and per-metric scores to `cp_bands/`. Use `--metrics` to select specific metrics (default: all 7):
Available metrics: reconstruction_error, training_loss, mahalanobis, l2_to_mean, cosine_to_mean, latent_pred_error, latent_std
| Argument | Description |
|---|---|
| `--model_checkpoint` | (Required) Path to trained model checkpoint |
| `--calibration_dir` | (Required) Directory of calibration (success) videos |
| `--output_dir` | Output directory for CP bands (default: `cp_bands`) |
| `--metrics` | Comma-separated metric names, or `all` (default: `all`) |
| `--alpha` | Significance level; 0.05 = 95th percentile threshold (default: 0.05) |
| `--frame_skip` | Frame skip interval, should match training (default: 1) |
| `--window_size` | Sliding window for windowed metrics (default: 10) |
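Under the hood, a split-conformal threshold at significance level alpha is a conservative empirical quantile of the calibration scores. A minimal sketch of the idea (the function names here are hypothetical; calibrate.py's exact quantile convention may differ):

```python
import numpy as np

def conformal_threshold(calibration_scores, alpha=0.05):
    """Rank-based split-conformal quantile: with n calibration scores, a new
    in-distribution score exceeds the threshold with probability <= alpha."""
    scores = np.sort(np.asarray(calibration_scores, dtype=float))
    n = len(scores)
    rank = min(int(np.ceil((n + 1) * (1 - alpha))), n)  # 1-indexed rank
    return scores[rank - 1]

def is_ood(score, threshold):
    # A test score is flagged as OOD when it exceeds the calibrated band.
    return score > threshold
```

For example, with 100 calibration scores and `alpha=0.05`, the threshold lands on the 96th-smallest score, slightly above the plain 95th percentile because of the `(n + 1)` correction.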
Step 2: Classify — Score test videos for OOD-ness using the saved CP bands:
```bash
python scripts/inference/classify.py \
    --model_checkpoint path/to/model.ckpt \
    --test_dir examples/known_failure/test \
    --bands_dir cp_bands \
    --output_dir ood_results
```

| Argument | Description |
|---|---|
| `--model_checkpoint` | (Required) Path to trained model checkpoint |
| `--test_dir` | (Required) Directory of test videos to score |
| `--bands_dir` | (Required) CP bands directory from `calibrate.py` |
| `--output_dir` | Output directory for results (default: `ood_results`) |
| `--metrics` | Score only specific metrics (default: all from `bands_dir`) |
Each metric produces a results.csv and a distribution.png histogram. In the histogram, scores to the left of the red threshold line are normal (not OOD), and scores to the right are flagged as OOD.
Optional: Recalculate threshold — Adjust the CP threshold at a different quantile without re-running calibration:
```bash
python scripts/inference/recalculate_threshold.py \
    cp_bands/reconstruction_error 0.90
```

Takes a metric's CP bands directory and a new quantile (e.g., 0.90 for the 90th percentile), recomputes the threshold from saved calibration scores, and re-flags test videos.
Optional: Per-frame visualization — Plot per-frame scores for a single video against CP thresholds to see exactly when/where the threshold is exceeded:
```bash
python scripts/inference/classification_timeseries.py \
    --model_checkpoint path/to/model.ckpt \
    --video path/to/video.mp4 \
    --bands_dir cp_bands \
    --output_dir visualization_outputs
```

Produces a time-series plot (`<video_name>_scores.png`) with one subplot per metric showing scores over time, the threshold line, and shaded regions where the threshold is exceeded. Also saves per-frame scores as CSV. Use `--metrics` to plot only specific metrics.
Calculate detection rates — Compute detection rates from classify.py output:
```bash
python scripts/evaluation/calculate_detection_rate.py --results_dir ood_results -s
```

Scans the results directory for per-metric `results.csv` files and reports detection rates. Use `-s` for a summary table, `-v` for detailed output, or `--input_file` for a single file.
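At its core the detection rate is just the fraction of scored videos flagged as OOD. A minimal sketch (the `(video_name, flagged)` pair format is an assumption for illustration, not the script's actual CSV schema):

```python
def detection_rate(results):
    """Fraction of test videos flagged as OOD.

    `results` is a list of (video_name, flagged) pairs, mirroring one
    per-metric results file; returns 0.0 for an empty result set."""
    if not results:
        return 0.0
    flagged = sum(1 for _, is_flagged in results if is_flagged)
    return flagged / len(results)
```

On a known-failure test set this number is the true-positive rate; on held-out success videos it estimates the false-positive rate, which calibration aims to keep near `alpha`.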
Compare detection overlap — Compare which videos are flagged as OOD across two different metrics:
```bash
python scripts/evaluation/compare_detection_overlap.py \
    ood_results/reconstruction_error/results.csv \
    ood_results/mahalanobis/results.csv \
    --key basename -v
```

Reports overlap statistics (Jaccard index; both / only-1 / only-2 / neither buckets). Use `--list` to print video names per bucket, and `--outdir` to save bucket CSVs.
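The bucket and Jaccard computation reduces to set algebra over the two metrics' flagged-video lists; a sketch of the idea (function name hypothetical):

```python
def detection_overlap(flagged_1, flagged_2):
    """Bucket two metrics' flagged-video sets and compute their Jaccard
    index (|intersection| / |union|, defined as 1.0 when both are empty)."""
    s1, s2 = set(flagged_1), set(flagged_2)
    union = s1 | s2
    return {
        "both": s1 & s2,      # flagged by both metrics
        "only_1": s1 - s2,    # flagged by metric 1 only
        "only_2": s2 - s1,    # flagged by metric 2 only
        "jaccard": len(s1 & s2) / len(union) if union else 1.0,
    }
```

A low Jaccard index between two metrics suggests they catch different failure modes, which is exactly when a hybrid approach pays off.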
Compare detection timing — Measure how early or late the OOD detection is relative to human-labeled failure frames:
```bash
python scripts/evaluation/compare_detection_timing.py \
    --true_failures true_failures.csv \
    --detected_spikes ood_results/reconstruction_error/results.csv \
    --video_dir examples/known_failure/test \
    --output_dir timing_analysis \
    --frame_skip 10
```

Inputs:
- `--true_failures`: CSV with columns `video_name`, `true_failure_frames` (human-labeled frame numbers)
- `--detected_spikes`: Per-metric `results.csv` from `classify.py` (uses the `exceeding_frames` column)
- `--video_dir`: Directory containing the videos (for FPS lookup)
- `--frame_skip`: Frame skip rate used during inference (default: 1)
- `--use_first_detection`: Optional flag to use the first detected spike instead of the closest
Outputs signed timing differences (negative = early detection, positive = late detection) with per-video and summary statistics.
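The signed difference can be sketched as follows. This is a simplification with two assumptions: detected frame indices come from the skipped-frame sequence, and multiplying by `frame_skip` maps them back to original frame numbers.

```python
def timing_difference_seconds(true_failure_frame, detected_frame, fps, frame_skip=1):
    # Map the detection back to the original video's frame numbering
    # (assumption: detected_frame indexes the skipped-frame sequence),
    # then express the offset in seconds. Negative = detected early,
    # positive = detected late.
    detected_original = detected_frame * frame_skip
    return (detected_original - true_failure_frame) / fps
```

For example, with a labeled failure at frame 300 of a 30 fps video, a spike at skipped-frame index 28 under `--frame_skip 10` maps to original frame 280, i.e. detection about two thirds of a second early.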
Plot score distributions — Compare calibration vs test score distributions with optional threshold line. Supports TikZ/pgf export for LaTeX papers:
```bash
python scripts/visualization/plot_score_distributions.py \
    --calibration_file cp_bands/reconstruction_error/calibration_scores.npy \
    --test1_csv ood_results/reconstruction_error/results.csv \
    --test2_csv ood_results_failure/reconstruction_error/results.csv \
    --threshold_file cp_bands/reconstruction_error/threshold.txt \
    --model_type success \
    --output_file score_distributions.png
```

Use `--tikz` for TikZ/pgf export, `--model_type success|failure` for automatic labeling (Nominal/OOD/Failure), and `--bins`, `--alpha` for plot customization.
| Folder | Purpose | Key Scripts |
|---|---|---|
| `training/` | Model training | `main_train.py` |
| `inference/` | OOD scoring and detection | `calibrate.py`, `classify.py`, `recalculate_threshold.py`, `classification_timeseries.py` |
| `evaluation/` | Analyze results | `calculate_detection_rate.py`, `compare_detection_overlap.py`, `compare_detection_timing.py` |
| `data_processing/` | Prepare video data | `convert_rosbags.py`, `trim_videos.py`, `frames_to_video.py`, `convert_frames_to_seconds.py` |
| `visualization/` | Standalone plots | `plot_score_distributions.py` |
| `ros/` | Real-time deployment | `fail_server.py`, `fail_subscriber.py` |
- CosmosWorld — World model built on NVIDIA Cosmos tokenizer for video prediction and anomaly detection
- SVD/UNet Autoencoders — Optional compression layers for the latent space