
CastSlice

"Stop burning GPU dollars. Start slicing."



🔴 The Problem in 2026

In the era of ubiquitous AI, GPU scarcity is no longer the only bottleneck; GPU waste is. Most development, CI/CD, and inference workloads request a full NVIDIA GPU but use less than 15% of its capacity.

- Cloud bills: you pay for 100% of a GPU while your workloads use a fraction of it.
- Scheduling bottlenecks: Pods sit Pending waiting for a "full GPU" while existing GPUs run idle.
- Developer friction: teams manually edit YAML to share resources.

🟢 The CastSlice Solution

CastSlice is a lightweight, non-invasive Kubernetes Mutating Webhook that automatically converts "Whole GPU" requests into "Fractional/Shared GPU" slices based on smart policy.

It sits in your K8s control plane, intercepts Pod creation, and performs on-the-fly resource transformation, without changing a single line of your application code.


✨ Key Features

| Feature | The "Old" Way | The CastSlice Way |
| --- | --- | --- |
| Cost | Full GPU per Pod | Shared GPU across multiple Pods |
| Concurrency | 1 Pod per GPU | Multiple Pods per GPU |
| Developer UX | Manual YAML changes | Zero-touch. Just add an annotation. |
| Vendor lock-in | Locked to specific CSP tools | Cloud agnostic. Works on EKS, GKE, AKS, or on-prem. |

🛠 How It Works

CastSlice transparently rewrites nvidia.com/gpu resource requests into nvidia.com/gpu-shared resource requests for Pods that opt in via an annotation.

Pod CREATE request
       │
       ▼
 Kubernetes API server
       │ (forwards to webhook)
       ▼
 CastSlice webhook
       │
       ├── castops.io/optimize: "true" annotation present?
       │        │ YES                      │ NO
       │        ▼                          ▼
       │  resolve slice ratio          allow unchanged
       │  (slice-ratio > workload-type > default: 1)
       │        │
       │  remove nvidia.com/gpu
       │  add    nvidia.com/gpu-shared: <ratio>
       │        │
       ▼        ▼
 JSON Patch returned → Pod scheduled with shared GPU

Annotations:

| Annotation | Value | Effect |
| --- | --- | --- |
| `castops.io/optimize` | `"true"` | Enable GPU slice optimization (required) |
| `castops.io/workload-type` | `training` / `inference` / `batch` / `dev` | Select a preset slice ratio |
| `castops.io/slice-ratio` | `"N"` (positive integer) | Override the slice count directly |

Preset ratios by workload type:

| Workload Type | GPU Slices | Use Case |
| --- | --- | --- |
| `training` | 4 | Model training jobs (higher GPU share) |
| `inference` | 2 | Serving / Triton inference servers |
| `batch` | 2 | Batch preprocessing and feature extraction |
| `dev` | 1 | Development and debugging (default) |
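The precedence described above (explicit slice ratio, then workload-type preset, then a default of 1) can be sketched in Go. This is an illustrative sketch rather than the actual webhook code: the annotation keys and preset values come from the tables above, while the function names and the plain-map stand-in for the container's `corev1.ResourceList` are assumptions.

```go
package main

import (
	"fmt"
	"strconv"
)

// Preset slice ratios by workload type (see the table above).
var presetRatios = map[string]int{
	"training":  4,
	"inference": 2,
	"batch":     2,
	"dev":       1,
}

// resolveSliceRatio applies the documented precedence:
// castops.io/slice-ratio > castops.io/workload-type preset > default 1.
func resolveSliceRatio(annotations map[string]string) int {
	if raw, ok := annotations["castops.io/slice-ratio"]; ok {
		if n, err := strconv.Atoi(raw); err == nil && n > 0 {
			return n
		}
	}
	if wt, ok := annotations["castops.io/workload-type"]; ok {
		if n, ok := presetRatios[wt]; ok {
			return n
		}
	}
	return 1 // default: single dev-style slice
}

// rewriteGPULimits mimics the mutation step: drop nvidia.com/gpu and add
// nvidia.com/gpu-shared with the resolved ratio. A plain map stands in
// for the container's resource limits here.
func rewriteGPULimits(limits map[string]string, ratio int) {
	if _, ok := limits["nvidia.com/gpu"]; !ok {
		return // no GPU requested: nothing to rewrite
	}
	delete(limits, "nvidia.com/gpu")
	limits["nvidia.com/gpu-shared"] = strconv.Itoa(ratio)
}

func main() {
	ann := map[string]string{
		"castops.io/optimize":      "true",
		"castops.io/workload-type": "inference",
	}
	limits := map[string]string{"nvidia.com/gpu": "1"}
	rewriteGPULimits(limits, resolveSliceRatio(ann))
	fmt.Println(limits) // map[nvidia.com/gpu-shared:2]
}
```

In the real webhook the change is emitted as a JSON Patch in the admission response, as shown in the flow diagram, rather than mutating the object in place.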

🚀 Quick Start

Prerequisites

  • Go 1.24+
  • A Kubernetes cluster (KinD / Minikube for local testing)
  • cert-manager for TLS certificate injection

1. Install CastSlice

# Install cert-manager (if not already present)
kubectl apply -f https://github.com/cert-manager/cert-manager/releases/latest/download/cert-manager.yaml

# Wait for cert-manager to be ready
kubectl rollout status deployment/cert-manager -n cert-manager

# Deploy CastSlice
kubectl apply -f https://github.com/castops/cast-slice/releases/latest/download/install.yaml

# Create the TLS certificate for the webhook (issued by cert-manager)
kubectl apply -f config/cert/certificate.yaml

# Wait for the webhook pod to be ready
kubectl rollout status deployment/cast-slice -n cast-slice

2. Deploy an Optimized Workload

Add the castops.io/optimize annotation and optionally specify the workload type:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama-inference
spec:
  template:
    metadata:
      annotations:
        castops.io/optimize: "true"
        castops.io/workload-type: "inference"  # → gpu-shared: 2
    spec:
      containers:
      - name: ollama
        image: ollama/ollama
        resources:
          limits:
            nvidia.com/gpu: 1 # CastSlice rewrites this based on workload type

For fine-grained control, use an explicit ratio:

annotations:
  castops.io/optimize: "true"
  castops.io/slice-ratio: "8"   # explicit override → gpu-shared: 8

3. Verify It's Working

# Check the mutated pod
kubectl get pod -o yaml | grep gpu-shared
# training workload: nvidia.com/gpu-shared: "4"
# inference workload: nvidia.com/gpu-shared: "2"
# dev workload (default): nvidia.com/gpu-shared: "1"

📊 Metrics & Monitoring (v0.3.0)

CastSlice exposes Prometheus metrics on :8080/metrics via the standard controller-runtime metrics server.

Exposed Metrics

| Metric | Type | Description |
| --- | --- | --- |
| `castslice_requests_total` | Counter | Total admission requests processed |
| `castslice_mutations_total` | Counter | Pods mutated with GPU slice rewrites |
| `castslice_noop_total` | Counter | Pods allowed without mutation |
| `castslice_errors_total` | Counter | Requests rejected with an error |

Access the Metrics Endpoint

# Port-forward the metrics service
kubectl port-forward svc/cast-slice-metrics 8080:8080 -n cast-slice

# Scrape metrics
curl http://localhost:8080/metrics | grep castslice

Example output:

# HELP castslice_errors_total Total number of admission requests rejected with an error.
# TYPE castslice_errors_total counter
castslice_errors_total 0
# HELP castslice_mutations_total Total number of Pods mutated with GPU slice rewrites.
# TYPE castslice_mutations_total counter
castslice_mutations_total 42
# HELP castslice_noop_total Total number of Pods allowed without mutation (no annotation or no GPU limits).
# TYPE castslice_noop_total counter
castslice_noop_total 158
# HELP castslice_requests_total Total number of admission requests processed by the CastSlice webhook.
# TYPE castslice_requests_total counter
castslice_requests_total 200

Grafana Dashboard

A ready-to-use Grafana dashboard is included at config/monitoring/grafana-dashboard.yaml. It provides 5 panels:

  • Webhook Request Rate β€” overall admission throughput
  • GPU Slice Mutations Rate β€” how many Pods per second get GPU sharing enabled
  • No-op Rate β€” Pods passing through unchanged
  • Error Rate β€” invalid annotation rejections (alert if non-zero)
  • Mutation Efficiency β€” fraction of requests resulting in a GPU slice (higher = more GPU sharing)

Import via kubectl (auto-loads if kube-prometheus-stack sidecar dashboards are enabled):

kubectl apply -f config/monitoring/grafana-dashboard.yaml

Manual import: Grafana → Dashboards → Import → paste the JSON from the ConfigMap's castslice-finops-dashboard.json key.

Prometheus Scrape Configuration

The cast-slice-metrics Service is deployed with standard Prometheus annotations (prometheus.io/scrape: "true") so node-based Prometheus auto-discovery picks it up automatically. No additional scrape config is required for most setups.
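For reference, the scrape annotations on the metrics Service look roughly like the sketch below. The `prometheus.io/scrape: "true"` annotation and the `:8080` port are stated above; the `prometheus.io/port` annotation, selector label, and exact Service spec are assumptions for illustration:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: cast-slice-metrics
  namespace: cast-slice
  annotations:
    prometheus.io/scrape: "true"   # stated above: enables Prometheus auto-discovery
    prometheus.io/port: "8080"     # assumption: matches the :8080 metrics port
spec:
  selector:
    app: cast-slice                # assumption: label used by the Deployment
  ports:
    - name: metrics
      port: 8080
      targetPort: 8080
```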


πŸ“ Project Structure

cast-slice/
├── main.go                          # Manager + webhook registration
├── TODOS.md                         # Planned improvements and deferred work
├── internal/
│   └── webhook/
│       ├── pod_webhook.go           # Mutating webhook handler
│       ├── metrics.go               # Prometheus counter definitions
│       └── pod_webhook_test.go      # Unit tests
├── config/
│   ├── deploy/deployment.yaml       # Namespace, SA, Deployment, Services (webhook + metrics)
│   ├── webhook/mutating_webhook.yaml # MutatingWebhookConfiguration
│   └── monitoring/
│       └── grafana-dashboard.yaml   # FinOps Grafana dashboard ConfigMap
└── docs/
    ├── local-testing.md             # How to test without a real GPU
    ├── node-mock.yaml               # Mock node labels
    └── test-pod.yaml                # Test Pod that triggers the webhook

🧪 Development

Build from Source

go build -o cast-slice .

Run Tests

go test ./...

Manual Deployment

# Apply workload manifests
kubectl apply -f config/deploy/deployment.yaml
kubectl apply -f config/webhook/mutating_webhook.yaml

Local Testing Without a GPU

See docs/local-testing.md for a step-by-step guide on mocking GPU capacity and validating webhook behavior.


πŸ— Roadmap

  • v0.1.0: Basic Mutating Webhook (Static Slicing).
  • v0.2.0: Smart Slicing (Dynamic ratios based on workload type).
  • v0.3.0: FinOps Dashboard (Live GPU utilization metrics).
  • v0.4.0: Policy Engine (Namespace-level and label-based rules).
  • v0.5.0: Multi-GPU Support (Cross-node GPU sharing).

🤝 Contributing

We're looking for FinOps-minded engineers to help optimize GPU infrastructure for the AI era.

  1. Fork the Project
  2. Create your Feature Branch (git checkout -b feature/AmazingFeature)
  3. Commit your Changes (git commit -m 'Add some AmazingFeature')
  4. Push to the Branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

📄 License

Distributed under the MIT License. See LICENSE for more information.


Built by CastOps - Engineering the Future of AI Infrastructure.
