Add memoryFraction and memoryBudget fields to CRD for unified memory control #187

@Defilan

Description

Background

Apple Silicon's unified memory architecture requires explicit memory budgeting that doesn't apply to discrete NVIDIA GPUs. Users need a way to control how much of their system's unified memory LLMKube is allowed to consume for inference, especially on machines running other workloads.

Inspired by vllm-metal's VLLM_METAL_MEMORY_FRACTION approach and mistral.rs's adaptive RAM caps, this issue proposes adding first-class CRD support for memory budgeting on Metal.

Current State

  • The HardwareSpec has Accelerator and GPU fields but no memory budget controls
  • GPUSpec has a Memory field (string, e.g., "8Gi") but it's designed for NVIDIA resource requests, not unified memory budgets
  • catalog.yaml has vram_estimate per model but it's informational only and not enforced
  • No way for users to express "use at most 75% of my system RAM for inference"

Proposed Changes

CRD Fields

Add to HardwareSpec in api/v1alpha1/model_types.go:

type HardwareSpec struct {
    Accelerator string   `json:"accelerator,omitempty"`
    GPU         *GPUSpec `json:"gpu,omitempty"`

    // MemoryBudget sets an absolute memory limit for this model's inference process.
    // Applies primarily to unified memory architectures (Metal).
    // Example: "16Gi", "24Gi"
    // +optional
    MemoryBudget *resource.Quantity `json:"memoryBudget,omitempty"`

    // MemoryFraction sets the maximum fraction of system memory available for
    // inference on unified memory architectures (Metal). Range: 0.0-1.0.
    // Defaults: 0.67 for systems ≤36GB, 0.75 for larger systems.
    // Ignored when MemoryBudget is set (absolute takes precedence).
    // +optional
    MemoryFraction *float64 `json:"memoryFraction,omitempty"`
}

Agent Integration

  • Metal agent reads memoryFraction / memoryBudget from the Model's HardwareSpec
  • Agent flag --memory-fraction serves as the global default
  • Per-model CRD values override the global default
  • Pre-flight validation uses these values to determine if a model can be loaded

CLI Integration

  • llmkube deploy --memory-fraction 0.8 for power users on dedicated machines
  • llmkube catalog info <model> shows estimated memory requirement vs available budget
  • llmkube status shows current memory utilization when Metal services are running

Precedence Order

  1. memoryBudget on Model CRD (absolute, highest priority)
  2. memoryFraction on Model CRD
  3. --memory-fraction agent flag (global default)
  4. Built-in adaptive default (0.67 for ≤36GB, 0.75 for >36GB)

Example Usage

apiVersion: inference.llmkube.dev/v1alpha1
kind: Model
metadata:
  name: llama-3-1-8b
spec:
  source: "https://huggingface.co/..."
  hardware:
    accelerator: metal
    memoryFraction: 0.8  # Dedicated inference Mac, allow more memory
    gpu:
      layers: 32
---
apiVersion: inference.llmkube.dev/v1alpha1
kind: Model
metadata:
  name: qwen-coder-32b
spec:
  source: "https://huggingface.co/..."
  hardware:
    accelerator: metal
    memoryBudget: "24Gi"  # Hard cap regardless of system size
    gpu:
      layers: 64

References

  • api/v1alpha1/model_types.go — HardwareSpec definition
  • pkg/agent/executor.go — reads GPU config for process spawning
  • pkg/agent/agent.go — agent configuration
  • cmd/metal-agent/main.go — agent CLI flags
