-
Notifications
You must be signed in to change notification settings - Fork 4
Closed
Labels
component/metal-agentRelated to the Metal agent for macOSRelated to the Metal agent for macOSenhancementNew feature or requestNew feature or requestkind/featureNew feature or requestNew feature or requestpriority/mediumMedium priorityMedium priority
Description
Background
Apple Silicon's unified memory architecture requires explicit memory budgeting that doesn't apply to discrete NVIDIA GPUs. Users need a way to control how much of their system's unified memory LLMKube is allowed to consume for inference, especially on machines running other workloads.
Inspired by vllm-metal's VLLM_METAL_MEMORY_FRACTION approach and mistral.rs's adaptive RAM caps, this proposes adding first-class CRD support for memory budgeting on Metal.
Current State
- The
HardwareSpechasAcceleratorandGPUfields but no memory budget controls GPUSpechas aMemoryfield (string, e.g., "8Gi") but it's designed for NVIDIA resource requests, not unified memory budgetscatalog.yamlhasvram_estimateper model but it's informational only and not enforced- No way for users to express "use at most 75% of my system RAM for inference"
Proposed Changes
CRD Fields
Add to HardwareSpec in api/v1alpha1/model_types.go:
type HardwareSpec struct {
Accelerator string `json:"accelerator,omitempty"`
GPU *GPUSpec `json:"gpu,omitempty"`
// MemoryBudget sets an absolute memory limit for this model's inference process.
// Applies primarily to unified memory architectures (Metal).
// Example: "16Gi", "24Gi"
// +optional
MemoryBudget *resource.Quantity `json:"memoryBudget,omitempty"`
// MemoryFraction sets the maximum fraction of system memory available for
// inference on unified memory architectures (Metal). Range: 0.0-1.0.
// Defaults: 0.67 for systems ≤36GB, 0.75 for larger systems.
// Ignored when MemoryBudget is set (absolute takes precedence).
// +optional
MemoryFraction *float64 `json:"memoryFraction,omitempty"`
}Agent Integration
- Metal agent reads
memoryFraction/memoryBudgetfrom the Model's HardwareSpec - Agent flag
--memory-fractionserves as the global default - Per-model CRD values override the global default
- Pre-flight validation uses these values to determine if a model can be loaded
CLI Integration
llmkube deploy --memory-fraction 0.8for power users on dedicated machinesllmkube catalog info <model>shows estimated memory requirement vs available budgetllmkube statusshows current memory utilization when Metal services are running
Precedence Order
memoryBudgeton Model CRD (absolute, highest priority)memoryFractionon Model CRD--memory-fractionagent flag (global default)- Built-in adaptive default (0.67 for ≤36GB, 0.75 for >36GB)
Example Usage
apiVersion: inference.llmkube.dev/v1alpha1
kind: Model
metadata:
name: llama-3-1-8b
spec:
source: "https://huggingface.co/..."
hardware:
accelerator: metal
memoryFraction: 0.8 # Dedicated inference Mac, allow more memory
gpu:
layers: 32apiVersion: inference.llmkube.dev/v1alpha1
kind: Model
metadata:
name: qwen-coder-32b
spec:
source: "https://huggingface.co/..."
hardware:
accelerator: metal
memoryBudget: "24Gi" # Hard cap regardless of system size
gpu:
layers: 64References
api/v1alpha1/model_types.go— HardwareSpec definitionpkg/agent/executor.go— reads GPU config for process spawningpkg/agent/agent.go— agent configurationcmd/metal-agent/main.go— agent CLI flags
Reactions are currently unavailable
Metadata
Metadata
Assignees
Labels
component/metal-agentRelated to the Metal agent for macOSRelated to the Metal agent for macOSenhancementNew feature or requestNew feature or requestkind/featureNew feature or requestNew feature or requestpriority/mediumMedium priorityMedium priority