Labels: component/metal-agent, enhancement, kind/feature, priority/high
Description
Background
The Metal agent currently spawns llama-server processes with no awareness of available system memory. On Apple Silicon, unified memory is shared between CPU, GPU, Neural Engine, and all other processes — unlike discrete NVIDIA GPUs with isolated VRAM.
If a user deploys a large model (e.g., Qwen 32B at ~20-24GB) on a 32GB Mac running other applications, there are no guard rails to prevent memory pressure, degraded performance, or macOS force-killing the process via the wired collector.
Relevant prior art from vllm-metal's memory allocation research:
- vLLM reserves 90% of GPU memory at startup — dangerous on shared unified memory
- mistral.rs caps at 2/3 of system RAM for systems ≤36GB, 3/4 for larger systems
- llama.cpp (our backend) has no system-wide memory awareness at all
Current State
- `executor.go` spawns `llama-server` with `--n-gpu-layers` and `--ctx-size` but performs no memory checks
- We already parse GGUF metadata (layer count, quantization, context length) in `model_controller.go`; this data could be used to estimate memory requirements
- `catalog.yaml` has `vram_estimate` fields per model, but they are informational only
- No detection of available system memory before process start
- No feedback to users when a model is too large for their hardware
Proposed Work
Pre-flight Memory Check
- Query total and available system memory at agent startup and before each process spawn
- Estimate model memory requirements from GGUF metadata (weights + KV cache for requested context size)
- Compare estimated requirements against available memory with a configurable safety margin
- Refuse to start with a clear error message if estimated usage exceeds the budget
- Surface the memory check result in InferenceService status conditions
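The gate itself can be small. A minimal sketch of the refusal path (function and parameter names here are hypothetical, not the agent's current API; the estimate and budget inputs would come from the GGUF-based estimation and the memory-fraction logic described in this issue):

```go
package main

import "fmt"

// preflightCheck compares an estimated model footprint against the effective
// memory budget and refuses to spawn llama-server when the estimate does not
// fit. Names are illustrative only.
func preflightCheck(model string, estimatedBytes, budgetBytes uint64) error {
	if estimatedBytes > budgetBytes {
		return fmt.Errorf(
			"insufficient memory for %s: estimated ~%.1fGB exceeds budget %.1fGB",
			model,
			float64(estimatedBytes)/1e9,
			float64(budgetBytes)/1e9,
		)
	}
	return nil
}

func main() {
	// A 22GB estimate against an 18GB budget should be rejected.
	err := preflightCheck("qwen-coder-32b", 22_000_000_000, 18_000_000_000)
	fmt.Println(err)
}
```

The returned error would be surfaced both on the CLI and in the InferenceService status condition.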
Configurable Memory Budget
- Add `--memory-fraction` flag to Metal agent (default: 0.67 for ≤36GB systems, 0.75 for larger, following mistral.rs heuristics)
- Allow override via InferenceService annotation or CRD field
- Log effective memory budget at startup
Memory Estimation
- Use GGUF metadata already available: quantization type, layer count, embedding size
- Estimate KV cache memory from context size and model dimensions
- Account for llama.cpp overhead (~500MB baseline)
- Validate estimates against catalog `vram_estimate` values
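The KV-cache term follows directly from dimensions already in the GGUF header. A sketch assuming llama-style attention with a grouped-query (GQA) KV head count; the specific config values in `main` are illustrative, not taken from a real model file:

```go
package main

import "fmt"

// kvCacheBytes estimates KV cache memory for llama-style attention: one K
// and one V tensor per layer, each of shape ctx × n_kv_heads × head_dim.
// The real inputs would come from the GGUF metadata we already parse.
func kvCacheBytes(nLayers, ctxSize, nKVHeads, headDim, bytesPerElem int) uint64 {
	perLayerPerTensor := uint64(ctxSize) * uint64(nKVHeads) *
		uint64(headDim) * uint64(bytesPerElem)
	return 2 * uint64(nLayers) * perLayerPerTensor // ×2 for K and V
}

func main() {
	// Illustrative 32B-class config: 64 layers, 8 KV heads, head dim 128,
	// fp16 cache (2 bytes/element), 8192-token context.
	bytes := kvCacheBytes(64, 8192, 8, 128, 2)
	fmt.Printf("KV cache: %.1f GiB\n", float64(bytes)/float64(1<<30))
}
```

Total estimate would then be weights (from file size or quantization type) + KV cache + the ~500MB llama.cpp baseline, cross-checked against the catalog `vram_estimate`.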
Example Behavior
```
$ llmkube deploy qwen-coder-32b --accelerator metal
Error: insufficient memory for qwen-coder-32b (Q4_K_M)
  Estimated requirement: ~22GB
  Available budget: 18GB (32GB system × 0.67 fraction − 3.4GB in use)
  Suggestion: try a smaller quantization, reduce --ctx-size, or free system memory
```
References
- `pkg/agent/executor.go`: process spawning
- `internal/controller/model_controller.go`: GGUF metadata parsing
- `pkg/cli/catalog.yaml`: vram_estimate fields
- `pkg/gguf/parser.go`: GGUF file parsing