Metal agent: pre-flight memory validation before spawning llama-server #185

@Defilan

Description

Background

The Metal agent currently spawns llama-server processes with no awareness of available system memory. On Apple Silicon, unified memory is shared between CPU, GPU, Neural Engine, and all other processes — unlike discrete NVIDIA GPUs with isolated VRAM.

If a user deploys a large model (e.g., Qwen 32B at ~20-24GB) on a 32GB Mac that is also running other applications, there are no guard rails to prevent memory pressure, degraded performance, or macOS force-killing the process when it exceeds the wired-memory limit.

Relevant prior art from vllm-metal's memory allocation research:

  • vLLM reserves 90% of GPU memory at startup — dangerous on shared unified memory
  • mistral.rs caps at 2/3 of system RAM for systems ≤36GB, 3/4 for larger systems
  • llama.cpp (our backend) has no system-wide memory awareness at all

Current State

  • executor.go spawns llama-server with --n-gpu-layers and --ctx-size but performs no memory checks
  • We already parse GGUF metadata (layer count, quantization, context length) in model_controller.go — this data could be used to estimate memory requirements
  • catalog.yaml has vram_estimate fields per model but they're informational only
  • No detection of available system memory before process start
  • No feedback to users when a model is too large for their hardware

Proposed Work

Pre-flight Memory Check

  • Query total and available system memory at agent startup and before each process spawn
  • Estimate model memory requirements from GGUF metadata (weights + KV cache for requested context size)
  • Compare estimated requirements against available memory with a configurable safety margin
  • Refuse to start with a clear error message if estimated usage exceeds the budget
  • Surface the memory check result in InferenceService status conditions

Configurable Memory Budget

  • Add --memory-fraction flag to Metal agent (default: 0.67 for ≤36GB systems, 0.75 for larger — following mistral.rs heuristics)
  • Allow override via InferenceService annotation or CRD field
  • Log effective memory budget at startup
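The default-fraction heuristic above is small enough to sketch directly. The function name is hypothetical; the thresholds follow the mistral.rs heuristic cited in the Background (2/3 of RAM for systems ≤36GB, 3/4 for larger):

```go
package main

import "fmt"

// defaultMemoryFraction returns the default --memory-fraction value
// when the user has not overridden it. Hypothetical helper.
func defaultMemoryFraction(totalBytes uint64) float64 {
	const threshold = 36 << 30 // 36GB boundary from the mistral.rs heuristic
	if totalBytes <= threshold {
		return 0.67 // 2/3 of system RAM
	}
	return 0.75 // 3/4 of system RAM
}

func main() {
	fmt.Println(defaultMemoryFraction(32 << 30)) // 32GB Mac
	fmt.Println(defaultMemoryFraction(64 << 30)) // 64GB Mac
}
```

An InferenceService annotation or CRD field would simply replace this default, and the effective value would be logged at startup.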

Memory Estimation

  • Use GGUF metadata already available: quantization type, layer count, embedding size
  • Estimate KV cache memory from context size and model dimensions
  • Account for llama.cpp overhead (~500MB baseline)
  • Validate estimates against catalog vram_estimate values
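The estimation steps above can be sketched as a single formula: weights + KV cache + fixed overhead. This is a simplification under stated assumptions — f16 KV cache (2 bytes per element), no GQA adjustment, weights taken as the GGUF file size — and the function name and inputs are illustrative, not repo APIs:

```go
package main

import "fmt"

// estimateMemory approximates the memory a llama-server process will need:
// model weights + KV cache + ~500MB llama.cpp baseline overhead.
// KV cache ≈ 2 (K and V) × layers × ctxSize × embeddingDim × bytes/element,
// assuming an f16 cache (2 bytes) and no grouped-query attention adjustment.
func estimateMemory(weightBytes, layers, ctxSize, embeddingDim uint64) uint64 {
	const kvBytesPerElem = 2   // f16 KV cache
	const overhead = 500 << 20 // ~500MB llama.cpp baseline
	kvCache := 2 * layers * ctxSize * embeddingDim * kvBytesPerElem
	return weightBytes + kvCache + overhead
}

func main() {
	// Illustrative inputs loosely in the range of a 32B Q4_K_M model:
	// ~19GB of weights, 64 layers, 4096 context, 5120 embedding dim.
	est := estimateMemory(19<<30, 64, 4096, 5120)
	fmt.Printf("estimated requirement: ~%.1f GB\n", float64(est)/(1<<30))
}
```

Cross-checking results like this against the catalog's vram_estimate values would catch cases where the formula drifts from observed usage.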

Example Behavior

$ llmkube deploy qwen-coder-32b --accelerator metal
Error: insufficient memory for qwen-coder-32b (Q4_K_M)
  Estimated requirement: ~22GB
  Available budget: 18GB (32GB system × 0.67 fraction − 3.4GB in use)
  Suggestion: try a smaller quantization, reduce --ctx-size, or free system memory

References

  • pkg/agent/executor.go — process spawning
  • internal/controller/model_controller.go — GGUF metadata parsing
  • pkg/cli/catalog.yaml — vram_estimate fields
  • pkg/gguf/parser.go — GGUF file parsing
