Open
Labels
component/metal-agent, enhancement, kind/feature, priority/medium
Description
Background
macOS manages unified memory dynamically and distinguishes between pageable and wired (pinned) memory. Metal GPU buffers used by llama-server are wired memory. Under memory pressure, macOS's wired collector kernel mechanism can evict GPU buffers, causing severe performance degradation or process crashes — with no warning to the Metal agent.
The agent currently has no visibility into memory pressure and cannot react to it. A process that was healthy at startup can silently degrade as other applications compete for unified memory.
This issue was informed by research on vllm-metal's memory allocation strategy, which highlights the wired collector problem on Apple Silicon.
Implementation Plan
This is being implemented in multiple PRs to keep reviews manageable:
PR A: Memory pressure detection + metrics (Phase 1)
- Extend `MemoryProvider` with `WiredMemory()` and `ProcessRSS()`
- Add `MemoryPressureLevel` type (Normal/Warning/Critical)
- Create `MemoryWatchdog` goroutine (configurable interval, thresholds)
- Add 5 new Prometheus metrics (available, wired, RSS, pressure level, evictions)
- Add CLI flags (`--memory-watchdog-interval`, `--memory-pressure-warning`, `--memory-pressure-critical`, `--eviction-enabled`)
- Unit tests for watchdog, pressure detection, RSS tracking
- PR: feat: add memory pressure watchdog with runtime monitoring #216
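As a rough illustration of the `MemoryPressureLevel` type and threshold-based detection listed above, here is a minimal sketch. The `Thresholds` struct, the `Classify` function, and the choice to express thresholds as the fraction of memory still available are assumptions made for this example; they are not taken from PR #216.

```go
package main

import "fmt"

// MemoryPressureLevel mirrors the Normal/Warning/Critical levels named in the plan.
type MemoryPressureLevel int

const (
	PressureNormal MemoryPressureLevel = iota
	PressureWarning
	PressureCritical
)

// String returns the human-readable name of the level.
func (l MemoryPressureLevel) String() string {
	switch l {
	case PressureNormal:
		return "Normal"
	case PressureWarning:
		return "Warning"
	case PressureCritical:
		return "Critical"
	default:
		return fmt.Sprintf("MemoryPressureLevel(%d)", int(l))
	}
}

// Thresholds is a hypothetical config type: each field is the fraction of
// total memory that must remain available before that level is reported.
type Thresholds struct {
	Warning  float64 // e.g. 0.20 means "warn below 20% available"
	Critical float64 // e.g. 0.10 means "critical below 10% available"
}

// Classify maps an available/total reading to a pressure level.
// An unusable reading (total == 0) is treated as Normal so that a failed
// sample never triggers eviction on its own (an assumption of this sketch).
func Classify(availableBytes, totalBytes uint64, t Thresholds) MemoryPressureLevel {
	if totalBytes == 0 {
		return PressureNormal
	}
	frac := float64(availableBytes) / float64(totalBytes)
	switch {
	case frac < t.Critical:
		return PressureCritical
	case frac < t.Warning:
		return PressureWarning
	default:
		return PressureNormal
	}
}
```

The two threshold fields would presumably be fed from `--memory-pressure-warning` and `--memory-pressure-critical`, though the units those flags actually use are not stated in this issue.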
PR B: Proactive eviction + status reporting (Phase 2)
- Eviction logic in the `onPressure` callback (lowest-priority process evicted first)
- Add `MemoryPressure` condition to InferenceService status
- Emit Kubernetes events on pressure detection and eviction
- Add `/memstats` endpoint to health server
- Unit tests for eviction logic and status updates
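The "lowest-priority process evicted first" rule could be sketched as below. `ManagedProcess`, its fields, the convention that a higher `Priority` value means more important, and the RSS tie-break are all assumptions for illustration; the issue does not specify how priority is represented or how ties are resolved.

```go
package main

import "sort"

// ManagedProcess is a hypothetical view of one managed llama-server process.
type ManagedProcess struct {
	Name     string
	Priority int    // assumed convention: higher value = more important
	RSSBytes uint64 // resident set size, as a stand-in for reclaimable memory
}

// EvictionOrder returns a copy of procs sorted from first-to-evict to
// last-to-evict: lowest priority first, and within the same priority the
// process with the larger RSS first.
func EvictionOrder(procs []ManagedProcess) []ManagedProcess {
	out := make([]ManagedProcess, len(procs))
	copy(out, procs)
	sort.SliceStable(out, func(i, j int) bool {
		if out[i].Priority != out[j].Priority {
			return out[i].Priority < out[j].Priority
		}
		return out[i].RSSBytes > out[j].RSSBytes
	})
	return out
}

// SelectVictims walks the eviction order and picks processes until stopping
// them would reclaim at least needBytes of RSS. It returns nil when
// needBytes is zero.
func SelectVictims(procs []ManagedProcess, needBytes uint64) []ManagedProcess {
	var victims []ManagedProcess
	var freed uint64
	for _, p := range EvictionOrder(procs) {
		if freed >= needBytes {
			break
		}
		victims = append(victims, p)
		freed += p.RSSBytes
	}
	return victims
}
```

Keeping victim selection as a pure function, separate from the code that actually stops processes, makes the eviction policy easy to unit test without spawning anything.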
Current State
- Agent polls health every 5 seconds via the K8s watcher, but this only checks for InferenceService changes
- `Healthy` flag is set once at startup and never re-evaluated (tracked in Metal agent: health checks, backpressure, and observability #171)
- No monitoring of system memory pressure or Metal buffer eviction
- No graceful degradation: if macOS kills a `llama-server` process, the agent doesn't know until the next health check (and no continuous health check exists today)
- No memory-related status information surfaced to Kubernetes
Proposed Work
Memory Pressure Monitoring
- Periodically query system memory stats (`vm_stat` / `host_statistics64`) during agent runtime
- Track wired memory usage trends and detect pressure (e.g., pageouts increasing, free memory below threshold)
- Detect when managed `llama-server` processes are consuming more memory than expected
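One way to obtain the counters mentioned above is to parse the text output of `vm_stat`. The sketch below assumes the usual output shape (a header line containing `page size of N bytes`, followed by `Key: value.` lines) and extracts only three counters. `VMStats`, `ParseVMStat`, and `PageoutsIncreased` are hypothetical names, and an actual implementation might call `host_statistics64` instead of shelling out.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// VMStats holds the few vm_stat counters this sketch cares about.
type VMStats struct {
	PageSize   uint64 // bytes per page, taken from the header line
	FreePages  uint64 // "Pages free"
	WiredPages uint64 // "Pages wired down"
	Pageouts   uint64 // "Pageouts" (cumulative counter)
}

// FreeBytes converts the free page count to bytes.
func (s VMStats) FreeBytes() uint64 { return s.FreePages * s.PageSize }

// WiredBytes converts the wired page count to bytes.
func (s VMStats) WiredBytes() uint64 { return s.WiredPages * s.PageSize }

// ParseVMStat parses vm_stat text output. Unknown or non-numeric lines are
// skipped; a missing page size is reported as an error.
func ParseVMStat(output string) (VMStats, error) {
	const marker = "page size of "
	var s VMStats
	sc := bufio.NewScanner(strings.NewReader(output))
	for sc.Scan() {
		line := sc.Text()
		if i := strings.Index(line, marker); i >= 0 {
			fields := strings.Fields(line[i+len(marker):])
			if len(fields) == 0 {
				return s, fmt.Errorf("malformed header: %q", line)
			}
			n, err := strconv.ParseUint(fields[0], 10, 64)
			if err != nil {
				return s, fmt.Errorf("bad page size in %q: %w", line, err)
			}
			s.PageSize = n
			continue
		}
		key, val, ok := strings.Cut(line, ":")
		if !ok {
			continue
		}
		// Values are right-aligned and end with a period, e.g. "   10337."
		val = strings.TrimSuffix(strings.TrimSpace(val), ".")
		n, err := strconv.ParseUint(val, 10, 64)
		if err != nil {
			continue
		}
		switch strings.Trim(strings.TrimSpace(key), `"`) {
		case "Pages free":
			s.FreePages = n
		case "Pages wired down":
			s.WiredPages = n
		case "Pageouts":
			s.Pageouts = n
		}
	}
	if err := sc.Err(); err != nil {
		return s, err
	}
	if s.PageSize == 0 {
		return s, fmt.Errorf("page size not found in vm_stat output")
	}
	return s, nil
}

// PageoutsIncreased reports whether the cumulative pageout counter grew
// between two samples, one of the pressure signals suggested above.
func PageoutsIncreased(prev, cur VMStats) bool {
	return cur.Pageouts > prev.Pageouts
}
```

The output string would typically come from something like `exec.Command("vm_stat").Output()`; keeping the parser separate from the command invocation lets it be tested with canned input on any platform.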
Proactive Protection
- When memory pressure is detected, update InferenceService status with a `MemoryPressure` condition
- Optionally stop the lowest-priority inference processes gracefully before macOS force-kills them
- Emit Kubernetes events when memory pressure is detected or when a process is evicted
- Consider implementing a "memory watchdog" goroutine in the agent
Status Reporting
- Add memory utilization to the agent's `/metrics` endpoint (ties into Metal agent: health checks, backpressure, and observability #171)
- Report per-process memory usage (RSS, wired) in InferenceService status
- Surface system-level memory availability in agent health endpoint
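For the health-endpoint side of status reporting, a JSON handler along these lines is one possible shape for the `/memstats` endpoint mentioned in the plan. The `MemStats` and `ProcessMem` types and their JSON field names are invented for this sketch; the real payload is not specified in this issue. Only the standard library is used.

```go
package main

import (
	"encoding/json"
	"net/http"
)

// ProcessMem is a hypothetical per-process entry.
type ProcessMem struct {
	Name     string `json:"name"`
	RSSBytes uint64 `json:"rssBytes"`
}

// MemStats is a hypothetical snapshot of system and per-process memory.
type MemStats struct {
	AvailableBytes uint64       `json:"availableBytes"`
	WiredBytes     uint64       `json:"wiredBytes"`
	PressureLevel  string       `json:"pressureLevel"`
	Processes      []ProcessMem `json:"processes"`
}

// EncodeMemStats serializes a snapshot to JSON.
func EncodeMemStats(s MemStats) ([]byte, error) {
	return json.Marshal(s)
}

// MemStatsHandler returns an http.Handler that writes the snapshot produced
// by the supplied function on every request.
func MemStatsHandler(snapshot func() MemStats) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		body, err := EncodeMemStats(snapshot())
		if err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}
		w.Header().Set("Content-Type", "application/json")
		_, _ = w.Write(body)
	})
}
```

It could be registered on the existing health server with something like `mux.Handle("/memstats", MemStatsHandler(getSnapshot))`, where `getSnapshot` is whatever accessor the watchdog exposes.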
Documentation
- Document the wired collector behavior and its impact on Metal inference
- Provide guidance on disabling the wired collector for dedicated inference machines (`sysctl` settings)
- Add troubleshooting section for memory-pressure-related failures
References
- Metal agent: health checks, backpressure, and observability #171: Metal agent health checks and observability (complementary)
- `pkg/agent/agent.go`: agent main loop
- `pkg/agent/executor.go`: process management
- `pkg/agent/watcher.go`: 5-second polling loop