
Metal agent: runtime memory pressure monitoring and eviction protection #186

@Defilan

Description


Background

macOS manages unified memory dynamically and distinguishes between pageable and wired (pinned) memory. Metal GPU buffers used by llama-server are wired memory. Under memory pressure, the kernel's wired collector mechanism can evict GPU buffers, causing severe performance degradation or process crashes, with no warning to the Metal agent.

The agent currently has no visibility into memory pressure and cannot react to it. A process that was healthy at startup can silently degrade as other applications compete for unified memory.

This issue was informed by research on vllm-metal's memory allocation strategy, which highlights the wired collector problem on Apple Silicon.

Implementation Plan

This is being implemented in multiple PRs to keep reviews manageable:

PR A: Memory pressure detection + metrics (Phase 1)

  • Extend MemoryProvider with WiredMemory() and ProcessRSS()
  • Add MemoryPressureLevel type (Normal/Warning/Critical)
  • Create MemoryWatchdog goroutine (configurable interval, thresholds)
  • Add 5 new Prometheus metrics (available, wired, RSS, pressure level, evictions)
  • Add CLI flags (--memory-watchdog-interval, --memory-pressure-warning, --memory-pressure-critical, --eviction-enabled)
  • Unit tests for watchdog, pressure detection, RSS tracking
  • PR: feat: add memory pressure watchdog with runtime monitoring #216
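The core of PR A is the pressure classification. A minimal sketch in Go, assuming the thresholds are expressed as "fraction of memory still available" (the names `classifyPressure`, `availableFrac`, `warningFrac`, and `criticalFrac` are illustrative, not the agent's actual API):

```go
package main

// MemoryPressureLevel mirrors the three states named in PR A
// (Normal/Warning/Critical); the integer representation is an assumption.
type MemoryPressureLevel int

const (
	PressureNormal MemoryPressureLevel = iota
	PressureWarning
	PressureCritical
)

// classifyPressure maps the fraction of system memory still available to a
// pressure level. warningFrac and criticalFrac stand in for the values of
// the --memory-pressure-warning and --memory-pressure-critical flags; their
// exact semantics in the agent may differ.
func classifyPressure(availableFrac, warningFrac, criticalFrac float64) MemoryPressureLevel {
	switch {
	case availableFrac <= criticalFrac:
		return PressureCritical
	case availableFrac <= warningFrac:
		return PressureWarning
	default:
		return PressureNormal
	}
}
```

With warning at 10% and critical at 5% available, 50% available classifies as Normal, 8% as Warning, and 3% as Critical.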

PR B: Proactive eviction + status reporting (Phase 2)

  • Eviction logic in the onPressure callback (lowest-priority process evicted first)
  • Add MemoryPressure condition to InferenceService status
  • Emit Kubernetes events on pressure detection and eviction
  • Add /memstats endpoint to health server
  • Unit tests for eviction logic and status updates
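The "lowest-priority process evicted first" rule in PR B amounts to a selection over the agent's process table. A hedged sketch, where `managedProc` and its `Priority` field are hypothetical stand-ins for whatever the agent actually tracks:

```go
package main

// managedProc is a hypothetical stand-in for the agent's process-table
// entry; the real type and its priority semantics will differ.
type managedProc struct {
	Name     string
	Priority int // assumption: lower value = lower priority, evicted first
}

// pickEvictionCandidate returns the lowest-priority process, implementing
// the "lowest-priority process evicted first" rule from PR B, and false
// when there is nothing to evict.
func pickEvictionCandidate(procs []managedProc) (managedProc, bool) {
	if len(procs) == 0 {
		return managedProc{}, false
	}
	best := procs[0]
	for _, p := range procs[1:] {
		if p.Priority < best.Priority {
			best = p
		}
	}
	return best, true
}
```

The `onPressure` callback would call this, gracefully stop the chosen llama-server, and then emit the Kubernetes event and status update described above.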

Current State

  • Agent polls health every 5 seconds via the K8s watcher, but this only checks for InferenceService changes
  • Healthy flag is set once at startup and never re-evaluated (tracked in Metal agent: health checks, backpressure, and observability #171)
  • No monitoring of system memory pressure or Metal buffer eviction
  • No graceful degradation: if macOS kills a llama-server process, the agent does not find out, because no continuous health check exists today
  • No memory-related status information surfaced to Kubernetes

Proposed Work

Memory Pressure Monitoring

  • Periodically query system memory stats (vm_stat / host_statistics64) during agent runtime
  • Track wired memory usage trends and detect pressure (e.g., pageouts increasing, free memory below threshold)
  • Detect when managed llama-server processes are consuming more memory than expected
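One cheap way to sample these stats is to parse `vm_stat` output; the real `MemoryProvider` may call `host_statistics64` directly instead. A sketch (the function name and return shape are illustrative):

```go
package main

import (
	"strconv"
	"strings"
)

// parseVMStat extracts raw page counts from vm_stat(1) output lines such as
//
//	Pages wired down:                  123456.
//
// returning a map keyed by the stat name. Lines that do not end in a bare
// number (e.g. the page-size header) are skipped.
func parseVMStat(out string) map[string]uint64 {
	stats := make(map[string]uint64)
	for _, line := range strings.Split(out, "\n") {
		key, val, ok := strings.Cut(line, ":")
		if !ok {
			continue
		}
		val = strings.TrimSuffix(strings.TrimSpace(val), ".")
		if n, err := strconv.ParseUint(val, 10, 64); err == nil {
			stats[strings.TrimSpace(key)] = n
		}
	}
	return stats
}
```

The counts are pages, not bytes; the watchdog would multiply by the page size reported in the `vm_stat` header (16384 on Apple Silicon) and feed the result into the pressure classification.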

Proactive Protection

  • When memory pressure is detected, update InferenceService status with a MemoryPressure condition
  • Optionally gracefully stop lowest-priority inference processes before macOS force-kills them
  • Emit Kubernetes events when memory pressure is detected or when a process is evicted
  • Consider implementing a "memory watchdog" goroutine in the agent

Status Reporting

  • Surface a MemoryPressure condition on the InferenceService status
  • Expose current memory statistics via the /memstats health-server endpoint
  • Export the new Prometheus metrics (available, wired, RSS, pressure level, evictions)

Documentation

  • Document the wired collector behavior and its impact on Metal inference
  • Provide guidance on disabling the wired collector for dedicated inference machines (sysctl settings)
  • Add troubleshooting section for memory-pressure-related failures

