
Metal agent: runtime memory pressure monitoring and eviction protection #186

@Defilan

Description


Background

macOS manages unified memory dynamically and distinguishes between pageable and wired (pinned) memory. Metal GPU buffers used by llama-server are wired memory. Under memory pressure, the kernel's wired collector mechanism can evict GPU buffers, causing severe performance degradation or process crashes, with no warning to the Metal agent.

The agent currently has no visibility into memory pressure and cannot react to it. A process that was healthy at startup can silently degrade as other applications compete for unified memory.

This issue was informed by research on vllm-metal's memory allocation strategy, which highlights the wired collector problem on Apple Silicon.

Implementation Plan

This is being implemented in multiple PRs to keep reviews manageable:

PR A: Memory pressure detection + metrics (Phase 1)

  • Extend MemoryProvider with WiredMemory() and ProcessRSS()
  • Add MemoryPressureLevel type (Normal/Warning/Critical)
  • Create MemoryWatchdog goroutine (configurable interval, thresholds)
  • Add 5 new Prometheus metrics (available, wired, RSS, pressure level, evictions)
  • Add CLI flags (--memory-watchdog-interval, --memory-pressure-warning, --memory-pressure-critical, --eviction-enabled)
  • Unit tests for watchdog, pressure detection, RSS tracking
  • PR: feat: add memory pressure watchdog with runtime monitoring #216
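The core of PR A is the pressure classification. A minimal sketch in Go, assuming the thresholds are expressed as "fraction of memory still available" (the names `classifyPressure`, `availableFrac`, `warningFrac`, and `criticalFrac` are illustrative, not the agent's actual API):

```go
package main

// MemoryPressureLevel mirrors the three states named in PR A
// (Normal/Warning/Critical); the integer representation is an assumption.
type MemoryPressureLevel int

const (
	PressureNormal MemoryPressureLevel = iota
	PressureWarning
	PressureCritical
)

// classifyPressure maps the fraction of system memory still available to a
// pressure level. warningFrac and criticalFrac stand in for the values of
// the --memory-pressure-warning and --memory-pressure-critical flags; their
// exact semantics in the agent may differ.
func classifyPressure(availableFrac, warningFrac, criticalFrac float64) MemoryPressureLevel {
	switch {
	case availableFrac <= criticalFrac:
		return PressureCritical
	case availableFrac <= warningFrac:
		return PressureWarning
	default:
		return PressureNormal
	}
}
```

With warning at 10% and critical at 5% available, 50% available classifies as Normal, 8% as Warning, and 3% as Critical.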

PR B: Proactive eviction + status reporting (Phase 2)

  • Eviction logic in the onPressure callback (lowest-priority process evicted first)
  • Add MemoryPressure condition to InferenceService status
  • Emit Kubernetes events on pressure detection and eviction
  • Add /memstats endpoint to health server
  • Unit tests for eviction logic and status updates
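The "lowest-priority process evicted first" rule in PR B amounts to a selection over the agent's process table. A hedged sketch, where `managedProc` and its `Priority` field are hypothetical stand-ins for whatever the agent actually tracks:

```go
package main

// managedProc is a hypothetical stand-in for the agent's process-table
// entry; the real type and its priority semantics will differ.
type managedProc struct {
	Name     string
	Priority int // assumption: lower value = lower priority, evicted first
}

// pickEvictionCandidate returns the lowest-priority process, implementing
// the "lowest-priority process evicted first" rule from PR B, and false
// when there is nothing to evict.
func pickEvictionCandidate(procs []managedProc) (managedProc, bool) {
	if len(procs) == 0 {
		return managedProc{}, false
	}
	best := procs[0]
	for _, p := range procs[1:] {
		if p.Priority < best.Priority {
			best = p
		}
	}
	return best, true
}
```

The `onPressure` callback would call this, gracefully stop the chosen llama-server, and then emit the Kubernetes event and status update described above.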

Current State

  • Agent polls health every 5 seconds via the K8s watcher, but this only checks for InferenceService changes
  • Healthy flag is set once at startup and never re-evaluated (tracked in Metal agent: health checks, backpressure, and observability #171)
  • No monitoring of system memory pressure or Metal buffer eviction
  • No graceful degradation: if macOS kills a llama-server process, the agent does not find out, because no continuous health check exists today
  • No memory-related status information surfaced to Kubernetes

Proposed Work

Memory Pressure Monitoring

  • Periodically query system memory stats (vm_stat / host_statistics64) during agent runtime
  • Track wired memory usage trends and detect pressure (e.g., pageouts increasing, free memory below threshold)
  • Detect when managed llama-server processes are consuming more memory than expected
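One cheap way to sample these stats is to parse `vm_stat` output; the real `MemoryProvider` may call `host_statistics64` directly instead. A sketch (the function name and return shape are illustrative):

```go
package main

import (
	"strconv"
	"strings"
)

// parseVMStat extracts raw page counts from vm_stat(1) output lines such as
//
//	Pages wired down:                  123456.
//
// returning a map keyed by the stat name. Lines that do not end in a bare
// number (e.g. the page-size header) are skipped.
func parseVMStat(out string) map[string]uint64 {
	stats := make(map[string]uint64)
	for _, line := range strings.Split(out, "\n") {
		key, val, ok := strings.Cut(line, ":")
		if !ok {
			continue
		}
		val = strings.TrimSuffix(strings.TrimSpace(val), ".")
		if n, err := strconv.ParseUint(val, 10, 64); err == nil {
			stats[strings.TrimSpace(key)] = n
		}
	}
	return stats
}
```

The counts are pages, not bytes; the watchdog would multiply by the page size reported in the `vm_stat` header (16384 on Apple Silicon) and feed the result into the pressure classification.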

Proactive Protection

  • When memory pressure is detected, update InferenceService status with a MemoryPressure condition
  • Optionally gracefully stop lowest-priority inference processes before macOS force-kills them
  • Emit Kubernetes events when memory pressure is detected or when a process is evicted
  • Consider implementing a "memory watchdog" goroutine in the agent

Status Reporting

  • Surface a MemoryPressure condition on the InferenceService status
  • Expose current memory statistics via the /memstats health-server endpoint
  • Export the new Prometheus metrics (available, wired, RSS, pressure level, evictions)

Documentation

  • Document the wired collector behavior and its impact on Metal inference
  • Provide guidance on disabling the wired collector for dedicated inference machines (sysctl settings)
  • Add troubleshooting section for memory-pressure-related failures

