Add memoryFraction and memoryBudget fields to CRD for unified memory control #187

## Background

On Apple Silicon, the CPU and GPU share a single pool of unified memory, so inference needs an explicit memory budget in a way that discrete NVIDIA GPUs with dedicated VRAM do not. Users need a way to control how much of their system's unified memory LLMKube may consume for inference, especially on machines that also run other workloads.

Inspired by vllm-metal's `VLLM_METAL_MEMORY_FRACTION` approach and mistral.rs's adaptive RAM caps, this issue proposes adding first-class CRD support for memory budgeting on Metal.

## Current State

- The `HardwareSpec` has `Accelerator` and `GPU` fields but no memory budget controls
- `GPUSpec` has a `Memory` field (string, e.g., "8Gi") but it's designed for NVIDIA resource requests, not unified memory budgets
- `catalog.yaml` has `vram_estimate` per model but it's informational only and not enforced
- There is no way for users to express "use at most 75% of my system RAM for inference"

## Proposed Changes

### CRD Fields

Add to `HardwareSpec` in `api/v1alpha1/model_types.go`:

```go
import "k8s.io/apimachinery/pkg/api/resource" // provides resource.Quantity

type HardwareSpec struct {
    Accelerator string   `json:"accelerator,omitempty"`
    GPU         *GPUSpec `json:"gpu,omitempty"`

    // MemoryBudget sets an absolute memory limit for this model's inference process.
    // Applies primarily to unified memory architectures (Metal).
    // Example: "16Gi", "24Gi"
    // +optional
    MemoryBudget *resource.Quantity `json:"memoryBudget,omitempty"`

    // MemoryFraction sets the maximum fraction of system memory available for
    // inference on unified memory architectures (Metal). Range: 0.0-1.0.
    // Defaults: 0.67 for systems ≤36GB, 0.75 for larger systems.
    // Ignored when MemoryBudget is set (absolute takes precedence).
    // +optional
    MemoryFraction *float64 `json:"memoryFraction,omitempty"`
}
```

### Agent Integration

- Metal agent reads `memoryFraction` / `memoryBudget` from the Model's HardwareSpec
- Agent flag `--memory-fraction` serves as the global default
- Per-model CRD values override the global default
- Pre-flight validation uses these values to determine if a model can be loaded
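The pre-flight check in the last bullet could look roughly like the sketch below. It assumes the model's memory estimate (for example, something derived from `vram_estimate` in `catalog.yaml`) and the resolved budget are both already available in bytes. The `Preflight` function and its signature are hypothetical and not part of the existing agent code.

```go
package main

import "fmt"

// gib is one gibibyte in bytes.
const gib = int64(1) << 30

// Preflight reports whether a model's estimated footprint fits inside the
// resolved memory budget. Both values are in bytes. It returns nil when the
// model fits and a descriptive error otherwise.
// Hypothetical helper: the name and signature are assumptions for this sketch.
func Preflight(model string, estimateBytes, budgetBytes int64) error {
	if estimateBytes <= 0 || budgetBytes <= 0 {
		return fmt.Errorf("model %q: estimate and budget must be positive", model)
	}
	if estimateBytes > budgetBytes {
		return fmt.Errorf("model %q needs ~%.1f GiB but the budget is %.1f GiB",
			model,
			float64(estimateBytes)/float64(gib),
			float64(budgetBytes)/float64(gib))
	}
	return nil
}
```

Returning an error rather than a bool lets the agent surface the shortfall directly in the Model's status or in CLI output.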

### CLI Integration

- `llmkube deploy --memory-fraction 0.8` for power users on dedicated machines
- `llmkube catalog info <model>` shows estimated memory requirement vs available budget
- `llmkube status` shows current memory utilization when Metal services are running
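Both `--memory-fraction` flags (agent and `llmkube deploy`) need to distinguish "not set" from an explicit value so that lower-priority defaults can still apply. A minimal sketch using only the standard library `flag` package is shown below. The `parseMemoryFraction` helper and the `metal-agent` flag set name are hypothetical, and the real CLI may use a different flag library.

```go
package main

import (
	"flag"
	"fmt"
	"io"
)

// parseMemoryFraction parses args and returns nil when --memory-fraction was
// not supplied, so callers can fall through to a lower-priority default.
// Hypothetical helper: not taken from the existing agent or CLI code.
func parseMemoryFraction(args []string) (*float64, error) {
	fs := flag.NewFlagSet("metal-agent", flag.ContinueOnError)
	fs.SetOutput(io.Discard) // keep parse errors out of stderr in this sketch
	frac := fs.Float64("memory-fraction", 0, "fraction of unified memory to use (0 < f <= 1)")
	if err := fs.Parse(args); err != nil {
		return nil, err
	}

	// Visit only walks flags that were explicitly set on the command line.
	set := false
	fs.Visit(func(f *flag.Flag) {
		if f.Name == "memory-fraction" {
			set = true
		}
	})
	if !set {
		return nil, nil
	}
	if *frac <= 0 || *frac > 1 {
		return nil, fmt.Errorf("--memory-fraction %v is outside (0, 1]", *frac)
	}
	return frac, nil
}
```

This sketch rejects `0` even though the proposed field comment gives the range as 0.0-1.0, on the assumption that a zero fraction would leave no memory for inference.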

### Precedence Order

1. `memoryBudget` on Model CRD (absolute, highest priority)
2. `memoryFraction` on Model CRD
3. `--memory-fraction` agent flag (global default)
4. Built-in adaptive default (0.67 for ≤36GB, 0.75 for >36GB)
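The precedence order above can be sketched as a single resolution function. This is an illustration under assumptions rather than the agent's actual implementation. `BudgetInputs` and `ResolveBudget` are hypothetical names, the absolute budget is assumed to be already parsed from `resource.Quantity` into bytes, and the 36GB threshold is interpreted as 36 GiB.

```go
package main

import (
	"errors"
	"fmt"
)

// gib is one gibibyte in bytes.
const gib = int64(1) << 30

// BudgetInputs collects the optional sources; nil means "not set".
// Hypothetical type for this sketch.
type BudgetInputs struct {
	ModelBudgetBytes *int64   // spec.hardware.memoryBudget, parsed to bytes
	ModelFraction    *float64 // spec.hardware.memoryFraction
	AgentFraction    *float64 // --memory-fraction agent flag
}

// defaultFraction returns the built-in adaptive default.
// Assumption: the 36GB threshold from the proposal is treated as 36 GiB.
func defaultFraction(totalBytes int64) float64 {
	if totalBytes <= 36*gib {
		return 0.67
	}
	return 0.75
}

// ResolveBudget returns the effective budget in bytes and the source that
// supplied it, following the four-step precedence order.
func ResolveBudget(in BudgetInputs, totalBytes int64) (int64, string, error) {
	if totalBytes <= 0 {
		return 0, "", errors.New("total memory must be positive")
	}
	// 1. An absolute budget on the Model wins outright.
	if in.ModelBudgetBytes != nil {
		if *in.ModelBudgetBytes <= 0 {
			return 0, "", errors.New("memoryBudget must be positive")
		}
		return *in.ModelBudgetBytes, "model.memoryBudget", nil
	}
	// 4. Start from the adaptive default, then let 2 and 3 override it.
	frac, source := defaultFraction(totalBytes), "default"
	switch {
	case in.ModelFraction != nil: // 2. per-model fraction
		frac, source = *in.ModelFraction, "model.memoryFraction"
	case in.AgentFraction != nil: // 3. agent-wide flag
		frac, source = *in.AgentFraction, "agent.memory-fraction"
	}
	if frac <= 0 || frac > 1 {
		return 0, "", fmt.Errorf("%s: fraction %v is outside (0, 1]", source, frac)
	}
	return int64(frac * float64(totalBytes)), source, nil
}
```

Returning the source alongside the value makes it easy for `llmkube status` or agent logs to explain which setting produced the effective budget.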

## Example Usage

```yaml
apiVersion: inference.llmkube.dev/v1alpha1
kind: Model
metadata:
  name: llama-3-1-8b
spec:
  source: "https://huggingface.co/..."
  hardware:
    accelerator: metal
    memoryFraction: 0.8  # Dedicated inference Mac, allow more memory
    gpu:
      layers: 32
```

```yaml
apiVersion: inference.llmkube.dev/v1alpha1
kind: Model
metadata:
  name: qwen-coder-32b
spec:
  source: "https://huggingface.co/..."
  hardware:
    accelerator: metal
    memoryBudget: "24Gi"  # Hard cap regardless of system size
    gpu:
      layers: 64
```

## References

- `api/v1alpha1/model_types.go` — HardwareSpec definition
- `pkg/agent/executor.go` — reads GPU config for process spawning
- `pkg/agent/agent.go` — agent configuration
- `cmd/metal-agent/main.go` — agent CLI flags
