
Model controller re-downloads cached models after controller upgrade #193

@Defilan

Bug Description

After upgrading the LLMKube controller (e.g., helm upgrade from 0.4.15 to 0.4.20), the model controller re-downloads some models despite them already being present in the PVC cache. The cache check works correctly for some models but not others during the same reconciliation cycle.

Observed Behavior

Controller logs during post-upgrade reconciliation show:

# These models correctly hit the cache:
Model found in cache, skipping download  model=qwen3-30b-a3b  path=/models/6f5be1f92dc5a67b/model.gguf  size=18556686752
Model found in cache, skipping download  model=nomic-embed     path=/models/f4fd3185144e5d72/model.gguf  size=146146432

# This model starts re-downloading despite the file existing:
Started downloading model  model=qwen3-32b  source=https://huggingface.co/unsloth/Qwen3-32B-GGUF/resolve/main/Qwen3-32B-Q4_K_M.gguf  cacheKey=b305905cfb57fc0a
Downloading model  model=qwen3-32b  dest=/models/b305905cfb57fc0a/model.gguf

The file already exists on disk:

$ ls -lh /models/b305905cfb57fc0a/
-rw-r--r-- 1 101 102 19G Feb 20 00:48 model.gguf
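If the controller's cache check compares byte-exact sizes, the rounded 19G from `ls -lh` is not enough to rule out a size mismatch (e.g. a truncated earlier download). A quick way to compare, assuming the source URL from the log above:

```shell
# Byte-exact size of the cached file (ls -lh rounds to 19G).
stat -c %s /models/b305905cfb57fc0a/model.gguf

# Size advertised by the source (-I requests headers only,
# -L follows the CDN redirect Hugging Face issues).
curl -sIL "https://huggingface.co/unsloth/Qwen3-32B-GGUF/resolve/main/Qwen3-32B-Q4_K_M.gguf" \
  | grep -i '^content-length'
```

If the two numbers differ, the re-download is arguably correct behavior and the real bug is the earlier truncation; if they match, the cache check itself is at fault.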

The Model status is self-contradictory: the condition reports "Model found in cache", but the phase is set to Downloading:

status:
  conditions:
    - message: Model found in cache
      reason: ModelCached
      status: "True"
      type: Available
  phase: Downloading

Impact

  • Wastes bandwidth re-downloading multi-GB model files unnecessarily
  • Can fill disk on nodes with limited storage
  • Delays InferenceService readiness: the service stays in Pending while the model "downloads", and the re-download blocks the transition to Ready even though the running pod already has the model loaded

Possible Causes

  • Race condition: multiple reconciliation loops fire simultaneously after controller restart, and one loop may check the cache before another has finished updating the status
  • The cache validation may be comparing file size against expected size, but the expected size field may not be populated for all models
  • The fetchModel() function may have a code path that bypasses the cache check under certain conditions

Environment

  • LLMKube version: 0.4.19 → 0.4.20 upgrade
  • Kubernetes: K3s on single node
  • PVC: 914G volume, 48% used
  • Models affected: qwen3-32b (19G) — re-downloaded
  • Models NOT affected: qwen3-30b-a3b (18.6G), nomic-embed (146M) — correctly cached

Labels

bug (Something isn't working) · component/controller (Related to the operator controller)
