Closed
Labels: bug (Something isn't working), component/controller (Related to the operator controller)
Bug Description
After upgrading the LLMKube controller (e.g., helm upgrade from 0.4.15 to 0.4.20), the model controller re-downloads some models despite them already being present in the PVC cache. The cache check works correctly for some models but not others during the same reconciliation cycle.
Observed Behavior
Controller logs during post-upgrade reconciliation show:
# These models correctly hit the cache:
Model found in cache, skipping download model=qwen3-30b-a3b path=/models/6f5be1f92dc5a67b/model.gguf size=18556686752
Model found in cache, skipping download model=nomic-embed path=/models/f4fd3185144e5d72/model.gguf size=146146432
# This model starts re-downloading despite the file existing:
Started downloading model model=qwen3-32b source=https://huggingface.co/unsloth/Qwen3-32B-GGUF/resolve/main/Qwen3-32B-Q4_K_M.gguf cacheKey=b305905cfb57fc0a
Downloading model model=qwen3-32b dest=/models/b305905cfb57fc0a/model.gguf
The file already exists on disk:
$ ls -lh /models/b305905cfb57fc0a/
-rw-r--r-- 1 101 102 19G Feb 20 00:48 model.gguf
The Model status shows contradictory information — the condition says "Model found in cache" but the phase is set to Downloading:
status:
conditions:
- message: Model found in cache
reason: ModelCached
status: "True"
type: Available
  phase: Downloading
Impact
- Wastes bandwidth re-downloading multi-GB model files unnecessarily
- Can fill disk on nodes with limited storage
- Delays InferenceService readiness (the service stays in Pending while the model "downloads")
- The re-download blocks the InferenceService from transitioning to Ready even though the running pod already has the model loaded
Possible Causes
- Race condition: multiple reconciliation loops fire simultaneously after controller restart, and one loop may check the cache before another has finished updating the status
- The cache validation may be comparing file size against expected size, but the expected size field may not be populated for all models
- The fetchModel() function may have a code path that bypasses the cache check under certain conditions
Environment
- LLMKube version: 0.4.19 → 0.4.20 upgrade
- Kubernetes: K3s on single node
- PVC: 914G volume, 48% used
- Models affected: qwen3-32b (19G) — re-downloaded
- Models NOT affected: qwen3-30b-a3b (18.6G), nomic-embed (146M) — correctly cached