fix(server): guard two mapping[] accesses against silent default-insert#77
Merged
Conversation
Epoch #73 task 4 (bug-hunt on server-models.cpp). Two race-condition latent bugs in router state machine. Both used `std::map::operator[]` where `find()` was needed; silent default-insert on miss produced false-positive predicate / phantom mapping entries. ## Bug 1: unload_lru() cv.wait predicate (line 762) ``` cv.wait(lk, [&]() { return mapping[lru_model_name].meta.status == SERVER_MODEL_STATUS_UNLOADED; }); ``` The default-init `server_model_meta` has `status = SERVER_MODEL_STATUS_UNLOADED`, so this predicate is spuriously true if the model is missing — AND it inserts a garbage entry into `mapping` as a side effect. Race: another thread (or a reload) unloads/removes the model between `unload(lru_model_name)` (which released its own lock) and our re-acquire. Predicate returns true on the inserted default; we proceed thinking the unload completed, but mapping is now polluted with a phantom entry. Fix: `find()`; missing → treat as done. ## Bug 2: load() after cv.wait(!is_reloading) (line 779) ``` cv.wait(lk, [this]() { return !is_reloading; }); auto meta = mapping[name].meta; if (meta.status != SERVER_MODEL_STATUS_UNLOADED) { ... return; } ``` `has_model()` was checked earlier under its own lock-cycle, then `unload_lru()` cycled the lock, then we re-acquire here. If a reload ran in that window and erased `name` from mapping, `mapping[name]` silently inserts a default. The early-return is bypassed (default status = UNLOADED) and we proceed to spawn a child with empty preset args. Fix: `find()`; missing → log + bail. ## Verified - ✅ `cmake --build build --target llama-server` succeeds locally - ✅ No behavior change on the non-racy path (find()+early-return on miss is operationally equivalent to operator[]+early-return on default-UNLOADED, minus the silent insertion) ## Out of scope Other `mapping[]` sites in load() (lines 1020, 1031) are reachable only while the lock is held continuously from the now-guarded read at 779, so no race exists there. `mapping[name]` at line 1186 (`proxy_request`) also reads under-lock after `get_meta()` confirmed presence.
This was referenced Jun 5, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Epoch #73 task 4. Bug-hunt sweep on tools/server/server-models.cpp found two race-condition latent bugs in the router state machine. Both used
std::map::operator[]wherefind()was needed.Bug 1: unload_lru() cv.wait predicate (line 762)
cv.wait(lk, [&]() { return mapping[lru_model_name].meta.status == SERVER_MODEL_STATUS_UNLOADED; });Default-init
server_model_meta::status=SERVER_MODEL_STATUS_UNLOADED. Iflru_model_nameis missing (concurrent reload erased it, or another thread already unloaded + removed),operator[]silently inserts a default — predicate is spuriously true AND mapping is polluted with a phantom entry.Fix:
find(); missing → treat as done (no entry to wait for).Bug 2: load() after cv.wait(!is_reloading) (line 779)
has_model(name)was checked at line 770 under its own lock-cycle, thenunload_lru()cycled the lock, then we re-acquire here. A reload in that window could erasenamefrom mapping;mapping[name]then silently default-inserts. The early-return check (status != UNLOADED) is bypassed (default = UNLOADED) and we proceed to spawn a child with empty preset args.Fix:
find(); missing → log + bail.Out of scope
Other
mapping[]sites in load() (lines 1020, 1031) are reachable only while the lock is held continuously from the now-guarded read at 779, so no race exists.mapping[name]at line 1186 (proxy_request) also reads under-lock afterget_meta()confirmed presence.Verified
cmake -B build -DGGML_CPU=ON -DLLAMA_BUILD_APP=ON && cmake --build build --target llama-serversucceedsfind()+ early-return on miss is operationally equivalent tooperator[]+ early-return on default-UNLOADED, minus the silent insertion