Skip to content

fix(server): guard two mapping[] accesses against silent default-insert#77

Merged
marksverdhei merged 1 commit into
htfrom
fix/server-models-mapping-races
Jun 12, 2026
Merged

fix(server): guard two mapping[] accesses against silent default-insert#77
marksverdhei merged 1 commit into
htfrom
fix/server-models-mapping-races

Conversation

@marksverdhei

Copy link
Copy Markdown

Epoch #73 task 4. Bug-hunt sweep on tools/server/server-models.cpp found two race-condition latent bugs in the router state machine. Both used std::map::operator[] where find() was needed.

Bug 1: unload_lru() cv.wait predicate (line 762)

cv.wait(lk, [&]() {
    return mapping[lru_model_name].meta.status == SERVER_MODEL_STATUS_UNLOADED;
});

Default-init server_model_meta::status = SERVER_MODEL_STATUS_UNLOADED. If lru_model_name is missing (concurrent reload erased it, or another thread already unloaded + removed), operator[] silently inserts a default — predicate is spuriously true AND mapping is polluted with a phantom entry.

Fix: find(); missing → treat as done (no entry to wait for).

Bug 2: load() after cv.wait(!is_reloading) (line 779)

cv.wait(lk, [this]() { return !is_reloading; });
auto meta = mapping[name].meta;

has_model(name) was checked at line 770 under its own lock-cycle, then unload_lru() cycled the lock, then we re-acquire here. A reload in that window could erase name from mapping; mapping[name] then silently default-inserts. The early-return check (status != UNLOADED) is bypassed (default = UNLOADED) and we proceed to spawn a child with empty preset args.

Fix: find(); missing → log + bail.

Out of scope

Other mapping[] sites in load() (lines 1020, 1031) are reachable only while the lock is held continuously from the now-guarded read at 779, so no race exists. mapping[name] at line 1186 (proxy_request) also reads under-lock after get_meta() confirmed presence.

Verified

  • cmake -B build -DGGML_CPU=ON -DLLAMA_BUILD_APP=ON && cmake --build build --target llama-server succeeds
  • ✅ No behavior change on the non-racy path: find() + early-return on miss is operationally equivalent to operator[] + early-return on default-UNLOADED, minus the silent insertion

Epoch #73 task 4 (bug-hunt on server-models.cpp).

Two race-condition latent bugs in router state machine. Both used
`std::map::operator[]` where `find()` was needed; silent default-insert
on miss produced false-positive predicate / phantom mapping entries.

## Bug 1: unload_lru() cv.wait predicate (line 762)

```
cv.wait(lk, [&]() {
    return mapping[lru_model_name].meta.status == SERVER_MODEL_STATUS_UNLOADED;
});
```

The default-init `server_model_meta` has `status = SERVER_MODEL_STATUS_UNLOADED`,
so this predicate is spuriously true if the model is missing — AND it
inserts a garbage entry into `mapping` as a side effect.

Race: another thread (or a reload) unloads/removes the model between
`unload(lru_model_name)` (which released its own lock) and our re-acquire.
Predicate returns true on the inserted default; we proceed thinking the
unload completed, but mapping is now polluted with a phantom entry.

Fix: `find()`; missing → treat as done.

## Bug 2: load() after cv.wait(!is_reloading) (line 779)

```
cv.wait(lk, [this]() { return !is_reloading; });
auto meta = mapping[name].meta;
if (meta.status != SERVER_MODEL_STATUS_UNLOADED) { ... return; }
```

`has_model()` was checked earlier under its own lock-cycle, then
`unload_lru()` cycled the lock, then we re-acquire here. If a reload
ran in that window and erased `name` from mapping, `mapping[name]`
silently inserts a default. The early-return is bypassed
(default status = UNLOADED) and we proceed to spawn a child with
empty preset args.

Fix: `find()`; missing → log + bail.

## Verified

- ✅ `cmake --build build --target llama-server` succeeds locally
- ✅ No behavior change on the non-racy path (find()+early-return on miss
   is operationally equivalent to operator[]+early-return on
   default-UNLOADED, minus the silent insertion)

## Out of scope

Other `mapping[]` sites in load() (lines 1020, 1031) are reachable only
while the lock is held continuously from the now-guarded read at 779, so
no race exists there. `mapping[name]` at line 1186 (`proxy_request`)
also reads under-lock after `get_meta()` confirmed presence.
@marksverdhei marksverdhei merged commit a9e0813 into ht Jun 12, 2026
5 of 8 checks passed
@marksverdhei marksverdhei deleted the fix/server-models-mapping-races branch June 12, 2026 18:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant