
Misc. bug: Router mode WebUI – model selection does not update correctly when unloading + loading a new model mid-chat #21626

@maddes8cht

Name and Version

llama-server --version
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 12287 MiB):
Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes, VRAM: 12287 MiB
load_backend: loaded RPC backend from g:\Llamacpp\bin\ggml-rpc.dll
load_backend: loaded CPU backend from g:\Llamacpp\bin\ggml-cpu-haswell.dll
version: 8709 (85d482e)
built with Clang 21.1.8 for Windows AMD64

Operating systems

Windows

Which llama.cpp modules do you know to be affected?

Other (Please specify in the next section)

Command line

Problem description & steps to reproduce

Description

In llama-server router mode (using the built-in WebUI), switching models mid-conversation by unloading the current model and loading a new one does not work as expected.

The UI is highly confusing in this scenario, especially on low-VRAM setups like an RTX 3060 12 GB where you usually cannot keep two large models loaded at the same time.

Steps to Reproduce

  1. Start llama-server in router mode (e.g. with a low --models-max value or limited VRAM).
  2. In the WebUI, load and chat with Model A.
  3. Unload Model A.
  4. Select and start loading Model B.
  5. While Model B is loading (or right after it finishes), type a new message and send it.

What actually happens:

  • During loading of Model B, the model selector correctly shows Model B (sometimes with a "loading..." indicator).
  • As soon as Model B finishes loading, the selector silently jumps back to the previously unloaded Model A.
  • Any message sent during or right after loading is automatically routed to Model A instead of the newly loaded Model B. This causes Model A to be reloaded.
  • Result: two large models end up loaded at the same time, which is exactly what unloading Model A was meant to avoid, and responses become extremely slow.

Expected behavior:

  • Once Model B is selected and starts loading (or finishes loading), it should become the active model for the current chat.
  • The selector should stay on Model B after loading completes.
  • Any new message should be sent to the currently selected model in the UI (Model B), not silently redirected to the old one.
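The expected behavior above can be sketched as a small state model. This is only an illustration, not the WebUI's actual code: `ChatModelSelection`, its methods, and the model names `model-a` / `model-b` are all hypothetical. The point it shows is that a "load finished" event should update a model's status but never change which model is selected.

```typescript
// Hypothetical model of the expected selector behavior. None of these names
// come from the llama.cpp WebUI source; they exist only for illustration.
type ModelStatus = "unloaded" | "loading" | "loaded";

class ChatModelSelection {
  private selected: string | null = null;
  private readonly status = new Map<string, ModelStatus>();

  // The user picks a model in the selector, which also starts loading it.
  selectAndLoad(model: string): void {
    this.selected = model;
    this.status.set(model, "loading");
  }

  // The user unloads a model. Only its status changes here.
  unload(model: string): void {
    this.status.set(model, "unloaded");
  }

  // Loading finished. Only the status changes; the selection is left alone,
  // so the selector cannot jump back to a previously used model.
  onLoadFinished(model: string): void {
    this.status.set(model, "loaded");
  }

  // The model the next chat message should be routed to.
  targetForNextMessage(): string | null {
    return this.selected;
  }

  // Status of a model; models never seen before count as unloaded.
  statusOf(model: string): ModelStatus {
    return this.status.get(model) ?? "unloaded";
  }
}

// Replay the reproduction steps against the sketch.
function replayReproSteps(): ChatModelSelection {
  const chat = new ChatModelSelection();
  chat.selectAndLoad("model-a");  // step 2: load and chat with Model A
  chat.onLoadFinished("model-a");
  chat.unload("model-a");         // step 3: unload Model A
  chat.selectAndLoad("model-b");  // step 4: select and start loading Model B
  chat.onLoadFinished("model-b"); // Model B finishes loading
  return chat;
}
```

In this sketch, `replayReproSteps().targetForNextMessage()` stays on Model B both while it is loading and after it has finished, so a message sent at step 5 would never trigger a reload of Model A.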

It is not obvious that you have to manually re-select Model B in the dropdown after it has fully loaded. Even though I now know this behavior, it still catches me repeatedly because it is highly counter-intuitive.

During the loading phase the UI visually suggests that Model B is already active, so users naturally assume the next message will go to Model B.

Motivation / Use Cases

This workflow is very common on hardware with limited VRAM (e.g. an RTX 3060 12 GB):

  • Quickly testing different models in the same conversation
  • Switching to a smarter (but slower) model when the current one is struggling
  • Switching to a faster/smaller model when speed matters more than quality

Having to manually re-select the model after every load defeats the convenience of the Load/Unload buttons in router mode.

First Bad Commit

No response

Relevant log output
