cli: add option to connect to server via http(s) by pwilkin · Pull Request #21674 · ggml-org/llama.cpp

pwilkin · 2026-04-09T11:53:02Z

Overview

Adds an --endpoint option to connect to an existing server instance.

Additional information

In many cases, people want to run a llama-server for various uses but also might want a quick test UI in cases where they cannot access the WebUI (i.e. pure console / terminal environments). Since llama-cli spawns a separate server instance, you cannot run both in VRAM-constrained environments, so having the option to run llama-cli with a llama-server endpoint seems desirable.

Requirements

I have read and agree with the contributing guidelines
AI usage disclosure: Yes, although GLM 5.1 generated code with a goto in it, so I had to double-check.

ngxson

IMO I'm not quite comfortable with this change. This adds too much for a feature that no one ever asked for (via an issue)

If you really need to do this, just code your own CLI thing in higher-level languages like python or nodejs

ngxson · 2026-04-09T16:21:47Z

+struct cli_backend {
+    virtual ~cli_backend() = default;
+
+    // model / server info
+    virtual std::string get_model_name() const = 0;
+    virtual bool has_vision() const = 0;
+    virtual bool has_audio() const = 0;
+    virtual std::string get_build_info() const = 0;
+
+    // chat completion (streaming), returns assistant content text
+    virtual std::string generate_completion(
+        const json & messages,
+        const common_params & params,
+        bool verbose_prompt,
+        result_timings & out_timings) = 0;
+
+    // load a local text file, return its contents (empty string on failure)
+    virtual std::string load_text_file(const std::string & fname) = 0;
+
+    // load a local media file, return the OAI content part JSON for it
+    // returns empty JSON object on failure
+    virtual json load_media_file(const std::string & fname) = 0;
+
+    // cleanup
+    virtual void terminate() = 0;
+};


I imagine this will add double the effort each time someone adds a new feature to the CLI

Not a wise choice for long-term maintenance. The CLI should either support native API or remote API, but not both

To be honest, I really do feel like having the remote API as the only one would be the better option. As in: it would add interoperability, it would make it simpler to implement the MCP / command execution stuff and it would remove the need to keep a separate track for accessing the server. And all it would take to retain the current functionality of launching the client and the server at the same time would be a simple wrapper.

Putting this up for consideration and converting this to draft for now.

pwilkin · 2026-04-09T16:34:21Z

@ngxson since we don't want double APIs, what do you think of a prototype here that does the following?

- removes the cli-specific path
- migrates everything to the http path
- if run without --endpoint, launches a server on a random port that is closed as soon as the cli exits to mimic the previous cli functionality

ngxson · 2026-04-09T17:07:51Z

Honestly I don't have a strong opinion on whether the CLI should use native API, HTTP API or another IPC mechanism like unix socket. However, since most LLM CLI uses HTTP API under the hood, I agree that it may be better in the longer term to go with that for llama-cli.

I do have 2 concerns though:

1. Currently, the CLI acts as an example of how easy it is to use llama.cpp as an external library (via binding; without being an HTTP server). If we choose to move the CLI away from this, we would still need to add an example of doing so (though it can be much more basic than the CLI)
2. If we choose to use HTTP for the CLI, we should no longer link the CLI against libserver. The consequence is that the CLI must either spawn a llama-server instance, or llama-server should be a daemon.

For point (1), no action is needed from your side, I will eventually implement it (which goes back to the idea of the llamax library), there are many people already asking for an easy-to-use native API that accepts multimodal. However, for point (2), I think we need to consider it more carefully.

pwilkin · 2026-04-09T17:15:02Z

@ngxson cross-platform daemon management can get really tricky, so I'd prefer not to go that route. I'd say spawning a llama-server instance that gets shut down at cli close would be the preferred way to go.
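
For the "launch a server on a random port" idea mentioned earlier in the thread, one common approach on POSIX systems is to bind a socket to port 0 and read back the port the kernel assigned. The following is a minimal sketch under that assumption; `pick_free_port` is a hypothetical helper name and is not taken from this PR, and a Windows build would need the Winsock equivalents.

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

// Hypothetical helper: ask the OS for a currently free TCP port on 127.0.0.1.
// Returns the port number, or -1 on failure.
// Caveat: the socket is closed before returning, so another process could
// take the port before the spawned server binds it (a small race window).
int pick_free_port() {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) {
        return -1;
    }

    sockaddr_in addr{};
    addr.sin_family      = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port        = 0; // port 0 = let the kernel choose

    socklen_t len  = sizeof(addr);
    int       port = -1;
    if (bind(fd, reinterpret_cast<sockaddr *>(&addr), sizeof(addr)) == 0 &&
        getsockname(fd, reinterpret_cast<sockaddr *>(&addr), &len) == 0) {
        port = ntohs(addr.sin_port);
    }

    close(fd);
    return port;
}
```

The returned value could then be passed to the spawned server as its listening port and used by the CLI to build the base URL. Because of the race noted in the comment, a real implementation may prefer to let the server bind port 0 itself and report the chosen port back.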

ngxson · 2026-04-10T13:51:37Z

hmm ok then, I think there is an extra edge case where the server runs in router mode, but I think it can be handled in the future. would be nice when we finally have hot-swappable models on CLI though

in the longer term, I think having server as a daemon can be quite nice, but for now an --endpoint option can be a good start

ngxson · 2026-04-10T13:59:13Z

        }
    ).set_examples({LLAMA_EXAMPLE_CLI}).set_env("LLAMA_ARG_SHOW_TIMINGS"));
+    add_opt(common_arg(
+        {"--endpoint"}, "URL",


IMO this should be a bit more specific, smth like --server-base

and in the future when server becomes a daemon, maybe with a fixed port like 8800, we can move --server-base to default mode and use the mentioned port

pwilkin · 2026-04-11T20:03:33Z

@ngxson so, experimental findings:

-> the cli works pretty nicely with llama-server spawning
-> but properly handling Ctrl+C is a PITA
-> needed to patch the library to allow detaching the server, otherwise the SIGINT would propagate and kill the server

When you're free take a look and tell me if this path is fine. Looked at the code and exorcised any stupid-stuff Kimi wrote. Parameter passing seems to work correctly (i.e. llama-cli -m <qwen> --reasoning off actually results in the model not reasoning).

pwilkin · 2026-05-01T18:19:32Z

@ngxson rebased it on master, I guess we should go through with this, as this would unblock easy llama-cli tools integration + the MCP server-side server.

ngxson

the idea is ok but I don't have a very strong feeling about the current implementation.

IMO the CLI can now be viewed as a frontend project (similar to the svelte web ui), so it should mirror the same frontend paradigm, like for example MVC or centralized state management. many CLI agents (if not all of them) are designed that way and I think that will be the proper impl.

since I already have a good picture in mind, probably the best is that I vibe-code a half-working version with the proper architecture, and you can build upon that. does this way sound better to you?

ngxson · 2026-06-08T21:40:05Z

this is a vendor lib, it should be used as-is, no modifications

ngxson · 2026-06-08T21:41:39Z

+  // Spawn the child in a new session/process group so that terminal signals
+  // (e.g. SIGINT from Ctrl+C) are not forwarded to the child process.
+  // On Unix this uses setsid(); on Windows it uses CREATE_NEW_PROCESS_GROUP.
+  subprocess_option_new_session = 0x20


this is quite overkill IMO, just look at the server router mode, it simply monitors stdin/stdout between the 2 processes and if one of the pipes breaks, the process exits itself

so even if the router instance crashes, it leaves no abandoned children behind
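
The pipe-monitoring idea described here can be illustrated with a generic POSIX sketch. This is not the router mode's actual code; `spawn_child_that_exits_on_eof` and the exit code 42 are hypothetical and exist only for the demonstration. The child blocks on the read end of a pipe and exits as soon as it sees EOF, which happens whenever the parent closes its write end or dies.

```cpp
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical demo: fork a child that exits once the parent's end of the
// pipe is closed. Returns the child's exit code, or -1 on error.
int spawn_child_that_exits_on_eof() {
    int fds[2];
    if (pipe(fds) != 0) {
        return -1;
    }

    pid_t pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    if (pid == 0) {
        // Child: keep only the read end and block until EOF (or an error).
        close(fds[1]);
        char buf[64];
        while (read(fds[0], buf, sizeof(buf)) > 0) {
            // Data from the parent is ignored in this demo.
        }
        _exit(42); // the parent is gone (or closed the pipe): shut down
    }

    // Parent: closing the write end is what the child observes as EOF.
    // A crashed parent has the same effect, because the kernel closes
    // all of its file descriptors.
    close(fds[0]);
    close(fds[1]);

    int status = 0;
    if (waitpid(pid, &status, 0) < 0) {
        return -1;
    }
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

In this demo the parent closes the pipe immediately to simulate its own exit; a long-running parent would keep the write end open for its whole lifetime. No process-group or session handling is involved, which is the point being made in the comment above.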

ngxson · 2026-06-08T22:06:24Z

from high-level view, I'm not very comfortable about this design because this notion of "backend" is not well-defined (e.g. no clear functional boundary)

first off, I don't think this should be called a "backend", it's better to be a "client", cli_client for example, which refers to "API client"

secondly, cli_client should only provide the data, it should not modify the view, i.e. should never call console:: functions

cli_client should simply expose something like create_chat_completions and anything else should be provided as callback

process management (i.e. spawn_...) should be in its own class, smth like std::optional<cli_server> that will be, as the name suggests, optional as the user might want to use a remote server

pwilkin · 2026-06-09T05:27:31Z

@ngxson yeah, that sounds good, I saw that there's been a strong push towards the new architecture so if you could sketch the intended direction I'll gladly build on that.

ngxson · 2026-06-09T11:05:10Z

I drafted this gist to show the direction: https://gist.github.com/ngxson/ccc774a2441cbfc98ce082c31d94075a

for cli_server, I think creating a whole new process is quite overkill; running the server inside a thread should be enough, my impl is just 50 LOC to manage that

for cli_client:

it should be somewhat equivalent to openai python package
contains only HTTP-related functions
any real-time responses must be returned via callback (for example with SSE, a callback can be invoked multiple times)
it should not call anything related to console::
it's roughly equivalent to the "model" in MVC
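
To make the "callback can be invoked multiple times" point concrete, here is a minimal sketch of an SSE line parser that delivers each `data:` payload through a callback and never touches the console. `sse_parser` is a hypothetical name, not code from the gist or the PR. It is also a simplification: it assumes one `data:` line per event, as OpenAI-style streaming responses typically use, and ignores other SSE fields such as `event:` or `id:`.

```cpp
#include <cstddef>
#include <functional>
#include <string>

// Hypothetical helper: accumulates raw bytes from an HTTP response body and
// invokes on_data once per complete "data:" line.
struct sse_parser {
    std::function<void(const std::string &)> on_data;
    std::string buf; // bytes received so far that do not yet form a full line

    void feed(const std::string & chunk) {
        buf += chunk;
        std::size_t pos;
        while ((pos = buf.find('\n')) != std::string::npos) {
            std::string line = buf.substr(0, pos);
            buf.erase(0, pos + 1);
            if (!line.empty() && line.back() == '\r') {
                line.pop_back(); // tolerate CRLF line endings
            }
            if (line.rfind("data:", 0) != 0) {
                continue; // blank separator lines and other fields are skipped
            }
            std::string payload = line.substr(5);
            if (!payload.empty() && payload[0] == ' ') {
                payload.erase(0, 1); // a single leading space is not part of the value
            }
            if (on_data) {
                on_data(payload);
            }
        }
    }
};
```

Network chunks may split a line anywhere, so the parser buffers until it sees a newline. The caller (the controller, in the MVC terms used here) decides what to do with each payload, for example parsing it as JSON and rendering the delta.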

the final part missing from my code above is the cli_context, but it will have something like:

struct cli_context {
    std::optional<cli_server> server; // only set if user doesn't specify remote server
    cli_client client; // always initialized
    // ...
};

the cli_context can be put into a new file cli-context.cpp|h if you like, it should manage the state of the whole CLI app and be responsible for making calls to console:: to render the view. in other words, the cli_context is the "controller" in MVC
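
As a rough illustration of how the optional server could be wired up, the sketch below uses stub types. Everything except the `cli_context` field names is an assumption made for the example: the stub `cli_server` and `cli_client` bodies, `base_url()`, and `make_cli_context` are hypothetical and do not come from the gist.

```cpp
#include <optional>
#include <string>

// Stub standing in for the real local-server manager (hypothetical).
struct cli_server {
    int port;
    std::string base_url() const {
        return "http://127.0.0.1:" + std::to_string(port);
    }
};

// Stub standing in for the real HTTP client (hypothetical).
struct cli_client {
    std::string base_url;
};

struct cli_context {
    std::optional<cli_server> server; // only set if user doesn't specify remote server
    cli_client client;                // always initialized
};

// Hypothetical factory: an empty server_base means "run a local server".
cli_context make_cli_context(const std::string & server_base, int local_port) {
    cli_context ctx;
    if (server_base.empty()) {
        ctx.server          = cli_server{local_port};
        ctx.client.base_url = ctx.server->base_url();
    } else {
        ctx.client.base_url = server_base; // remote mode: no local server
    }
    return ctx;
}
```

With this shape the rest of the controller only ever talks to `client`, and the presence or absence of `server` affects startup and shutdown alone.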

we don't have cli_view for now as the console:: is already the view layer. in the future, we can also have a TUI lib that is responsible for the view, but that will be a future improvement

lmk if that sounds good to you

pwilkin · 2026-06-09T11:28:47Z

Aight, I'll take a look and let you know (I think I get the general idea).

The CLI no longer links server-context; it always talks to llama-server over HTTP. Following the MVC direction proposed in the PR review:

- cli_client (model): thin HTTP/SSE client for the llama-server API, data only, real-time responses delivered via callbacks
- cli_context (controller): owns the chat state and the interactive loop, renders through the view
- cli_view (view): user-facing input/output interface, implemented on top of common/console
- cli_server: optional local llama-server child process, managed with the same stdin/stdout protocol the server router mode uses for model instances (ready signal on stdout, exit on stdin EOF, kill on timeout), so no orphan process is left behind in either direction

By default llama-cli spawns a local llama-server on a free port and forwards all server-relevant args to it; with --server-base URL it connects to an external instance instead and model args are not required. Media files are sent as base64 content parts, and chat templating, reasoning parsing and sampling are now handled server-side.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

The controller now only reports semantic events and data objects (loading started, server info, response began, reasoning/content deltas, timings, messages and errors); the view decides how to present them: spinner handling, thinking markers, banner layout, command list alignment, prompt markers, echo truncation and timing formats all live in cli_view_console now.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
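
The split described in this commit message, where the controller emits semantic events and the view chooses the presentation, might look roughly like the sketch below. The event names, `cli_event`, and `cli_view_plain` are hypothetical and only cover a few of the events listed; the view here renders into a string so the formatting decisions are easy to inspect.

```cpp
#include <string>

// Hypothetical subset of the semantic events a controller could report.
enum class cli_event_type { loading_started, reasoning_delta, content_delta, error };

struct cli_event {
    cli_event_type type;
    std::string    text; // payload; unused for loading_started
};

// The controller depends only on this interface.
struct cli_view {
    virtual ~cli_view() = default;
    virtual void on_event(const cli_event & ev) = 0;
};

// Hypothetical view: owns all presentation choices (markers, newlines).
struct cli_view_plain : cli_view {
    std::string out;
    bool        in_reasoning = false;

    void on_event(const cli_event & ev) override {
        switch (ev.type) {
            case cli_event_type::loading_started:
                out += "Loading...\n";
                break;
            case cli_event_type::reasoning_delta:
                if (!in_reasoning) {
                    out += "[thinking] "; // marker is emitted once per reasoning run
                    in_reasoning = true;
                }
                out += ev.text;
                break;
            case cli_event_type::content_delta:
                if (in_reasoning) {
                    out += "\n"; // close the reasoning line before the answer
                    in_reasoning = false;
                }
                out += ev.text;
                break;
            case cli_event_type::error:
                out += "error: " + ev.text + "\n";
                break;
        }
    }
};
```

Swapping in a different view (a TUI, or a test double) would not require any change to the code that emits the events.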

pwilkin · 2026-06-12T23:17:18Z

@ngxson Mkay, experimenting with remote CC a bit, told it to implement it according to the spec then yelled at it for not respecting MVC abstractions, think it looks decent now.

ngxson · 2026-06-12T23:31:47Z

to be honest, I think this is now more over-complicated / slop than I initially expected (for example, my point about spawning a thread and my point about not having view.cpp are not respected)

given I also have CC subscription paid by HF (plus pi agents with HF inference endpoints), probably better for me to just do it on my side. I can push a new PR tmr if you are ok with it.

the only part that I can likely salvage is the cli-client.cpp from this PR, I will put you in the Co-Author

pwilkin requested review from a team and ngxson as code owners April 9, 2026 11:53

ngxson requested changes Apr 9, 2026

ngxson reviewed Apr 9, 2026

pwilkin marked this pull request as draft April 9, 2026 16:29

github-actions Bot added the examples label Apr 9, 2026

ngxson reviewed Apr 10, 2026

pwilkin force-pushed the llama-cli-remote branch from d77c2e4 to 42ae57c Compare April 10, 2026 15:50

ngxson mentioned this pull request Apr 15, 2026

libs : rename libcommon -> libllama-common #21936

Merged

pwilkin force-pushed the llama-cli-remote branch from 0f5cb2d to b6bcfbe Compare May 1, 2026 18:17

pwilkin marked this pull request as ready for review May 1, 2026 18:18

pwilkin requested a review from ggerganov as a code owner May 1, 2026 18:18

pwilkin mentioned this pull request Jun 3, 2026

server: (router) add model management API #23976

Open

3 tasks

ngxson reviewed Jun 8, 2026

pwilkin force-pushed the llama-cli-remote branch from edc5863 to 28536fa Compare June 12, 2026 20:19

pwilkin requested a review from a team as a code owner June 12, 2026 20:19

github-actions Bot added the server label Jun 12, 2026
