cli: add option to connect to server via http(s) #21674

Open
pwilkin wants to merge 2 commits into ggml-org:master from pwilkin:llama-cli-remote

@pwilkin pwilkin commented Apr 9, 2026

Member

Overview

Adds an --endpoint option to connect to an existing server instance.

Additional information

In many cases, people want to run a llama-server for various uses but also might want a quick test UI in cases where they cannot access the WebUI (i.e. pure console / terminal environments). Since llama-cli spawns a separate server instance, you cannot run both in VRAM-constrained environments, so having the option to run llama-cli with a llama-server endpoint seems desirable.

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: Yes, although GLM 5.1 generated code with a goto in it, so I had to double-check.

@pwilkin pwilkin requested review from a team and ngxson as code owners April 9, 2026 11:53

@ngxson ngxson left a comment

Collaborator

IMO I'm not quite comfortable with this change. This adds too much for a feature that no one ever asked for (via an issue)

If you really need to do this, just code your own CLI thing in a higher-level language like Python or Node.js

Comment thread tools/cli/cli-remote.h Outdated
Comment on lines +12 to +37
struct cli_backend {
    virtual ~cli_backend() = default;

    // model / server info
    virtual std::string get_model_name() const = 0;
    virtual bool has_vision() const = 0;
    virtual bool has_audio() const = 0;
    virtual std::string get_build_info() const = 0;

    // chat completion (streaming), returns assistant content text
    virtual std::string generate_completion(
        const json & messages,
        const common_params & params,
        bool verbose_prompt,
        result_timings & out_timings) = 0;

    // load a local text file, return its contents (empty string on failure)
    virtual std::string load_text_file(const std::string & fname) = 0;

    // load a local media file, return the OAI content part JSON for it
    // returns empty JSON object on failure
    virtual json load_media_file(const std::string & fname) = 0;

    // cleanup
    virtual void terminate() = 0;
};

Collaborator

I imagine this will double the effort each time someone adds a new feature to the CLI

Not a wise choice for long-term maintenance. The CLI should support either the native API or the remote API, but not both

Member Author

To be honest, I really do feel like having the remote API as the only one would be the better option. As in: it would add interoperability, it would make it simpler to implement the MCP / command execution stuff and it would remove the need to keep a separate track for accessing the server. And all it would take to retain the current functionality of launching the client and the server at the same time would be a simple wrapper.

Putting this up for consideration and converting this to draft for now.

@pwilkin pwilkin marked this pull request as draft April 9, 2026 16:29

pwilkin commented Apr 9, 2026

Member Author

@ngxson since we don't want double APIs, what do you think of a prototype here that does the following?

  • removes the cli-specific path
  • migrates everything to the http path
  • if run without --endpoint, launches a server on a random port that is closed as soon as the cli exits, to mimic the previous cli functionality
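
The last bullet above depends on finding a port that is free at launch time. One common way to do that on POSIX systems is to bind a probe socket to port 0 and ask the kernel which port it assigned. The sketch below is only an illustration under that assumption: pick_free_port is a hypothetical helper name, not code from this PR, and there is an unavoidable race between closing the probe socket and the server binding the same port.

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

// Hypothetical helper (not from the PR): ask the kernel for an unused
// TCP port on the loopback interface. Returns -1 on failure.
int pick_free_port() {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) {
        return -1;
    }

    sockaddr_in addr{};
    addr.sin_family      = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port        = 0; // port 0 = let the kernel choose

    int port = -1;
    socklen_t len = sizeof(addr);
    if (bind(fd, reinterpret_cast<sockaddr *>(&addr), sizeof(addr)) == 0 &&
        getsockname(fd, reinterpret_cast<sockaddr *>(&addr), &len) == 0) {
        port = ntohs(addr.sin_port);
    }

    close(fd); // the port is released here; the server must bind it promptly
    return port;
}
```

The returned number would then be passed to the spawned server as its port argument. A Windows build would need the Winsock equivalents, which this sketch does not cover.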

ngxson commented Apr 9, 2026

Collaborator

Honestly I don't have a strong opinion on whether the CLI should use the native API, the HTTP API or another IPC mechanism like a unix socket. However, since most LLM CLIs use an HTTP API under the hood, I agree that it may be better in the longer term to go with that for llama-cli.

I do have 2 concerns though:

  1. Currently, the CLI acts as an example of how easy it is to use llama.cpp as an external library (via binding; without being an HTTP server). If we choose to move the CLI away from this, we still need to add an example of doing so (though it can be much more basic than the CLI)
  2. If we choose to use HTTP for the CLI, we should no longer link the CLI against libserver. The consequence is that the CLI must either spawn a llama-server instance, or llama-server should be a daemon.

For point (1), no action is needed from your side, I will eventually implement it (which goes back to the idea of the llamax library); there are many people already asking for an easy-to-use native API that accepts multimodal input. However, for point (2), I think we need to consider it more carefully.

pwilkin commented Apr 9, 2026

Member Author

@ngxson cross-platform daemon management can get really tricky, so I'd prefer not to go that route. I'd say spawning a llama-server instance that gets shut down at cli close would be the preferred way to go.

ngxson commented Apr 10, 2026

Collaborator

hmm ok then, I think there is an extra edge case where the server runs in router mode, but I think it can be handled in the future. would be nice when we finally have hot-swappable models on CLI though

in the longer term, I think having server as a daemon can be quite nice, but for now an --endpoint option can be a good start

Comment thread common/arg.cpp Outdated
}
).set_examples({LLAMA_EXAMPLE_CLI}).set_env("LLAMA_ARG_SHOW_TIMINGS"));
add_opt(common_arg(
{"--endpoint"}, "URL",

Collaborator

IMO this should be a bit more specific, smth like --server-base

and in the future when server becomes a daemon, maybe with a fixed port like 8800, we can move --server-base to default mode and use the mentioned port

pwilkin commented Apr 11, 2026

Member Author

@ngxson so, experimental findings:

-> the cli works pretty nicely with llama-server spawning
-> but properly handling Ctrl+C is a PITA
-> needed to patch the library to allow detaching the server, otherwise the SIGINT would propagate and kill the server

When you're free take a look and tell me if this path is fine. Looked at the code and exorcised any stupid-stuff Kimi wrote. Parameter passing seems to work correctly (i.e. llama-cli -m <qwen> --reasoning off actually results in the model not reasoning).

@pwilkin pwilkin force-pushed the llama-cli-remote branch from 0f5cb2d to b6bcfbe Compare May 1, 2026 18:17
@pwilkin pwilkin marked this pull request as ready for review May 1, 2026 18:18
@pwilkin pwilkin requested a review from ggerganov as a code owner May 1, 2026 18:18

pwilkin commented May 1, 2026

Member Author

@ngxson rebased it on master, I guess we should go through with this, as this would unblock easy llama-cli tools integration + the MCP server-side server.

@ngxson ngxson left a comment

Collaborator

the idea is ok but I don't have a very strong feeling about the current implementation.

IMO the CLI can now be viewed as a frontend project (similar to the svelte web ui), so it should mirror the same frontend paradigm, like for example MVC or centralized state management. many CLI agents (if not all of them) are designed that way and I think that will be the proper impl.

since I already have a good picture in mind, probably the best is that I vibe-code a half-working version with the proper architecture, and you can build upon that. does this way sound better to you?

Collaborator

this is a vendor lib, it should be used as-is, no modifications

Comment thread vendor/sheredom/subprocess.h Outdated
Comment on lines +94 to +97
// Spawn the child in a new session/process group so that terminal signals
// (e.g. SIGINT from Ctrl+C) are not forwarded to the child process.
// On Unix this uses setsid(); on Windows it uses CREATE_NEW_PROCESS_GROUP.
subprocess_option_new_session = 0x20

Collaborator

this is quite overkill IMO, just look at the server router mode, it simply monitors stdin/stdout between the 2 processes and if one of the pipes breaks, it exits

so even if the router instance crashes, it leaves no abandoned children behind
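
The pipe-monitoring idea described above can be illustrated with a small POSIX sketch. It assumes the child's stdin is a pipe whose only write end is held by the parent, so a read returning 0 (EOF) means the parent is gone. wait_for_pipe_closed is a hypothetical name and this is not the router mode's actual code.

```cpp
#include <cerrno>
#include <unistd.h>

// Hypothetical helper (not from llama.cpp): block until the pipe on `fd`
// is closed by the other side. Returns true on EOF, false on a read error.
bool wait_for_pipe_closed(int fd) {
    char buf[256];
    for (;;) {
        ssize_t n = read(fd, buf, sizeof(buf));
        if (n == 0) {
            return true;  // EOF: every write end is closed, the peer is gone
        }
        if (n < 0 && errno != EINTR) {
            return false; // unexpected read error
        }
        // n > 0 (or EINTR): this sketch ignores any data and keeps waiting
    }
}
```

A child process could run this on a background thread with STDIN_FILENO and exit once it returns, which would cover the case where the parent crashes without any signal handling.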

Comment thread tools/cli/cli-backend.h Outdated

Collaborator

from high-level view, I'm not very comfortable about this design because this notion of "backend" is not well-defined (e.g. no clear functional boundary)

  • first off, I don't think this should be called a "backend", it's better to be a "client", cli_client for example, as in "API client"
  • secondly, cli_client should only provide the data, it should not modify the view, i.e. should never call console:: functions

cli_client should simply expose something like create_chat_completions and anything else should be provided as callback

process management (i.e. spawn_...) should be in its own class, smth like std::optional<cli_server> that will be, as the name suggests, optional as the user might want to use a remote server

pwilkin commented Jun 9, 2026

Member Author

@ngxson yeah, that sounds good, I saw that there's been a strong push towards the new architecture so if you could sketch the intended direction I'll gladly build on that.

ngxson commented Jun 9, 2026

Collaborator

I drafted this gist to show the direction: https://gist.github.com/ngxson/ccc774a2441cbfc98ce082c31d94075a

for cli_server, I think creating a whole new process is quite overkill; running the server inside a thread should be enough, my impl is just 50 LOC to manage that

for cli_client:

  • it should be somewhat equivalent to openai python package
  • contains only HTTP-related functions
  • any real-time responses must be returned via callback (for example with SSE, a callback can be invoked multiple times)
  • it should not call anything related to console::
  • it's roughly equivalent to the "model" in MVC

the final part missing from my code above is the cli_context, but it will have something like:

struct cli_context {
    std::optional<cli_server> server; // only set if user doesn't specify remote server
    cli_client client; // always initialized
    // ...
};

the cli_context can be put into a new file cli-context.cpp|h if you like, it should manage the state of the whole CLI app and be responsible for making calls to console:: to render the view. in other words, the cli_context is the "controller" in MVC

we don't have cli_view for now as the console:: is already the view layer. in the future, we can also have a TUI lib that is responsible for the view, but that will be a future improvement

lmk if that sounds good to you
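
To make the callback point for cli_client concrete, here is a rough sketch of an incremental SSE reader that invokes a callback once per data line and never touches the console. The names (sse_line_parser, feed) are hypothetical and come from neither the PR nor the gist. It is deliberately simplified: it treats every "data:" line as its own message and skips all other SSE fields.

```cpp
#include <functional>
#include <string>

// Hypothetical sketch: turns raw SSE bytes into one callback call per
// `data:` line. HTTP chunks may split lines anywhere, so partial input
// is buffered until a newline arrives.
struct sse_line_parser {
    std::string pending; // bytes received so far that do not yet form a full line

    void feed(const std::string & chunk,
              const std::function<void(const std::string &)> & on_data) {
        pending += chunk;
        std::string::size_type pos;
        while ((pos = pending.find('\n')) != std::string::npos) {
            std::string line = pending.substr(0, pos);
            pending.erase(0, pos + 1);
            if (!line.empty() && line.back() == '\r') {
                line.pop_back(); // tolerate CRLF line endings
            }
            if (line.rfind("data:", 0) != 0) {
                continue; // blank separators, comments and other fields are skipped
            }
            std::string payload = line.substr(5);
            if (!payload.empty() && payload.front() == ' ') {
                payload.erase(0, 1); // a single leading space is not part of the value
            }
            on_data(payload);
        }
    }
};
```

With this shape, the HTTP layer would call feed() from its receive handler, and the controller would decide in the callback what to render.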

pwilkin commented Jun 9, 2026

Member Author

Aight, I'll take a look and let you know (I think I get the general idea).

The CLI no longer links server-context; it always talks to llama-server
over HTTP. Following the MVC direction proposed in the PR review:

- cli_client (model): thin HTTP/SSE client for the llama-server API,
  data only, real-time responses delivered via callbacks
- cli_context (controller): owns the chat state and the interactive
  loop, renders through the view
- cli_view (view): user-facing input/output interface, implemented on
  top of common/console
- cli_server: optional local llama-server child process, managed with
  the same stdin/stdout protocol the server router mode uses for model
  instances (ready signal on stdout, exit on stdin EOF, kill on
  timeout), so no orphan process is left behind in either direction

By default llama-cli spawns a local llama-server on a free port and
forwards all server-relevant args to it; with --server-base URL it
connects to an external instance instead and model args are not
required. Media files are sent as base64 content parts, and chat
templating, reasoning parsing and sampling are now handled server-side.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
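
The commit message above says media files are sent as base64 content parts. As a rough illustration of what that could look like, the sketch below encodes raw bytes as base64 and wraps them in an OpenAI-style image_url content part carrying a data URI. Whether the PR uses exactly this shape is an assumption, make_image_part is a hypothetical name, and the JSON is assembled by hand here only to keep the example dependency-free.

```cpp
#include <string>

// Standard base64 (RFC 4648) with '=' padding.
std::string base64_encode(const std::string & in) {
    static const char tbl[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    std::string out;
    out.reserve((in.size() + 2) / 3 * 4);
    for (std::string::size_type i = 0; i < in.size(); i += 3) {
        const std::string::size_type n = in.size() - i; // bytes left, >= 1
        unsigned v = static_cast<unsigned char>(in[i]) << 16;
        if (n > 1) v |= static_cast<unsigned char>(in[i + 1]) << 8;
        if (n > 2) v |= static_cast<unsigned char>(in[i + 2]);
        out += tbl[(v >> 18) & 63];
        out += tbl[(v >> 12) & 63];
        out += n > 1 ? tbl[(v >> 6) & 63] : '=';
        out += n > 2 ? tbl[v & 63] : '=';
    }
    return out;
}

// Hypothetical helper: build an image content part as a JSON string.
// Assumes `mime` contains no characters that need JSON escaping.
std::string make_image_part(const std::string & mime, const std::string & bytes) {
    return "{\"type\":\"image_url\",\"image_url\":{\"url\":\"data:" + mime +
           ";base64," + base64_encode(bytes) + "\"}}";
}
```
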
@pwilkin pwilkin force-pushed the llama-cli-remote branch from edc5863 to 28536fa Compare June 12, 2026 20:19
@pwilkin pwilkin requested a review from a team as a code owner June 12, 2026 20:19
The controller now only reports semantic events and data objects
(loading started, server info, response began, reasoning/content
deltas, timings, messages and errors); the view decides how to present
them: spinner handling, thinking markers, banner layout, command list
alignment, prompt markers, echo truncation and timing formats all live
in cli_view_console now.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

pwilkin commented Jun 12, 2026

Member Author

@ngxson Mkay, experimenting with remote CC a bit, told it to implement it according to the spec then yelled at it for not respecting MVC abstractions, think it looks decent now.

ngxson commented Jun 12, 2026

Collaborator

to be honest, I think this is now more over-complicated / more slop than I initially expected (for example, my points about spawning a thread and about not having view.cpp are not respected)

given I also have CC subscription paid by HF (plus pi agents with HF inference endpoints), probably better for me to just do it on my side. I can push a new PR tmr if you are ok with it.

the only part that I can likely salvage is the cli-client.cpp from this PR, I will put you in the Co-Author
