cli: add option to connect to server via http(s)#21674
Conversation
| struct cli_backend { | ||
| virtual ~cli_backend() = default; | ||
|
|
||
| // model / server info | ||
| virtual std::string get_model_name() const = 0; | ||
| virtual bool has_vision() const = 0; | ||
| virtual bool has_audio() const = 0; | ||
| virtual std::string get_build_info() const = 0; | ||
|
|
||
| // chat completion (streaming), returns assistant content text | ||
| virtual std::string generate_completion( | ||
| const json & messages, | ||
| const common_params & params, | ||
| bool verbose_prompt, | ||
| result_timings & out_timings) = 0; | ||
|
|
||
| // load a local text file, return its contents (empty string on failure) | ||
| virtual std::string load_text_file(const std::string & fname) = 0; | ||
|
|
||
| // load a local media file, return the OAI content part JSON for it | ||
| // returns empty JSON object on failure | ||
| virtual json load_media_file(const std::string & fname) = 0; | ||
|
|
||
| // cleanup | ||
| virtual void terminate() = 0; | ||
| }; |
There was a problem hiding this comment.
I imagine this will add double the effort each time someone adds a new feature to the CLI
Not a wise choice for long-term maintenance. The CLI should either support native API or remote API, but not both
There was a problem hiding this comment.
To be honest, I really do feel like having the remote API as the only one would be the better option. As in: it would add interoperability, it would make it simpler to implement the MCP / command execution stuff and it would remove the need to keep a separate track for accessing the server. And all it would take to retain the current functionality of launching the client and the server at the same time would be a simple wrapper.
Putting this up for consideration and converting this to draft for now.
|
@ngxson since we don't want double APIs, what do you think of a prototype here that does the following:
|
|
Honestly I don't have a strong opinion on whether the CLI should use native API, HTTP API or another IPC mechanism like unix socket. However, since most LLM CLI uses HTTP API under the hood, I agree that it may be better in the longer term to go with that for llama-cli. I do have 2 concerns though:
For the point (1), no actions is needed from your side, I will eventually implement it (which goes back to the idea of |
|
@ngxson cross-platform daemon management can get really tricky, so I'd prefer not to go that route. I'd say spawning a |
|
hmm ok then, I think there is an extra edge case is where the server runs in router mode, but I think it can be handled in the future. would be nice when we finally have hot-swappable models on CLI though in the longer term, I think having server as a daemon can be quite nice, but for now an |
| } | ||
| ).set_examples({LLAMA_EXAMPLE_CLI}).set_env("LLAMA_ARG_SHOW_TIMINGS")); | ||
| add_opt(common_arg( | ||
| {"--endpoint"}, "URL", |
There was a problem hiding this comment.
IMO this should be a bit more specific, smth like --server-base
and in the future when server becomes a daemon, maybe with a fixed port like 8800, we can move --server-base to default mode and use the mentioned port
d77c2e4 to
42ae57c
Compare
|
@ngxson so, experimental findings: -> the cli works pretty nicely with llama-server spawning When you're free take a look and tell me if this path is fine. Looked at the code and exorcised any stupid-stuff Kimi wrote. Parameter passing seems to work correctly (i.e. |
|
@ngxson rebased it on master, I guess we should go through with this, as this would unblock easy |
ngxson
left a comment
There was a problem hiding this comment.
the idea is ok but I don't have a very strong feeling about the current implementation.
IMO the CLI can now be viewed as a frontend project (similar to the svelte web ui), so it should mirror the same frontend paradigm, like for example MVC or centralized state management. many CLI agents (if not all of them) are designed that way and I think that will be the proper impl.
since I already had a good picture in mind, probably the best is that I can vibe-code a half-working version with the proper architecture, and you can build upon that. does his way sound better to you?
There was a problem hiding this comment.
this is a vendor lib, it should be use as-is, no modifications
| // Spawn the child in a new session/process group so that terminal signals | ||
| // (e.g. SIGINT from Ctrl+C) are not forwarded to the child process. | ||
| // On Unix this uses setsid(); on Windows it uses CREATE_NEW_PROCESS_GROUP. | ||
| subprocess_option_new_session = 0x20 |
There was a problem hiding this comment.
this is quite overkill IMO, just look at the server router mode, it simply monitor stdin/stdout between 2 processes and if one of the pipe breaks, exits itself
so even router instance crashes, it leaves no abandon children behind
There was a problem hiding this comment.
from high-level view, I'm not very comfortable about this design because this notion of "backend" is not well-defined (e.g. no clear functional boundary)
- first of, I don't think this should be called a "backend", it's better to be a "client",
cli_clientfor example, refers to "API client" - secondly,
cli_clientshould only provide the data, it should not modify the view, i.e. should never callconsole::functions
cli_client should simply expose something like create_chat_completions and anything else should be provided as callback
process management (i.e. spawn_...) should be in its own class, smth like std::optional<cli_server> that will be, as the name suggested, optional as user might want to use a remote server
|
@ngxson yeah, that sounds good, I saw that there's been a strong push towards the new architecture so if you could sketch the intended direction I'll gladly build on that. |
|
I drafted this gist to show the direction: https://gist.github.com/ngxson/ccc774a2441cbfc98ce082c31d94075a for for
the final part missing from my code above is the struct cli_context {
std::optional<cli_server> server; // only set if user doesn't specify remote server
cli_client client; // always initialized
// ...
};the we don't have lmk if that sounds good to you |
|
Aight, I'll take a look and let you know (I think I get the general idea). |
The CLI no longer links server-context; it always talks to llama-server over HTTP. Following the MVC direction proposed in the PR review: - cli_client (model): thin HTTP/SSE client for the llama-server API, data only, real-time responses delivered via callbacks - cli_context (controller): owns the chat state and the interactive loop, renders through the view - cli_view (view): user-facing input/output interface, implemented on top of common/console - cli_server: optional local llama-server child process, managed with the same stdin/stdout protocol the server router mode uses for model instances (ready signal on stdout, exit on stdin EOF, kill on timeout), so no orphan process is left behind in either direction By default llama-cli spawns a local llama-server on a free port and forwards all server-relevant args to it; with --server-base URL it connects to an external instance instead and model args are not required. Media files are sent as base64 content parts, and chat templating, reasoning parsing and sampling are now handled server-side. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
edc5863 to
28536fa
Compare
The controller now only reports semantic events and data objects (loading started, server info, response began, reasoning/content deltas, timings, messages and errors); the view decides how to present them: spinner handling, thinking markers, banner layout, command list alignment, prompt markers, echo truncation and timing formats all live in cli_view_console now. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
|
@ngxson Mkay, experimenting with remote CC a bit, told it to implement it according to the spec then yelled at it for not respecting MVC abstractions, think it looks decent now. |
|
to say honestly, I think this is now quite over-complicated / slop than I initially expected (for example, my points about spawning thread and point about not having view.cpp are not respected) given I also have CC subscription paid by HF (plus pi agents with HF inference endpoints), probably better for me to just do it on my side. I can push a new PR tmr if you are ok with it. the only part that I can likely salvage is the cli-client.cpp from this PR, I will put you in the Co-Author |
Overview
Adds an
--endpointoption to connect to an existing server instance.Additional information
In many cases, people want to run a
llama-serverfor various uses but also might want a quick test UI in cases where they cannot access the WebUI (i.e. pure console / terminal environments). Sincellama-clispawns a separate server instance, you cannot run both in VRAM-constrained environments, so having the option to runllama-cliwith allama-serverendpoint seems desirable.Requirements
gotoin it, so I had to double-check.