feat(ds4): layer-split distributed inference by localai-bot · Pull Request #10098 · mudler/LocalAI

localai-bot · 2026-05-30T21:02:53Z

Summary

Adds manual distributed (layer-split) inference for the ds4 backend, mirroring the operator experience of LocalAI's llama.cpp manual workers.

ds4 splits transformer layers across machines: a coordinator listens and owns layers 0:K; workers dial in and own higher ranges (e.g. 20:output), each loading only its slice of the GGUF. This is the inverse of llama.cpp's RPC model (where the server dials out to workers).
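The ownership model above can be sketched as a coverage check: the coordinator and worker ranges together must tile the whole layer stack before generation can run. This is an illustration only. The `LayerRange` type, the half-open `[Start, End)` convention, and the 40-layer model are assumptions, not taken from the ds4 sources.

```go
package main

import (
	"fmt"
	"sort"
)

// LayerRange is a half-open slice [Start, End) of transformer layers.
// Hypothetical type for illustration; not taken from the ds4 sources.
type LayerRange struct {
	Start, End int
}

// coversAll reports whether the ranges tile [0, total) with no gap and no overlap.
func coversAll(ranges []LayerRange, total int) bool {
	sorted := append([]LayerRange(nil), ranges...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].Start < sorted[j].Start })
	next := 0
	for _, r := range sorted {
		if r.Start != next || r.End <= r.Start {
			return false
		}
		next = r.End
	}
	return next == total
}

func main() {
	// Coordinator owns 0:20, one worker owns 20:output on a 40-layer model (made-up numbers).
	route := []LayerRange{{0, 20}, {20, 40}}
	fmt.Println(coversAll(route, 40))     // coordinator plus worker
	fmt.Println(coversAll(route[:1], 40)) // coordinator alone
}
```

With only the coordinator's range present, the check fails, which corresponds to the state in which the coordinator is still waiting for workers to dial in.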

- ds4-worker binary (backend/cpp/ds4/worker_main.c): a small standalone worker that opens the engine in worker role and runs ds4_dist_run(). Links the existing engine objects + ds4_distributed.o; no gRPC/protobuf dependency. Built via a new CMake target and packaged next to grpc-server (Linux loader bundling + Darwin dylib walk handled).
- Coordinator wiring (grpc-server.cpp): the backend becomes a coordinator when LoadModel options carry ds4_role:coordinator + ds4_layers:0:K + ds4_listen:host:port (+ optional ds4_route_timeout, default 60s). Predict/PredictStream wait for the worker route to cover all layers (without holding the engine mutex, so Status/Health probes stay responsive), else return gRPC UNAVAILABLE. Absent ds4_role, behavior is identical to today (fully back-compatible).
- Manual worker CLI: `local-ai worker ds4-distributed -- <ds4-worker args>` resolves the ds4 backend and execs the packaged ds4-worker (via the bundled lib/ld.so loader when present), mirroring `local-ai worker llama-cpp-rpc`.
- Docs (docs/content/features/distributed-mode.md, .agents/ds4-backend.md) and an opt-in hardware-gated e2e (tests/e2e-backends, gated on BACKEND_TEST_DS4_DISTRIBUTED=1).
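As a rough illustration of the coordinator options listed above, the following Go sketch reads `key:value` option strings and applies the 60s default. The real parsing lives in grpc-server.cpp and is not reproduced here. The `coordinatorConfig` type, the error messages, the rejection of roles other than `coordinator`, the assumption that `ds4_route_timeout` is an integer number of seconds, and the example listen address are all hypothetical.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
	"time"
)

// coordinatorConfig is a hypothetical container for the ds4_* options.
type coordinatorConfig struct {
	Layers       string
	Listen       string
	RouteTimeout time.Duration
}

// parseCoordinatorOptions returns enabled=false when ds4_role is absent,
// which corresponds to the unchanged single-node path.
func parseCoordinatorOptions(opts []string) (coordinatorConfig, bool, error) {
	kv := make(map[string]string)
	for _, opt := range opts {
		// Split on the first colon only: values such as "0:20" or "host:port" contain colons.
		if key, value, found := strings.Cut(opt, ":"); found {
			kv[key] = value
		}
	}
	role, present := kv["ds4_role"]
	if !present {
		return coordinatorConfig{}, false, nil
	}
	if role != "coordinator" {
		return coordinatorConfig{}, false, fmt.Errorf("unsupported ds4_role %q", role)
	}
	cfg := coordinatorConfig{
		Layers:       kv["ds4_layers"],
		Listen:       kv["ds4_listen"],
		RouteTimeout: 60 * time.Second, // default from the PR description
	}
	if cfg.Layers == "" || cfg.Listen == "" {
		return coordinatorConfig{}, false, fmt.Errorf("coordinator needs ds4_layers and ds4_listen")
	}
	if raw, ok := kv["ds4_route_timeout"]; ok {
		secs, err := strconv.Atoi(raw) // assumption: value is whole seconds
		if err != nil || secs <= 0 {
			return coordinatorConfig{}, false, fmt.Errorf("invalid ds4_route_timeout %q", raw)
		}
		cfg.RouteTimeout = time.Duration(secs) * time.Second
	}
	return cfg, true, nil
}

func main() {
	cfg, enabled, err := parseCoordinatorOptions([]string{
		"ds4_role:coordinator", "ds4_layers:0:20", "ds4_listen:0.0.0.0:9000",
	})
	fmt.Println(cfg, enabled, err)
}
```

Splitting on the first colon only is the key detail, since both the layer spec and the listen address contain colons themselves.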

Phase 2 (p2p auto-discovery) is designed but intentionally deferred to a follow-up.

Test Plan

go test ./core/cli/worker/... passes (worker CLI arg-assembly + constants, Ginkgo/Gomega)
go vet ./tests/e2e-backends/... clean; the new distributed spec skips cleanly without BACKEND_TEST_DS4_DISTRIBUTED
ds4-worker CPU link verified locally (all ds4_dist_* symbols resolve; arg validation exits cleanly)
golangci-lint --new-from-merge-base=master reports 0 issues on touched Go
CI builds grpc-server for cpu-ds4 / cuda13-ds4 (requires protobuf/grpc; not available on the dev host)
Hardware-gated multi-process run: BACKEND_TEST_DS4_DISTRIBUTED=1 BACKEND_BINARY=... BACKEND_TEST_MODEL_FILE=... go test ./tests/e2e-backends/...

🤖 Generated with Claude Code

Add worker_main.c, a minimal standalone worker that owns a slice of the model's transformer layers and serves activations over ds4's own TCP transport via ds4_dist_run(). It links the same engine objects the backend already builds (including ds4_distributed.o) and has NO gRPC/protobuf dependency, so it builds even on hosts lacking protobuf/grpc dev headers. Launched by `local-ai worker ds4-distributed`. Wire the ds4-worker CMake target (mirrors grpc-server's object/GPU/native handling) and have the Makefile copy + clean the binary alongside grpc-server. Ignore the built ds4-worker artifact. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code]

Copy the standalone ds4-worker binary into the backend package (Linux package.sh) and the Darwin OCI tar (ds4-darwin.sh: both the explicit copy and the otool dylib-bundling loop) so distributed workers ship with the backend. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code]

Add distributed COORDINATOR support to the ds4 backend's gRPC server. Distributed inference is handled at the engine level: when LoadModel receives 'ds4_role:coordinator', the process populates ds4_engine_options.distributed (role, layer slice, listen host/port) before ds4_engine_open, then the normal ds4_session_* generation path runs transparently once the worker route covers all layers.

- New LoadModel options: ds4_role, ds4_layers (START:END or START:output), ds4_listen (host:port), ds4_route_timeout.
- parse_layers_spec() maps the layer spec onto ds4_distributed_layers.
- wait_route_ready() blocks generation until ds4_session_distributed_route_ready() reports full coverage (or timeout), gating both Predict and PredictStream; returns UNAVAILABLE on timeout/error.
- No ds4_role => g_distributed stays false and wait_route_ready is a no-op, so single-node behavior is unchanged.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code]
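To make the `START:END` / `START:output` layer spec concrete, here is a small Go sketch of a parser for that syntax. It is not the C++ parse_layers_spec() from the PR. The `layerSpec` type and the validation rules (non-negative start, end not below start) are assumptions for illustration.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// layerSpec is a hypothetical parsed form of "START:END" or "START:output".
type layerSpec struct {
	Start    int
	End      int  // meaningful only when ToOutput is false
	ToOutput bool // true for "START:output": the slice runs through the output head
}

func parseLayersSpec(spec string) (layerSpec, error) {
	startStr, endStr, found := strings.Cut(spec, ":")
	if !found {
		return layerSpec{}, fmt.Errorf("layers spec %q: want START:END or START:output", spec)
	}
	start, err := strconv.Atoi(startStr)
	if err != nil || start < 0 {
		return layerSpec{}, fmt.Errorf("layers spec %q: bad start", spec)
	}
	if endStr == "output" {
		return layerSpec{Start: start, ToOutput: true}, nil
	}
	end, err := strconv.Atoi(endStr)
	if err != nil || end < start {
		return layerSpec{}, fmt.Errorf("layers spec %q: bad end", spec)
	}
	return layerSpec{Start: start, End: end}, nil
}

func main() {
	for _, s := range []string{"0:20", "20:output", "20"} {
		spec, err := parseLayersSpec(s)
		fmt.Println(s, spec, err)
	}
}
```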

Add the ds4WorkerArgs helper plus findDS4Backend/DS4Distributed.Run that resolve the ds4 backend via the gallery and exec the packaged ds4-worker binary. Unlike worker_llamacpp.go, ds4 bundles its own dynamic loader (lib/ld.so) for glibc compatibility, so when present we exec ds4-worker through that loader with LD_LIBRARY_PATH=<backend>/lib, mirroring backend/cpp/ds4/run.sh; otherwise we exec it directly. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code]
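The loader-or-direct decision described above might look roughly like the sketch below. It is not the code from the PR: `workerCommand`, `buildWorkerCommand`, the injected `exists` callback, and the placeholder backend path are hypothetical, and the `loader worker args...` argv layout is an assumption based on how a dynamic loader is usually invoked explicitly.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// workerCommand is a hypothetical description of what would be exec'd.
type workerCommand struct {
	Path string   // binary to exec
	Args []string // argv, including argv[0]
	Env  []string // extra environment entries
}

// buildWorkerCommand picks between running ds4-worker directly and running it
// through the bundled loader. exists is injected so the choice is easy to test.
func buildWorkerCommand(backendDir string, workerArgs []string, exists func(string) bool) workerCommand {
	worker := filepath.Join(backendDir, "ds4-worker")
	libDir := filepath.Join(backendDir, "lib")
	loader := filepath.Join(libDir, "ld.so")
	if !exists(loader) {
		// No bundled loader: run the worker binary directly.
		return workerCommand{Path: worker, Args: append([]string{worker}, workerArgs...)}
	}
	// Bundled loader present: the loader runs the worker and resolves
	// shared libraries from <backend>/lib.
	return workerCommand{
		Path: loader,
		Args: append([]string{loader, worker}, workerArgs...),
		Env:  []string{"LD_LIBRARY_PATH=" + libDir},
	}
}

func main() {
	fileExists := func(p string) bool {
		_, err := os.Stat(p)
		return err == nil
	}
	// "/path/to/backends/ds4" is a placeholder; worker flags are passed through untouched.
	cmd := buildWorkerCommand("/path/to/backends/ds4", os.Args[1:], fileExists)
	fmt.Println(cmd.Path, cmd.Args, cmd.Env)
}
```

A real launcher would hand the result to an exec call; the sketch stops at building the command so that the branch logic stays visible.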

Add a ds4 section to the distributed-mode feature docs (coordinator model YAML, manual worker command, layer-range semantics, the 'GGUF on every machine' requirement, coordinator-listens dial direction vs llama.cpp) and a terse Distributed mode section to the ds4 backend agent guide. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code]

Add a self-contained, opt-in Ginkgo spec to the backend e2e suite that spins a ds4 coordinator (via the packaged run.sh, loaded with ds4_role/ds4_layers/ds4_listen options) plus a ds4-worker process for the upper layers, then uses Eventually to assert a short successful Predict once the layer route forms, before tearing the worker down. Gated by BACKEND_TEST_DS4_DISTRIBUTED=1 (plus the existing BACKEND_BINARY + BACKEND_TEST_MODEL_FILE and optional layer/listen/accel knobs); compiles and skips cleanly with no env, hardware, or model. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code]
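The opt-in gating described above can be sketched without the Ginkgo dependency as a pure function over the environment. The `skipReason` helper and its messages are hypothetical. In an actual Ginkgo spec, a non-empty reason would typically be passed to `Skip`.

```go
package main

import (
	"fmt"
	"os"
)

// skipReason returns an empty string when the distributed spec should run,
// or a human-readable reason for skipping it. Hypothetical helper.
func skipReason(getenv func(string) string) string {
	if getenv("BACKEND_TEST_DS4_DISTRIBUTED") != "1" {
		return "BACKEND_TEST_DS4_DISTRIBUTED is not set to 1"
	}
	for _, name := range []string{"BACKEND_BINARY", "BACKEND_TEST_MODEL_FILE"} {
		if getenv(name) == "" {
			return name + " is required"
		}
	}
	return ""
}

func main() {
	if reason := skipReason(os.Getenv); reason != "" {
		fmt.Println("skipping distributed spec:", reason)
		return
	}
	fmt.Println("running distributed spec")
}
```

Taking the lookup function as a parameter keeps the gate testable without mutating the process environment.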

The ds4-worker target is built from worker_main.c (C), so CMake linked it with the C driver. The nvcc-built ds4_cuda.o (and Obj-C++ ds4_metal.o) reference the C++ runtime, so the CUDA/Metal builds failed with undefined libstdc++ symbols (std::__throw_length_error). The CPU build passed because ds4_cpu.o is pure C. Force LINKER_LANGUAGE CXX so libstdc++ is linked. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code]

mudler added 12 commits May 30, 2026 20:14

fix(ds4): tighten ds4-worker integer arg validation to match upstream

5fcbeab

Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code]

fix(ds4): don't block Status during route wait; validate coordinator …

2736093

…opts Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code]
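The idea behind this fix, waiting for the route without holding the lock that status probes need, can be illustrated with the Go sketch below. The actual change is in the C++ gRPC server and its mechanism is not shown in this PR text, so the polling loop, the `coordinator` type, and the error value are assumptions. A real server would map the error to gRPC UNAVAILABLE.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

var errRouteUnavailable = errors.New("worker route does not cover all layers")

// coordinator is a hypothetical stand-in for the backend state.
type coordinator struct {
	engineMu   sync.Mutex  // serializes generation; never held while waiting
	routeReady atomic.Bool // set once workers cover every layer
}

// waitRouteReady polls the readiness flag until it is set or the timeout expires.
// It takes no lock, so a concurrent status probe is never blocked by the wait.
func (c *coordinator) waitRouteReady(timeout, interval time.Duration) error {
	deadline := time.Now().Add(timeout)
	for {
		if c.routeReady.Load() {
			return nil
		}
		if time.Now().After(deadline) {
			return errRouteUnavailable
		}
		time.Sleep(interval)
	}
}

// predict waits first, then takes the engine lock only for the generation itself.
func (c *coordinator) predict(timeout time.Duration) (string, error) {
	if err := c.waitRouteReady(timeout, 10*time.Millisecond); err != nil {
		return "", err
	}
	c.engineMu.Lock()
	defer c.engineMu.Unlock()
	return "generated text", nil // placeholder for the real generation path
}

func main() {
	c := &coordinator{}
	go func() {
		time.Sleep(50 * time.Millisecond) // simulate a worker dialing in
		c.routeReady.Store(true)
	}()
	fmt.Println(c.predict(time.Second))
}
```

The ordering in `predict` is the point of the sketch: the potentially long wait happens before the lock is acquired, so the lock is held only for the generation itself.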

feat(cli): register the ds4-distributed worker subcommand

9852dfe

Wire DS4Distributed into the Worker kong command tree so `local-ai worker ds4-distributed` is available. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code]

test(ds4): pass coordinator ctx to worker; lowercase error string

4250a73

Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code]

docs(ds4): note distributed transport is plaintext/unauthenticated

af7b2d4

Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code]

style(ds4): replace em dashes in distributed docs/agent/test per repo…

83e0e75

… convention Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code]

mudler added the enhancement New feature or request label May 30, 2026

mudler changed the title ~~feat(ds4): layer-split distributed inference (Phase 1: manual coordinator + worker)~~ feat(ds4): layer-split distributed inference May 30, 2026

mudler merged commit 07f6c15 into master May 30, 2026
60 checks passed

mudler deleted the feat/ds4-distributed-impl branch May 30, 2026 22:09

BrewTestBot mentioned this pull request Jun 10, 2026

localai 4.4.0 Homebrew/homebrew-core#287347

Merged

mudler merged 13 commits into master from feat/ds4-distributed-impl
