feat(ds4): layer-split distributed inference #10098

Merged
mudler merged 13 commits into master from feat/ds4-distributed-impl
May 30, 2026

@localai-bot

Collaborator

Summary

Adds manual distributed (layer-split) inference for the ds4 backend, mirroring the operator experience of LocalAI's llama.cpp manual workers.

ds4 splits transformer layers across machines: a coordinator listens and owns layers 0:K; workers dial in and own higher ranges (e.g. 20:output), each loading only its slice of the GGUF. This is the inverse of llama.cpp's RPC model (where the server dials out to workers).
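The range notation above (`0:K`, `20:output`) can be illustrated with a small parser. This is a minimal Go sketch, not the PR's code: the real parsing is done by `parse_layers_spec()` in the C++ gRPC server, and `LayerRange` and `ParseLayerRange` are hypothetical names. The source does not say whether END is inclusive, so the sketch only checks that END is not below START.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// LayerRange is a hypothetical representation of one machine's layer slice.
// ToOutput marks a slice that runs through the last layer to the output.
type LayerRange struct {
	Start    int
	End      int // meaningful only when ToOutput is false
	ToOutput bool
}

// ParseLayerRange parses "START:END" or "START:output".
func ParseLayerRange(spec string) (LayerRange, error) {
	startStr, endStr, ok := strings.Cut(spec, ":")
	if !ok {
		return LayerRange{}, fmt.Errorf("layer spec %q: want START:END or START:output", spec)
	}
	start, err := strconv.Atoi(startStr)
	if err != nil || start < 0 {
		return LayerRange{}, fmt.Errorf("layer spec %q: bad start %q", spec, startStr)
	}
	if endStr == "output" {
		return LayerRange{Start: start, ToOutput: true}, nil
	}
	end, err := strconv.Atoi(endStr)
	if err != nil || end < start {
		return LayerRange{}, fmt.Errorf("layer spec %q: bad end %q", spec, endStr)
	}
	return LayerRange{Start: start, End: end}, nil
}

func main() {
	// A coordinator slice, a worker slice, and a malformed spec.
	for _, spec := range []string{"0:20", "20:output", "20"} {
		r, err := ParseLayerRange(spec)
		fmt.Printf("%-10s %+v err=%v\n", spec, r, err)
	}
}
```

Keeping `output` as a separate flag rather than a magic layer number means the parser does not need to know the model's layer count.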

  • ds4-worker binary (backend/cpp/ds4/worker_main.c): a small standalone worker that opens the engine in worker role and runs ds4_dist_run(). Links the existing engine objects + ds4_distributed.o; no gRPC/protobuf dependency. Built via a new CMake target and packaged next to grpc-server (Linux loader bundling + Darwin dylib walk handled).
  • Coordinator wiring (grpc-server.cpp): the backend becomes a coordinator when LoadModel options carry ds4_role:coordinator + ds4_layers:0:K + ds4_listen:host:port (+ optional ds4_route_timeout, default 60s). Predict/PredictStream wait for the worker route to cover all layers (without holding the engine mutex, so Status/Health probes stay responsive), else return gRPC UNAVAILABLE. Absent ds4_role, behavior is identical to today (fully back-compatible).
  • Manual worker CLI: local-ai worker ds4-distributed -- <ds4-worker args> resolves the ds4 backend and execs the packaged ds4-worker (via the bundled lib/ld.so loader when present), mirroring local-ai worker llama-cpp-rpc.
  • Docs (docs/content/features/distributed-mode.md, .agents/ds4-backend.md) and an opt-in hardware-gated e2e (tests/e2e-backends, gated on BACKEND_TEST_DS4_DISTRIBUTED=1).
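The coordinator options in the list above are `key:value` strings whose values can themselves contain colons (`ds4_layers:0:K`, `ds4_listen:host:port`), so a parser has to split on the first colon only. The following Go sketch illustrates that and the back-compatible default when `ds4_role` is absent. It is an assumption-laden illustration: the actual logic lives in `grpc-server.cpp`, `CoordinatorConfig` and `ParseDS4Options` are hypothetical names, the port is made up, and treating `ds4_route_timeout` as whole seconds is a guess based on the stated 60s default.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
	"strings"
	"time"
)

// CoordinatorConfig is a hypothetical container for the ds4_* options.
type CoordinatorConfig struct {
	Distributed  bool
	Layers       string // e.g. "0:20"
	Listen       string // host:port
	RouteTimeout time.Duration
}

// ParseDS4Options scans "key:value" strings and splits on the FIRST colon,
// so values such as "0:20" or "0.0.0.0:9000" stay intact.
func ParseDS4Options(opts []string) (CoordinatorConfig, error) {
	cfg := CoordinatorConfig{RouteTimeout: 60 * time.Second} // default from the PR text
	role := ""
	for _, opt := range opts {
		key, val, ok := strings.Cut(opt, ":")
		if !ok {
			continue // not a key:value option
		}
		switch key {
		case "ds4_role":
			role = val
		case "ds4_layers":
			cfg.Layers = val
		case "ds4_listen":
			cfg.Listen = val
		case "ds4_route_timeout":
			secs, err := strconv.Atoi(val) // assumption: plain seconds
			if err != nil || secs <= 0 {
				return CoordinatorConfig{}, fmt.Errorf("ds4_route_timeout: bad value %q", val)
			}
			cfg.RouteTimeout = time.Duration(secs) * time.Second
		}
	}
	if role == "" {
		// No ds4_role: single-node behavior, nothing distributed is configured.
		return CoordinatorConfig{}, nil
	}
	if role != "coordinator" {
		return CoordinatorConfig{}, fmt.Errorf("unsupported ds4_role %q", role)
	}
	if cfg.Layers == "" || cfg.Listen == "" {
		return CoordinatorConfig{}, errors.New("coordinator requires ds4_layers and ds4_listen")
	}
	cfg.Distributed = true
	return cfg, nil
}

func main() {
	cfg, err := ParseDS4Options([]string{
		"ds4_role:coordinator", "ds4_layers:0:20", "ds4_listen:0.0.0.0:9000",
	})
	fmt.Printf("%+v err=%v\n", cfg, err)
}
```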

Phase 2 (p2p auto-discovery) is designed but intentionally deferred to a follow-up.

Test Plan

  • go test ./core/cli/worker/... passes (worker CLI arg-assembly + constants, Ginkgo/Gomega)
  • go vet ./tests/e2e-backends/... clean; the new distributed spec skips cleanly without BACKEND_TEST_DS4_DISTRIBUTED
  • ds4-worker CPU link verified locally (all ds4_dist_* symbols resolve; arg validation exits cleanly)
  • golangci-lint --new-from-merge-base=master reports 0 issues on touched Go
  • CI builds grpc-server for cpu-ds4 / cuda13-ds4 (requires protobuf/grpc; not available on the dev host)
  • Hardware-gated multi-process run: BACKEND_TEST_DS4_DISTRIBUTED=1 BACKEND_BINARY=... BACKEND_TEST_MODEL_FILE=... go test ./tests/e2e-backends/...

🤖 Generated with Claude Code

mudler added 12 commits May 30, 2026 20:14
Add worker_main.c, a minimal standalone worker that owns a slice of the
model's transformer layers and serves activations over ds4's own TCP
transport via ds4_dist_run(). It links the same engine objects the
backend already builds (including ds4_distributed.o) and has NO
gRPC/protobuf dependency, so it builds even on hosts lacking protobuf/grpc
dev headers. Launched by `local-ai worker ds4-distributed`.

Wire the ds4-worker CMake target (mirrors grpc-server's object/GPU/native
handling) and have the Makefile copy + clean the binary alongside
grpc-server. Ignore the built ds4-worker artifact.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Copy the standalone ds4-worker binary into the backend package (Linux
package.sh) and the Darwin OCI tar (ds4-darwin.sh: both the explicit copy
and the otool dylib-bundling loop) so distributed workers ship with the
backend.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Add distributed COORDINATOR support to the ds4 backend's gRPC server.
Distributed inference is an engine-level feature: when LoadModel receives
'ds4_role:coordinator', the process populates ds4_engine_options.distributed
(role, layer slice, listen host/port) before ds4_engine_open, then the normal
ds4_session_* generation path runs transparently once the worker route covers
all layers.

- New LoadModel options: ds4_role, ds4_layers (START:END or START:output),
  ds4_listen (host:port), ds4_route_timeout.
- parse_layers_spec() maps the layer spec onto ds4_distributed_layers.
- wait_route_ready() blocks generation until
  ds4_session_distributed_route_ready() reports full coverage (or timeout),
  gating both Predict and PredictStream; returns UNAVAILABLE on timeout/error.
- No ds4_role => g_distributed stays false and wait_route_ready is a no-op,
  so single-node behavior is unchanged.
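The gating described for `wait_route_ready()` amounts to polling a readiness probe until it succeeds, fails, or a deadline passes. A rough Go rendering of that pattern follows. It is a sketch under assumptions, not the backend's C++ code: `WaitRouteReady`, `ErrRouteUnavailable`, and the 10 ms poll interval are all invented for illustration, and the `ready` callback stands in for `ds4_session_distributed_route_ready()`.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// pollInterval is an arbitrary choice for this sketch.
const pollInterval = 10 * time.Millisecond

// ErrRouteUnavailable plays the role of the gRPC UNAVAILABLE status here.
var ErrRouteUnavailable = errors.New("distributed route not ready")

// WaitRouteReady blocks until ready() reports full layer coverage, returns
// an error, or the timeout elapses. With distributed == false it is a no-op,
// so single-node callers pay nothing. A caller would run this BEFORE taking
// any engine lock so that health probes are not blocked while waiting.
func WaitRouteReady(distributed bool, ready func() (bool, error), timeout time.Duration) error {
	if !distributed {
		return nil
	}
	deadline := time.Now().Add(timeout)
	for {
		ok, err := ready()
		if err != nil {
			return fmt.Errorf("%w: %v", ErrRouteUnavailable, err)
		}
		if ok {
			return nil
		}
		if !time.Now().Before(deadline) {
			return fmt.Errorf("%w: timed out after %s", ErrRouteUnavailable, timeout)
		}
		time.Sleep(pollInterval)
	}
}

func main() {
	// Simulate a worker that joins on the third poll.
	polls := 0
	err := WaitRouteReady(true, func() (bool, error) {
		polls++
		return polls >= 3, nil
	}, time.Second)
	fmt.Println("polls:", polls, "err:", err)
}
```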

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
…opts

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Add the ds4WorkerArgs helper plus findDS4Backend/DS4Distributed.Run that
resolve the ds4 backend via the gallery and exec the packaged ds4-worker
binary. Unlike worker_llamacpp.go, ds4 bundles its own dynamic loader
(lib/ld.so) for glibc compatibility, so when present we exec ds4-worker
through that loader with LD_LIBRARY_PATH=<backend>/lib, mirroring
backend/cpp/ds4/run.sh; otherwise we exec it directly.
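The loader fallback described in this commit can be sketched as a function that builds the argv and environment for the worker process. This is an illustrative sketch only: `WorkerCommand` and `fileExists` are hypothetical names (the PR's helpers are `ds4WorkerArgs`, `findDS4Backend`, and `DS4Distributed.Run`, whose signatures are not shown here), and the backend directory in `main` is invented.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// fileExists is a package-level variable so tests can substitute a fake.
var fileExists = func(path string) bool {
	_, err := os.Stat(path)
	return err == nil
}

// WorkerCommand returns the argv and extra environment entries for launching
// ds4-worker from backendDir. When a bundled lib/ld.so exists, the worker is
// started through that loader with LD_LIBRARY_PATH pointing at <backend>/lib;
// otherwise the binary is executed directly.
func WorkerCommand(backendDir string, workerArgs []string) (argv, env []string) {
	worker := filepath.Join(backendDir, "ds4-worker")
	libDir := filepath.Join(backendDir, "lib")
	loader := filepath.Join(libDir, "ld.so")
	if fileExists(loader) {
		argv = append([]string{loader, worker}, workerArgs...)
		env = []string{"LD_LIBRARY_PATH=" + libDir}
		return argv, env
	}
	return append([]string{worker}, workerArgs...), nil
}

func main() {
	// "/opt/backends/ds4" is a made-up path; the worker arguments are left
	// as a placeholder because the ds4-worker flags are not listed here.
	argv, env := WorkerCommand("/opt/backends/ds4", []string{"<ds4-worker args>"})
	fmt.Println("argv:", argv)
	fmt.Println("env: ", env)
}
```

The returned slices could then be passed to an exec call, with `env` appended to the inherited environment.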

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Wire DS4Distributed into the Worker kong command tree so
`local-ai worker ds4-distributed` is available.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Add a ds4 section to the distributed-mode feature docs (coordinator
model YAML, manual worker command, layer-range semantics, the
'GGUF on every machine' requirement, coordinator-listens dial
direction vs llama.cpp) and a terse Distributed mode section to the
ds4 backend agent guide.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Add a self-contained, opt-in Ginkgo spec to the backend e2e suite that
spins a ds4 coordinator (via the packaged run.sh, loaded with
ds4_role/ds4_layers/ds4_listen options) plus a ds4-worker process for
the upper layers, then uses Eventually to assert a short successful
Predict once the layer route forms, before tearing the worker down.

Gated by BACKEND_TEST_DS4_DISTRIBUTED=1 (plus the existing
BACKEND_BINARY + BACKEND_TEST_MODEL_FILE and optional layer/listen/accel
knobs); compiles and skips cleanly with no env, hardware, or model.
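The opt-in gating can be pictured as a small check that yields a skip reason whenever the required environment is missing. The sketch below assumes only the three variable names given in the commit message. `SkipReason` is a hypothetical helper, and the optional layer/listen/accel knobs are omitted because their names are not given.

```go
package main

import (
	"fmt"
	"os"
)

// SkipReason returns a non-empty message when the distributed spec should be
// skipped. getenv is injected so the check can be exercised without touching
// the real process environment.
func SkipReason(getenv func(string) string) string {
	if getenv("BACKEND_TEST_DS4_DISTRIBUTED") != "1" {
		return "BACKEND_TEST_DS4_DISTRIBUTED is not set to 1"
	}
	for _, name := range []string{"BACKEND_BINARY", "BACKEND_TEST_MODEL_FILE"} {
		if getenv(name) == "" {
			return name + " is required"
		}
	}
	return ""
}

func main() {
	if reason := SkipReason(os.Getenv); reason != "" {
		fmt.Println("skipping distributed spec:", reason)
		return
	}
	fmt.Println("distributed spec would run")
}
```

In a Ginkgo suite the returned reason would typically be handed to `Skip(...)` at the start of the spec, which keeps the file compiling and passing on hosts with no model or hardware.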

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
… convention

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
mudler added the enhancement label on May 30, 2026
mudler changed the title from "feat(ds4): layer-split distributed inference (Phase 1: manual coordinator + worker)" to "feat(ds4): layer-split distributed inference" on May 30, 2026
The ds4-worker target is built from worker_main.c (C), so CMake linked it
with the C driver. The nvcc-built ds4_cuda.o (and Obj-C++ ds4_metal.o)
reference the C++ runtime, so the CUDA/Metal builds failed with undefined
libstdc++ symbols (std::__throw_length_error). The CPU build passed because
ds4_cpu.o is pure C. Force LINKER_LANGUAGE CXX so libstdc++ is linked.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
mudler merged commit 07f6c15 into master on May 30, 2026
60 checks passed
mudler deleted the feat/ds4-distributed-impl branch on May 30, 2026 22:09