fix(distributed): stage directory-based models to remote nodes#10175

Merged: mudler merged 1 commit into master from fix/distributed-stage-directory-models on Jun 4, 2026
@localai-bot (Collaborator) commented Jun 4, 2026

Problem

In distributed mode, routing a directory-based model to a remote NATS worker fails with:

staging model files for node dgx-spark: staging model file:
uploading /models/qwen3-tts-cpp to node ...:
Put ".../v1/files/models/qwen3-tts-cpp/qwen3-tts-cpp": read /models/qwen3-tts-cpp: is a directory

stageModelFiles treats every model path field (ModelFile, MMProj, Tokenizer, …) as a single regular file. file_stager_http.go:doUpload does os.Open(localPath) and streams the fd as the HTTP PUT body. Opening a directory succeeds, but reading its fd returns EISDIR ("is a directory").

Models whose parameters.model points at a directory of files hit this. Example: qwen3-tts-cpp ships weights + tokenizer:

/models/qwen3-tts-cpp/qwen3-tts-0.6b-f16.gguf
/models/qwen3-tts-cpp/qwen3-tts-tokenizer-f16.gguf

# qwen3-tts-cpp.yaml
parameters:
    model: qwen3-tts-cpp   # a directory

Single-file models (e.g. parakeet → parakeet-cpp/tdt-0.6b-v3-f16.gguf) are unaffected, so this surfaces specifically on multi-file pipelines — notably the realtime TTS stage, which fails to load on any worker that doesn't already have the files locally.

Fix

stageModelFiles now detects when a path field is a directory and stages each contained file individually via a new stageDirectory helper:

  • Walk the directory with filepath.WalkDir and call EnsureRemote on each file, using the existing StagingKeyMapper (structure-preserving keys, so files land under one tracking-key dir on the worker).
  • Rewrite the field to the remote directory path and derive ModelPath as before, so the backend's ModelFile resolves correctly on the worker.
  • countStageableFiles expands directories when computing the staging progress total, so the tracker doesn't exceed 100%.

Single-file staging is unchanged.

Tests

  • New router_dirstage_test.go: a directory ModelFile stages each contained file (not the directory path). Verified RED before the fix, GREEN after.
  • Full core/services/nodes suite passes (ok ... 160s).

@mudler mudler force-pushed the fix/distributed-stage-directory-models branch from 4a11458 to 3b6c694 Compare June 4, 2026 15:30
Distributed file-staging treated every model path field (ModelFile, etc.)
as a single regular file: it os.Open'd the path and streamed its fd as the
HTTP PUT body. For directory-based models — e.g. qwen3-tts-cpp, whose
weights and tokenizer ggufs live under one directory referenced by
parameters.model — opening the directory succeeds but reading its fd
returns EISDIR, so routing the model to a remote NATS worker failed with
"read /models/<model>: is a directory". Single-file models were unaffected,
so only multi-file pipelines (e.g. the realtime TTS stage) broke.

stageModelFiles now detects a directory path field and stages each
contained file individually (via the new stageDirectory helper), preserving
structure with the existing StagingKeyMapper and rewriting the field to the
remote directory (deriving ModelPath as before). countStageableFiles makes
the progress total count a directory's files so the staging tracker stays
accurate.

Assisted-by: Claude:claude-opus-4-8 go vet
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
@mudler mudler force-pushed the fix/distributed-stage-directory-models branch from 3b6c694 to 8d9e983 Compare June 4, 2026 15:37
@mudler mudler merged commit 92726f7 into master Jun 4, 2026
58 checks passed
@mudler mudler deleted the fix/distributed-stage-directory-models branch June 4, 2026 16:05
@localai-bot localai-bot added the bug Something isn't working label Jun 10, 2026