fix(distributed): stage directory-based models to remote nodes#10175
Merged
Distributed file-staging treated every model path field (ModelFile, etc.) as a single regular file: it os.Open'd the path and streamed its fd as the HTTP PUT body. For directory-based models (e.g. qwen3-tts-cpp, whose weights and tokenizer ggufs live under one directory referenced by parameters.model), opening the directory succeeds but reading its fd returns EISDIR, so routing the model to a remote NATS worker failed with "read /models/<model>: is a directory". Single-file models were unaffected, so only multi-file pipelines (e.g. the realtime TTS stage) broke.

stageModelFiles now detects a directory path field and stages each contained file individually (via the new stageDirectory helper), preserving structure with the existing StagingKeyMapper and rewriting the field to the remote directory (deriving ModelPath as before). countStageableFiles makes the progress total count a directory's files so the staging tracker stays accurate.

Assisted-by: Claude:claude-opus-4-8 go vet
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Problem
In distributed mode, routing a directory-based model to a remote NATS worker fails with an error of the form `read /models/<model>: is a directory`.

`stageModelFiles` treats every model path field (`ModelFile`, `MMProj`, `Tokenizer`, …) as a single regular file. `file_stager_http.go:doUpload` does `os.Open(localPath)` and streams the fd as the HTTP PUT body. Opening a directory succeeds, but reading its fd returns `EISDIR` ("is a directory").

Models whose `parameters.model` points at a directory of files hit this. Example: `qwen3-tts-cpp` ships its weights and tokenizer ggufs together in one directory.

Single-file models (e.g. parakeet → `parakeet-cpp/tdt-0.6b-v3-f16.gguf`) are unaffected, so this surfaces specifically on multi-file pipelines, notably the realtime TTS stage, which fails to load on any worker that doesn't already have the files locally.
Fix
`stageModelFiles` now detects when a path field is a directory and stages each contained file individually via a new `stageDirectory` helper:
- `filepath.WalkDir` the directory and `EnsureRemote` each file using the existing `StagingKeyMapper` (structure-preserving keys, so files land under one tracking-key dir on the worker).
- Rewrite the field to the remote directory and derive `ModelPath` as before, so the backend's `ModelFile` resolves correctly on the worker.
- `countStageableFiles` expands directories when computing the staging progress total, so the tracker doesn't exceed 100%.

Single-file staging is unchanged.
Tests
- `router_dirstage_test.go`: a directory `ModelFile` stages each contained file (not the directory path). Verified RED before the fix, GREEN after.
- `core/services/nodes` suite passes (`ok ... 160s`).