fix(distributed): track in-flight for non-LLM inference methods (VAD, diarize, voice, ...) #10238
Merged
InFlightTrackingClient only wrapped a subset of the grpc.Backend inference methods (Predict, Embeddings, TTS, AudioTranscription, Detect, Rerank, ...). Methods like VAD were left as embedded passthrough, so track() never ran for them.
In distributed mode every model is loaded with in_flight=1 as a reservation; that reservation is only released by the OnFirstComplete callback, which fires after the first *tracked* inference call completes. A VAD-only model (e.g. silero-vad) never calls a tracked method, so the reservation is never released and in-flight stays pinned at 1 forever, which also blocks the router's idle-eviction logic.
Wrap the remaining unary inference methods (VAD, Diarize, Face*, Voice*, TokenClassify, Score, AudioEncode, AudioDecode, AudioTransform) with the same track()/reconcile() pattern. The three bidi-stream constructors (AudioTransformStream, AudioToAudioStream, Forward) are deliberately left as passthrough: their inference spans the stream lifetime, not the constructor call, so track() there would fire onFirstComplete before any data flows.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Problem
In distributed mode, the in-flight counter for some models never returns to 0 even when idle. Reported with silero-vad: its in-flight count stays pinned at 1 forever after the model is loaded, which also blocks the router's idle-eviction logic from ever unloading it.
Root cause
InFlightTrackingClient (core/services/nodes/inflight.go) wraps grpc.Backend to increment/decrement the registry's in-flight counter around each inference call. It only overrode a subset of the inference methods (Predict, Embeddings, TTS, AudioTranscription, Detect, Rerank, ...). Methods like VAD were left as embedded passthrough, so track() never ran for them.
Crucially, every model is loaded with in_flight = 1 as a reservation (router.go, scheduleAndLoad(..., 1)). That reservation is only released by the OnFirstComplete callback, which fires after the first tracked inference call completes. A VAD-only model (silero-vad) never calls a tracked method → the reservation is never released → in-flight is stuck at 1.
The bug applies to every unwrapped unary inference method, not just VAD: Diarize, FaceVerify, FaceAnalyze, VoiceVerify, VoiceAnalyze, VoiceEmbed, TokenClassify, Score, AudioEncode, AudioDecode, AudioTransform. (Models that also expose a tracked method got their reservation released on the first such call, masking the leak.)
Fix
Wrap the remaining unary inference methods with the same track()/reconcile() pattern as the existing ones.
The three bidi-stream constructors (AudioTransformStream, AudioToAudioStream, Forward) are deliberately left as passthrough: their inference spans the stream lifetime, not the constructor call, so wrapping the constructor would fire onFirstComplete and decrement before any data flows. Documented inline.
Tests
Added 12 specs in inflight_test.go asserting each method increments, decrements, and releases the load-time reservation (onFirstComplete). Red before the fix, green after. Full core/services/nodes suite passes.
🤖 Generated with Claude Code
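For reviewers who want the mechanism at a glance, the root cause and fix described above can be sketched in a few lines of Go. This is a simplified illustration under stated assumptions, not the code in inflight.go: the registry, backend, rawBackend, partialClient, fixedClient, and loadWithReservation names are all hypothetical, the backend interface is cut down to two string-based methods, and reconcile() is omitted because the description does not say what it does. Only the shape is taken from the PR: an embedded interface whose non-overridden methods bypass track(), and a load-time reservation released once by an on-first-complete callback.

```go
package main

import (
	"fmt"
	"sync"
)

// registry is a hypothetical stand-in for the node registry's in-flight counter.
type registry struct {
	mu       sync.Mutex
	inFlight int
}

func (r *registry) add(n int) { r.mu.Lock(); r.inFlight += n; r.mu.Unlock() }
func (r *registry) count() int {
	r.mu.Lock()
	defer r.mu.Unlock()
	return r.inFlight
}

// backend is a hypothetical two-method slice of the real backend interface.
type backend interface {
	Predict(in string) string
	VAD(in string) string
}

type rawBackend struct{}

func (rawBackend) Predict(in string) string { return "pred:" + in }
func (rawBackend) VAD(in string) string     { return "vad:" + in }

// partialClient models the pre-fix wrapper: it embeds backend, so every
// method it does not override (here VAD) is a passthrough that skips track().
type partialClient struct {
	backend
	reg             *registry
	once            sync.Once
	onFirstComplete func()
}

// track increments the counter and returns a func that decrements it and,
// on the first completion only, fires onFirstComplete.
func (c *partialClient) track() func() {
	c.reg.add(1)
	return func() {
		c.reg.add(-1)
		c.once.Do(c.onFirstComplete)
	}
}

func (c *partialClient) Predict(in string) string {
	defer c.track()()
	return c.backend.Predict(in)
}

// fixedClient models the post-fix wrapper: VAD is tracked as well.
type fixedClient struct{ *partialClient }

func (c *fixedClient) VAD(in string) string {
	defer c.track()()
	return c.backend.VAD(in)
}

// loadWithReservation mimics loading a model with in-flight = 1; the
// reservation is released by the first tracked call that completes.
func loadWithReservation(b backend) (*registry, *partialClient) {
	reg := &registry{inFlight: 1}
	pc := &partialClient{backend: b, reg: reg}
	pc.onFirstComplete = func() { reg.add(-1) }
	return reg, pc
}

func main() {
	reg, pc := loadWithReservation(rawBackend{})
	pc.VAD("audio") // passthrough: track() never runs
	fmt.Println("untracked VAD, in-flight:", reg.count()) // prints 1

	reg2, pc2 := loadWithReservation(rawBackend{})
	fc := &fixedClient{pc2}
	fc.VAD("audio") // tracked: the reservation is released on completion
	fmt.Println("tracked VAD, in-flight:", reg2.count()) // prints 0
}
```

In this sketch a bidi-stream constructor would return as soon as the stream is opened, so wrapping it the same way would run the deferred release before any data is exchanged, which is the reason given above for leaving those constructors as passthrough.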