fix(distributed): track in-flight for non-LLM inference methods (VAD, diarize, voice, ...)#10238

Merged: mudler merged 1 commit into master from fix/distributed-inflight-untracked-methods on Jun 10, 2026

@localai-bot (Collaborator)

Problem

In distributed mode, the in-flight counter for some models never returns to 0 even when idle. Reported with silero-vad: its in-flight count stays pinned at 1 forever after the model is loaded, which also blocks the router's idle-eviction logic from ever unloading it.

Root cause

InFlightTrackingClient (core/services/nodes/inflight.go) wraps grpc.Backend to increment/decrement the registry's in-flight counter around each inference call. It only overrode a subset of the inference methods (Predict, Embeddings, TTS, AudioTranscription, Detect, Rerank, ...). Methods like VAD were left as embedded passthrough, so track() never ran for them.
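
The passthrough behaviour comes from Go's method promotion on embedded interfaces. The following is a minimal sketch of that mechanism, not the actual LocalAI code: `Backend`, `fakeBackend`, and `trackingClient` are hypothetical two-method stand-ins for `grpc.Backend` and `InFlightTrackingClient`, and a plain call count stands in for the registry counter.

```go
package main

import "fmt"

// Backend is a hypothetical two-method stand-in for grpc.Backend.
type Backend interface {
	Predict(in string) string
	VAD(in string) string
}

// fakeBackend is a trivial implementation used only for this illustration.
type fakeBackend struct{}

func (fakeBackend) Predict(in string) string { return "predict:" + in }
func (fakeBackend) VAD(in string) string     { return "vad:" + in }

// trackingClient embeds Backend, so every method it does not override is
// promoted from the embedded value and bypasses the tracking path.
type trackingClient struct {
	Backend
	calls int // number of calls that went through the tracking path
}

// Predict is overridden, so it is counted before delegating.
func (c *trackingClient) Predict(in string) string {
	c.calls++
	return c.Backend.Predict(in)
}

func main() {
	c := &trackingClient{Backend: fakeBackend{}}
	c.Predict("a")
	c.VAD("b") // promoted method: it works, but it is never counted
	fmt.Println("tracked calls:", c.calls)
}
```

The compiler accepts the un-overridden `VAD` call without complaint, which is why a missing wrapper of this kind is easy to overlook.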

Crucially, every model is loaded with in_flight = 1 as a reservation (router.go, scheduleAndLoad(..., 1)). That reservation is only released by the OnFirstComplete callback, which fires after the first tracked inference call completes. A VAD-only model (silero-vad) never calls a tracked method → the reservation is never released → in-flight is stuck at 1.
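
The reservation accounting can be modelled in a few lines. This is a sketch under assumptions, not the real implementation: the `tracker` type, its fields, and the use of `sync.Once` are illustrative choices, and `reconcile()` is not modelled because the description does not say what it does.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// tracker is a hypothetical model of the in-flight accounting described above.
type tracker struct {
	inFlight        atomic.Int64
	once            sync.Once
	onFirstComplete func()
}

// newTracker starts with in-flight = 1, mirroring the load-time reservation.
func newTracker() *tracker {
	t := &tracker{}
	t.inFlight.Store(1)
	// Releasing the reservation is the job of the first-complete callback.
	t.onFirstComplete = func() { t.inFlight.Add(-1) }
	return t
}

// track increments the counter and returns a function that decrements it
// again; the first completion also releases the reservation, exactly once.
func (t *tracker) track() (done func()) {
	t.inFlight.Add(1)
	return func() {
		t.inFlight.Add(-1)
		t.once.Do(t.onFirstComplete)
	}
}

func main() {
	t := newTracker()
	fmt.Println("after load:", t.inFlight.Load())

	done := t.track()
	fmt.Println("during first call:", t.inFlight.Load())
	done()
	fmt.Println("after first call:", t.inFlight.Load())

	t.track()()
	fmt.Println("after second call:", t.inFlight.Load())
}
```

In this model, if `track()` is never called, nothing ever invokes `onFirstComplete` and the counter stays at its initial value of 1, which matches the reported symptom.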

The bug applies to every unwrapped unary inference method, not just VAD: Diarize, FaceVerify, FaceAnalyze, VoiceVerify, VoiceAnalyze, VoiceEmbed, TokenClassify, Score, AudioEncode, AudioDecode, AudioTransform. (Models that also expose a tracked method got their reservation released on the first such call, masking the leak.)

Fix

Wrap the remaining unary inference methods with the same track() / reconcile() pattern as the existing ones.

The three bidi-stream constructors (AudioTransformStream, AudioToAudioStream, Forward) are deliberately left as passthrough: their inference spans the stream lifetime, not the constructor call, so wrapping the constructor would fire onFirstComplete and decrement the counter before any data flows. This is documented inline.
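
To illustrate why the unary pattern does not fit a stream constructor, here is a small sketch. All names (`stream`, `openStream`, `counter`, `trackedOpen`) are hypothetical, and the counter is deliberately simplified to a plain struct.

```go
package main

import "fmt"

// stream is a hypothetical stand-in for a bidi stream handle.
type stream struct{ open bool }

// openStream returns immediately; the real work happens later on the stream.
func openStream() *stream { return &stream{open: true} }

// counter is a simplified, single-goroutine model of the in-flight registry.
type counter struct {
	inFlight           int
	firstCompleteFired bool
}

// track follows the unary pattern: increment now, decrement when done.
func (c *counter) track() func() {
	c.inFlight++
	return func() {
		c.inFlight--
		c.firstCompleteFired = true
	}
}

// trackedOpen wraps the constructor the same way a unary method is wrapped.
func (c *counter) trackedOpen() *stream {
	done := c.track()
	defer done() // runs as soon as the constructor returns
	return openStream()
}

func main() {
	c := &counter{}
	s := c.trackedOpen()
	// The stream is still open, yet the counter already reads as idle.
	fmt.Println("stream open:", s.open)
	fmt.Println("in-flight:", c.inFlight)
	fmt.Println("first-complete fired:", c.firstCompleteFired)
}
```

Tracking a stream accurately would require tying the decrement to the end of the stream rather than to the constructor's return, which is outside the scope of this change.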

Tests

Added 12 specs in inflight_test.go asserting each method increments, decrements, and releases the load-time reservation (onFirstComplete). Red before the fix, green after. Full core/services/nodes suite passes.

🤖 Generated with Claude Code

InFlightTrackingClient only wrapped a subset of the grpc.Backend
inference methods (Predict, Embeddings, TTS, AudioTranscription, Detect,
Rerank, ...). Methods like VAD were left as embedded passthrough, so
track() never ran for them.

In distributed mode every model is loaded with in_flight=1 as a
reservation; that reservation is only released by the OnFirstComplete
callback, which fires after the first *tracked* inference call completes.
A VAD-only model (e.g. silero-vad) never calls a tracked method, so the
reservation is never released and in-flight stays pinned at 1 forever -
which also blocks the router's idle-eviction logic.

Wrap the remaining unary inference methods (VAD, Diarize, Face*, Voice*,
TokenClassify, Score, AudioEncode, AudioDecode, AudioTransform) with the
same track()/reconcile() pattern. The three bidi-stream constructors
(AudioTransformStream, AudioToAudioStream, Forward) are deliberately left
as passthrough - their inference spans the stream lifetime, not the
constructor call, so track() there would fire onFirstComplete before any
data flows.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
@mudler mudler merged commit fba8c9c into master Jun 10, 2026
56 of 57 checks passed
@mudler mudler deleted the fix/distributed-inflight-untracked-methods branch June 10, 2026 14:29
@localai-bot localai-bot added the bug Something isn't working label Jun 10, 2026