perf(streaming): fast-path unchanged chat streams (#171)
SantiagoDePolonia merged 3 commits into main
Conversation
📝 Walkthrough
The changes introduce a fast-path optimization for streaming chat completions in OpenAI-compatible providers, allowing eligible requests to bypass translation and pass through directly, while adding corresponding test coverage for both passthrough and non-passthrough streaming scenarios.
Sequence Diagram
```mermaid
sequenceDiagram
    participant Client
    participant Handler as ChatCompletion Handler
    participant FastPath as Fast-Path Check
    participant Passthrough as Passthrough Service
    participant Translation as Stream Translation Path
    participant Provider as Provider
    Client->>Handler: Stream ChatCompletion Request
    Handler->>FastPath: canFastPathStreamingChatPassthrough?
    alt Fast-Path Eligible
        FastPath-->>Handler: true
        Handler->>Passthrough: tryFastPathStreamingChatPassthrough
        Passthrough->>Provider: Passthrough (raw HTTP body)
        Provider-->>Passthrough: SSE Stream Response
        Passthrough->>Handler: proxyPassthroughResponse
        Handler-->>Client: Streamed Response (passthrough)
    else Fast-Path Not Eligible
        FastPath-->>Handler: false
        Handler->>Translation: handleStreamingResponse
        Translation->>Provider: StreamChatCompletion (translated)
        Provider-->>Translation: Translated SSE Stream
        Translation->>Handler: Streamed Response
        Handler-->>Client: Streamed Response (translated)
    end
```
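The branch structure in the diagram can be sketched as follows. This is a minimal, self-contained illustration of the dispatch logic, not the repository's actual code: the `Request` type, its fields, and the helper names other than the diagram's own `canFastPathStreamingChatPassthrough`/`handleStreamingResponse` naming are hypothetical stand-ins.

```go
package main

import "fmt"

// Request is a hypothetical stand-in for the incoming chat request context.
type Request struct {
	Stream           bool
	NeedsTranslation bool // whether the body requires provider-specific translation
}

// canFastPath mirrors the canFastPathStreamingChatPassthrough check in the
// diagram: only streaming requests that need no translation are eligible.
func canFastPath(r Request) bool {
	return r.Stream && !r.NeedsTranslation
}

// handle mirrors the alt/else branches of the sequence diagram.
func handle(r Request) string {
	if canFastPath(r) {
		return "passthrough" // raw HTTP body proxied straight to the provider
	}
	return "translated" // full stream-translation path
}

func main() {
	fmt.Println(handle(Request{Stream: true, NeedsTranslation: false})) // passthrough
	fmt.Println(handle(Request{Stream: true, NeedsTranslation: true}))  // translated
}
```

The eligibility check runs once per request, before any body translation, which is what makes the passthrough branch cheap.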
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~22 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@internal/server/translated_inference_service.go`:
- Around line 141-147: proxyPassthroughResponse called from ChatCompletion
currently returns raw stream I/O errors (e.g., from flushStream/copy) which can
leak non-core.GatewayError values; change the call site in ChatCompletion to
mirror handleStreamingResponse’s behavior: invoke
passthrough.proxyPassthroughResponse(...), but do not bubble raw errors — if the
passthrough returns a stream I/O error after response commit, log it via
s.logger/s.usageLogger and convert/suppress it by returning nil (or wrap it as a
core.GatewayError if you must surface an error), ensuring only core.GatewayError
instances are returned to clients; reference
passthroughService.proxyPassthroughResponse, handleStreamingResponse, and
core.GatewayError when making the change.
- Around line 125-129: The fast-path is forwarding c.Request().Body after
canonicalJSONRequestFromSemantics called requestBodyBytes(), which in the
snapshot case returns snapshot.CapturedBodyView() without rehydrating
c.Request().Body, so rebuild the request body from the canonical bytes before
calling passthroughProvider.Passthrough (in translated_inference_service.go
where Passthrough is invoked); specifically, when requestBodyBytes or
canonicalJSONRequestFromSemantics returns snapshot bytes, replace or rehydrate
c.Request().Body with a new readable body constructed from those bytes (or have
requestBodyBytes perform the rehydration) so Passthrough receives a valid,
non-consumed Body.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 6673c86b-d321-4bf0-a585-78a093bbb48f
📒 Files selected for processing (2)
internal/server/handlers_test.go
internal/server/translated_inference_service.go
```go
resp, err := passthroughProvider.Passthrough(ctx, providerType, &core.PassthroughRequest{
	Method:   c.Request().Method,
	Endpoint: endpoint,
	Body:     c.Request().Body,
	Headers:  buildPassthroughHeaders(ctx, c.Request().Header),
```
🧩 Analysis chain
🏁 Script executed:
```shell
#!/bin/bash
set -euo pipefail
echo "== canonicalJSONRequestFromSemantics implementation =="
rg -n -C4 'func canonicalJSONRequestFromSemantics|canonicalJSONRequestFromSemantics\[' internal
echo
echo "== request-snapshot helpers and any body rehydration =="
rg -n -C3 'RequestSnapshot|WithRequestSnapshot|Body\s*=.*(io\.NopCloser|bytes\.NewReader)|SetRequest\(' internal
echo
echo "== fast-path call site =="
sed -n '120,130p' internal/server/translated_inference_service.go
```
Length of output: 50374
🏁 Script executed:
```shell
# Get the complete semanticJSONBody and lookupOrCaptureSemantics implementations
head -60 internal/server/semantic_requests.go
```
Length of output: 1497
🏁 Script executed:
```shell
# Get the requestBodyBytes function from semantic_requests.go
rg -n -A 20 'func requestBodyBytes' internal/server/semantic_requests.go
```
Length of output: 44
🏁 Script executed:
```shell
# Show lines 115-135 of translated_inference_service.go to see the complete fast-path context
sed -n '115,135p' internal/server/translated_inference_service.go
```
Length of output: 684
🏁 Script executed:
```shell
# Find where requestBodyBytes is defined
rg -n 'func requestBodyBytes' internal/server/
```
Length of output: 158
🏁 Script executed:
```shell
# Get the full requestBodyBytes implementation from semantic_requests.go
rg -n -A 25 'func requestBodyBytes' internal/server/semantic_requests.go
```
Length of output: 44
🏁 Script executed:
```shell
# Get the full requestBodyBytes function from request_snapshot.go
sed -n '151,172p' internal/server/request_snapshot.go
```
Length of output: 558
Body rehydration missing for ingress-snapshot fast-path.
When canonicalJSONRequestFromSemantics is called before the fast-path, it invokes requestBodyBytes() which detects the snapshot case and returns snapshot.CapturedBodyView() bytes without rehydrating c.Request().Body. The fast-path then tries to forward c.Request().Body, which remains the original consumed stream, causing the request to fail.
For ingress-managed endpoints that use snapshots, requestBodyBytes() must rehydrate the request body even in the snapshot case, or the fast-path must explicitly reconstruct the body from the canonical bytes before calling Passthrough.
```go
passthrough := passthroughService{
	provider:        s.provider,
	logger:          s.logger,
	usageLogger:     s.usageLogger,
	pricingResolver: s.pricingResolver,
}
return true, passthrough.proxyPassthroughResponse(c, providerType, endpoint, info, resp)
```
Don’t bubble passthrough stream I/O errors out of ChatCompletion.
proxyPassthroughResponse() can return raw flushStream()/copy errors after the response is already committed, while handleStreamingResponse (lines 269-273) records the failure and returns nil. Reusing it directly here changes /v1/chat/completions behavior on client disconnects and mid-stream transport failures, and leaks non-core.GatewayError handler errors on the hot streaming path. Per the coding guidelines, all errors returned to clients must be instances of core.GatewayError.
Summary
Testing
git commit
Benchmark
Local mock benchmark against the OpenAI-compatible streaming path after this change:
/v1/chat/completions improved from 3532 req/s and 13.76 ms p50 TTFB to 3844 req/s and 12.08 ms p50 TTFB
/v1/responses is unchanged by design in this PR
Summary by CodeRabbit
Release Notes
Performance Improvements
Tests