feat: add response timeout retry configuration #1804
Greptile Summary
This PR adds two configurable timeout fields to the retry policy …
Confidence Score: 5/5
Safe to merge; only minor design observations found, no incorrect runtime behavior in the changed paths.
The timeout guard's atomic state machine is well-structured, nil-guard paths are handled throughout, and both timeout types correctly propagate as distinct sentinel errors that the retry loop can identify. The two observations (the retry delay not being skipped for timeout-triggered switches, and an unreachable else-if branch) do not affect correctness. llm/pipeline/stream.go and llm/pipeline/pipeline.go are where the non-trivial concurrency and retry logic lives; everything else is straightforward plumbing.
Important Files Changed
Sequence Diagram
sequenceDiagram
participant O as Orchestrator
participant P as pipeline.Process
participant PR as processRequest
participant S as stream()
participant G as firstEventTimeoutGuard
participant E as Executor.DoStream
participant N as nextLlmStreamEvent
O->>P: Process(ctx, request)
P->>PR: processRequest(ctx, llmRequest)
alt Streaming request
PR->>S: stream(ctx, executor, req, streamFirstEventTimeout)
S->>G: newFirstEventTimeoutGuard(ctx, timeout)
G-->>S: streamCtx, guard
S->>E: DoStream(streamCtx, req)
E-->>S: outboundStream
Note over G: timer fires, CAS pending to timedOut, cancel(streamCtx)
S->>N: "nextLlmStreamEvent(ctx, llmStream, firstEvent=true, guard)"
N->>N: llmStream.Next() blocked until streamCtx cancelled or event arrives
alt Event arrives first
N->>G: acceptFirstEvent() CAS pending to completed, stop timer
N-->>S: (true, nil)
else Timeout fires first
N->>G: "completeFirstEventPhase() CAS fails state=timedOut"
N->>G: timedOut() returns true
N-->>S: (false, ErrStreamFirstEventTimeout)
end
S-->>PR: stream or ErrStreamFirstEventTimeout
else Non-streaming request
PR->>PR: withNonStreamTimeout(ctx) returns timeoutCtx
PR->>PR: notStream or autoAggregateStream with timeoutCtx
Note over PR: context.WithTimeoutCause sets cause=ErrNonStreamResponseTimeout on expiry
PR->>PR: isNonStreamTimeout(timeoutCtx) via context.Cause
PR-->>P: response or ErrNonStreamResponseTimeout
end
alt isResponseTimeoutError(lastErr)
P->>P: skip same-channel retry
P->>P: NextChannel() channel switch
else Normal error
P->>P: CanRetry() same-channel retry first
P->>P: NextChannel() if same-channel exhausted
end
P->>P: time.Sleep(retryDelay) applied to all retries including timeout
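The guard's state machine in the diagram (pending, then either completed or timedOut, decided by a single compare-and-swap) can be sketched roughly as follows. This is a minimal illustration and not the code from llm/pipeline/stream.go: the type and method names are borrowed from the diagram, while the state constants, the struct layout, and the `simulateFirstEvent` helper are assumptions made so the example is self-contained.

```go
package main

import (
	"context"
	"errors"
	"sync/atomic"
	"time"
)

// ErrStreamFirstEventTimeout mirrors the sentinel named in the diagram.
var ErrStreamFirstEventTimeout = errors.New("stream first event timeout")

// Hypothetical state values; the real constants may differ.
const (
	statePending int32 = iota
	stateCompleted
	stateTimedOut
)

type firstEventTimeoutGuard struct {
	state  int32
	timer  *time.Timer
	cancel context.CancelFunc
}

// newFirstEventTimeoutGuard returns a cancellable stream context and a guard.
// A timeout of 0 (or less) disables the guard entirely.
func newFirstEventTimeoutGuard(ctx context.Context, timeout time.Duration) (context.Context, *firstEventTimeoutGuard) {
	streamCtx, cancel := context.WithCancel(ctx)
	g := &firstEventTimeoutGuard{cancel: cancel}
	if timeout <= 0 {
		g.state = stateCompleted // nothing to race against
		return streamCtx, g
	}
	g.timer = time.AfterFunc(timeout, func() {
		// Only the winner of the CAS may cancel the stream.
		if atomic.CompareAndSwapInt32(&g.state, statePending, stateTimedOut) {
			cancel()
		}
	})
	return streamCtx, g
}

// acceptFirstEvent reports whether the first event beat the timer.
func (g *firstEventTimeoutGuard) acceptFirstEvent() bool {
	if atomic.CompareAndSwapInt32(&g.state, statePending, stateCompleted) {
		if g.timer != nil {
			g.timer.Stop()
		}
		return true
	}
	return atomic.LoadInt32(&g.state) == stateCompleted
}

func (g *firstEventTimeoutGuard) timedOut() bool {
	return atomic.LoadInt32(&g.state) == stateTimedOut
}

// simulateFirstEvent is a hypothetical stand-in for DoStream plus the first
// pre-read: the "upstream" delivers its first event after eventDelayMs.
func simulateFirstEvent(timeoutMs, eventDelayMs int) error {
	timeout := time.Duration(timeoutMs) * time.Millisecond
	streamCtx, g := newFirstEventTimeoutGuard(context.Background(), timeout)
	defer g.cancel()
	select {
	case <-time.After(time.Duration(eventDelayMs) * time.Millisecond):
		if g.acceptFirstEvent() {
			return nil
		}
		return ErrStreamFirstEventTimeout
	case <-streamCtx.Done():
		if g.timedOut() {
			return ErrStreamFirstEventTimeout
		}
		return streamCtx.Err()
	}
}
```

Because both the timer callback and `acceptFirstEvent` go through the same CAS from the pending state, exactly one of them wins, which is the property the later race fixes in this PR appear to be protecting.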
Reviews (6): Last reviewed commit: "Merge pull request #5 from xuyufengfei-c..."
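…view-fixes fix(retry): address response timeout review feedback
…view-fixes fix(retry): eliminate the streaming first-token timeout race
refactor(retry): rename the streaming pre-read method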
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
…imeout-race fix(retry): preserve the first-token timeout's winning state
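* feat(retry): add response timeout retry configuration
* fix(retry): cover timeouts for unresponsive upstreams
* fix(retry): distinguish parent-level non-streaming timeouts
* fix(retry): address timeout review feedback
* fix(retry): eliminate the streaming first-token timeout race
* refactor(retry): rename the streaming pre-read method
* Update llm/pipeline/stream.go Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
* fix(retry): preserve the first-token timeout's winning state
* test(retry): remove the first-token timeout regression test
---------
Co-authored-by: xuyufengfei-cyber <xuyufengfei-cyber@users.noreply.github.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>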
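Adds two settings to the retry configuration: a streaming first-token timeout and a non-streaming response timeout. A value of 0 disables the corresponding timeout.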
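As a rough illustration of the "0 means disabled" convention, the sketch below converts a configured value into a duration plus an enabled flag. The struct, its field names, and the choice of seconds as the unit are all assumptions for this example; the PR description does not state the actual field names or units.

```go
package main

import "time"

// RetryTimeouts is a hypothetical shape for the two new settings.
// Field names and the seconds unit are assumptions, not taken from the PR.
type RetryTimeouts struct {
	StreamFirstEventTimeoutSec  int // 0 disables the streaming first-token timeout
	NonStreamResponseTimeoutSec int // 0 disables the non-streaming response timeout
}

// durationOrDisabled converts a configured value into a duration.
// The boolean is false when the timeout is switched off (0 or negative).
func durationOrDisabled(seconds int) (time.Duration, bool) {
	if seconds <= 0 {
		return 0, false
	}
	return time.Duration(seconds) * time.Second, true
}
```

Returning an explicit flag, instead of letting callers compare against zero, keeps the "disabled" check in one place.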
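The streaming first-token timeout covers both the DoStream stream-establishment phase and the pre-read of the first LLM event, so a timeout error is raised when the upstream server is down or hung and returns no data at all. The non-streaming timeout covers the request call itself; an unresponsive upstream is classified as a non-streaming response timeout error.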
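For the non-streaming path, the diagram mentions `context.WithTimeoutCause` and `context.Cause`. A minimal sketch of that pattern, assuming Go 1.21 or later, is shown below. The names `withNonStreamTimeout`, `isNonStreamTimeout`, and `ErrNonStreamResponseTimeout` come from the diagram, but their signatures here are guesses, and `simulateNonStream` is a hypothetical stand-in for the real upstream call.

```go
package main

import (
	"context"
	"errors"
	"time"
)

// ErrNonStreamResponseTimeout mirrors the sentinel named in the diagram.
var ErrNonStreamResponseTimeout = errors.New("non-stream response timeout")

// withNonStreamTimeout attaches the sentinel as the cancellation cause.
// A timeout of 0 (or less) leaves the context untouched.
func withNonStreamTimeout(ctx context.Context, timeout time.Duration) (context.Context, context.CancelFunc) {
	if timeout <= 0 {
		return ctx, func() {}
	}
	return context.WithTimeoutCause(ctx, timeout, ErrNonStreamResponseTimeout)
}

// isNonStreamTimeout is true only when our own timer expired, not when a
// parent deadline or an explicit cancellation ended the context.
func isNonStreamTimeout(ctx context.Context) bool {
	return errors.Is(context.Cause(ctx), ErrNonStreamResponseTimeout)
}

func ms(n int) time.Duration { return time.Duration(n) * time.Millisecond }

// simulateNonStream pretends the upstream answers after latencyMs.
// parentMs > 0 adds a deadline on the parent context as well.
func simulateNonStream(parentMs, timeoutMs, latencyMs int) error {
	parent := context.Background()
	if parentMs > 0 {
		var cancelParent context.CancelFunc
		parent, cancelParent = context.WithTimeout(parent, ms(parentMs))
		defer cancelParent()
	}
	ctx, cancel := withNonStreamTimeout(parent, ms(timeoutMs))
	defer cancel()
	select {
	case <-time.After(ms(latencyMs)):
		return nil // response arrived in time
	case <-ctx.Done():
		if isNonStreamTimeout(ctx) {
			return ErrNonStreamResponseTimeout
		}
		return ctx.Err() // parent deadline or cancellation
	}
}
```

Checking the cause, not just `ctx.Err()`, is what lets a caller tell the pipeline's own timeout apart from a deadline inherited from the parent context, which reports `context.DeadlineExceeded` in both cases.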
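Timeout-triggered retries bypass the same-channel maximum retry limit and switch directly to the next channel according to the load-balancing strategy.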
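The channel-switching rule can be sketched as a small retry loop. This is a simplified illustration and not the logic in llm/pipeline/pipeline.go: `isResponseTimeoutError` and the two sentinel errors are named in the review, while `runWithRetry`, its parameters, the `errUpstream` value, and the in-order channel list (standing in for the load balancer) are assumptions. The retry delay is omitted for brevity.

```go
package main

import "errors"

var (
	ErrStreamFirstEventTimeout  = errors.New("stream first event timeout")
	ErrNonStreamResponseTimeout = errors.New("non-stream response timeout")
	// errUpstream is a hypothetical non-timeout failure used for illustration.
	errUpstream = errors.New("upstream error")
)

// isResponseTimeoutError reports whether err is one of the two timeout sentinels.
func isResponseTimeoutError(err error) bool {
	return errors.Is(err, ErrStreamFirstEventTimeout) ||
		errors.Is(err, ErrNonStreamResponseTimeout)
}

// runWithRetry tries channels in the given order (a stand-in for the load
// balancer's choice). Ordinary errors are retried on the same channel up to
// maxSameChannelRetries times; timeout errors move to the next channel at once.
// It returns the channels attempted, in order, and the final error.
func runWithRetry(channels []string, maxSameChannelRetries int, attempt func(channel string) error) ([]string, error) {
	var tried []string
	var lastErr error
	for _, ch := range channels {
		for try := 0; try <= maxSameChannelRetries; try++ {
			tried = append(tried, ch)
			lastErr = attempt(ch)
			if lastErr == nil {
				return tried, nil
			}
			if isResponseTimeoutError(lastErr) {
				break // skip remaining same-channel retries
			}
		}
	}
	return tried, lastErr
}
```

With two channels and two same-channel retries allowed, a timeout on the first channel leads to one attempt there followed by the second channel, whereas an ordinary error leads to three attempts on the first channel before switching.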