[Fix] Warm provider auth off main thread#86281
Conversation
|
Codex review: needs maintainer review before merge. Reviewed May 27, 2026, 6:22 PM ET / 22:22 UTC. Summary PR surface: Source +522, Tests +355, Other +6. Total +883 across 21 files. Reproducibility: yes. The linked issue logs show 58-90s provider-auth prewarm stalls, and the PR discussion includes Testbox before/after proof where current main delayed a 100 ms timer to about 34.1s while the PR kept it at 100 ms. Review metrics: 2 noteworthy metrics.
Merge readiness Overall follows the weaker of proof and patch quality, so missing proof can cap an otherwise strong patch. Risk before merge
Maintainer options:
Next step before merge Security Review detailsBest possible solution: Land the worker-based root fix once maintainers accept the auth-cache and packaging risk, keeping fallback computation intact and preserving the release-check guard for the new worker artifact. Do we have a high-confidence way to reproduce the issue? Yes. The linked issue logs show 58-90s provider-auth prewarm stalls, and the PR discussion includes Testbox before/after proof where current main delayed a 100 ms timer to about 34.1s while the PR kept it at 100 ms. Is this the best way to solve the issue? Yes, with maintainer risk acceptance. Moving the expensive warm sweep to a worker while keeping fallback computation is the narrowest root fix I saw; an operator config switch would be only a mitigation. AGENTS.md: found and applied where relevant. Codex review notes: model gpt-5.5, reasoning high; reviewed against fb1dfd486bb9. Label changesLabel changes:
Label justifications:
Evidence reviewedPR surface: Source +522, Tests +355, Other +6. Total +883 across 21 files. View PR surface stats
What I checked:
Likely related people:
What the crustacean ranks mean
Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics. How this review workflow works
|
|
ClawSweeper PR egg ✨ Hatched: 🌱 uncommon Tiny Merge Sprite Hatch commandComment Hatchability rules:
Rarity: 🌱 uncommon. What is this egg doing here?
|
82ffc1b to
077ab2e
Compare
077ab2e to
bfb08ba
Compare
91c57a0 to
624ff7d
Compare
Signed-off-by: samzong <samzong.lu@gmail.com>
c645e61 to
8c3e3d4
Compare
|
Verification before merge: Behavior addressed: provider-auth warm now runs off the main thread while preserving cancellation/stale-generation guards and read-only auth-state semantics.
|
Summary
Motivation
Change Type (select all)
Scope (select all touched areas)
Linked Issue/PR
Real behavior proof (required for external PRs)
distoutput on macOS with Node v24.14.0.provider auth state pre-warmed in 58655ms eventLoopMax=36540.8msandprovider auth state pre-warmed in 67840ms eventLoopMax=36876.3mson 2026.5.22.Root Cause (if applicable)
warmCurrentProviderAuthStateran the catalog/provider/agent auth discovery sweep on the gateway process main thread after gateway ready. On constrained hosts, external auth discovery and store hydration could stall the event loop long enough to break channel startup timers.Regression Test Plan (if applicable)
src/agents/model-provider-auth.test.ts,src/agents/model-provider-auth.worker.test.ts,src/gateway/server-startup-post-attach.test.ts,src/gateway/server-reload-handlers.test.ts,src/gateway/server-methods/models-auth-status.test.ts,src/gateway/server.reload.test.ts,src/infra/tsdown-config.test.ts,test/release-check.test.ts.User-visible / Behavior Changes
Gateway startup, reload, and logout-triggered provider-auth rewarm should no longer block channel timers on the main event loop. Auth answers and config semantics remain unchanged.
Diagram (if applicable)
Security Impact (required)
Yes/No) NoYes/No) YesYes/No) NoYes/No) NoYes/No) NoYes, explain risk + mitigation: runtime auth-store snapshots passed to the worker are reduced to provider-presence-only placeholder credentials, so the worker can preserve auth availability semantics without copying real token material.Repro + Verification
Environment
distSteps
Expected
Actual
node scripts/tsdown-build.mjs --no-cleancompleted successfully.ok-default.ok model-provider-auth-OLG_yXSX.js.NO_FINDINGS.Evidence
Human Verification (required)
What you personally verified (not just CI), and how:
Review Conversations
Compatibility / Migration
Yes/No) YesYes/No) NoYes/No) NoRisks and Mitigations
hasAuthForModelProviderkeeps the existing compute fallback when no prepared snapshot matches.