
feat(quota): add provider quota enforcement with per-limit status tracking #1518

Merged
looplj merged 18 commits into looplj:unstable from djdembeck:feature/quota-enforcement
Apr 28, 2026

Conversation

@djdembeck

@djdembeck djdembeck commented Apr 28, 2026

Contributor

Summary

Add quota enforcement infrastructure that monitors provider quota status and integrates it into the request routing pipeline. Channels whose providers report exhausted quotas are filtered from candidate selection (exhausted-only mode) or deprioritized by the load balancer (de-prioritize mode). Per-limit-type status tracking enables granular enforcement: for example, token requests can still be routed to a channel whose image limit is exhausted.

Spirit/Intent

Prevent wasted requests and improve reliability by automatically detecting provider quota exhaustion and steering traffic away from exhausted channels before errors reach the user.

Key Changes

  • Per-limit quota status: Each provider checker now reports per-limit-type entries (token, image) with individual status, usage ratio, and readiness — enabling modality-aware routing decisions
  • Enforcement modes: Exhausted-only mode filters exhausted channels from candidates; De-prioritize mode keeps them as candidates but penalizes them in the load balancer, only failing the request if every channel is exhausted
  • Quota-aware load balancing: Load balancer scores channels based on per-limit effective status, with penalty scaling proportional to usage ratio
  • Candidate filtering: Exhausted channels are removed from candidates in exhausted-only mode; in de-prioritize mode they remain candidates but receive heavy penalties
  • Request modality propagation: The request type (image vs. token) is carried through the routing pipeline so quota decisions use the correct per-limit status
  • API error responses: When all channels are quota-depleted for a model, the API returns HTTP 503 with a structured quota-exhausted error; this applies to chat completion, streaming, doubao, and playground endpoints
  • Frontend settings: Quota enforcement toggle and mode selector in System Settings, with GraphQL schema and resolvers
  • Config: provider_quota.warning_check_interval_ratio controls how often warning-state channels are rechecked (default: 4, i.e. warning channels are rechecked at one quarter of the normal interval)
  • Checker updates: All 7 provider checkers (ClaudeCode, Codex, GitHub Copilot, NanoGPT, NeuralWatt, Synthetic, Wafer) updated to emit per-limit statuses with correct usage ratios
  • Tests: 1500+ lines of new tests covering quota cache, channel status, candidate filtering, load balancer scoring, API error handling, and per-limit routing
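
The two enforcement modes listed above can be illustrated with a minimal Go sketch. This is not the PR's `ProviderQuotaSelector`; the `Channel` type, the string-keyed `Limits` map, and `filterCandidates` are hypothetical names chosen for illustration. It only shows the described behaviour: exhausted-only mode removes channels that are exhausted for the request's limit type, while de-prioritize mode passes every candidate through and leaves the penalty to the load balancer.

```go
package main

import "fmt"

// Status and Mode are simplified stand-ins for the PR's typed enums.
type Status string
type Mode string

const (
	StatusAvailable Status = "available"
	StatusExhausted Status = "exhausted"

	ModeExhaustedOnly Mode = "exhausted_only"
	ModeDePrioritize  Mode = "de_prioritize"
)

// Channel is a hypothetical candidate carrying one status per limit type.
type Channel struct {
	Name   string
	Limits map[string]Status // keyed by limit type, e.g. "token" or "image"
}

// filterCandidates drops channels that are exhausted for limitType, but only
// in exhausted-only mode. It returns the survivors and the number removed.
func filterCandidates(in []Channel, enabled bool, mode Mode, limitType string) ([]Channel, int) {
	if !enabled || mode != ModeExhaustedOnly {
		// De-prioritize mode keeps every candidate; scoring happens later.
		return in, 0
	}
	out := make([]Channel, 0, len(in))
	filtered := 0
	for _, ch := range in {
		if ch.Limits[limitType] == StatusExhausted {
			filtered++
			continue
		}
		out = append(out, ch)
	}
	return out, filtered
}

func main() {
	chans := []Channel{
		{Name: "a", Limits: map[string]Status{"token": StatusAvailable, "image": StatusExhausted}},
		{Name: "b", Limits: map[string]Status{"token": StatusExhausted, "image": StatusAvailable}},
	}
	// A token request keeps channel "a" even though its image limit is exhausted.
	kept, n := filterCandidates(chans, true, ModeExhaustedOnly, "token")
	fmt.Println(len(kept), kept[0].Name, n)
}
```

The removed count is returned alongside the survivors so a caller can later tell "nothing matched" apart from "everything matched but was filtered for quota".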

Risks

  • Cache initialization loads all quota records synchronously on startup; may delay service start with large datasets

- Add QuotaEnforcementSettings to system service with enabled/mode config
- Implement ProviderQuotaSelector to filter exhausted channels at selection time
- Add QuotaAwareStrategy applying -10000 penalty for exhausted channels
- Wire quota-aware components into all load balancers via fx
- Add GraphQL query/mutation for quota enforcement settings control
- Return HTTP 503 with quota_exhausted error code in API handlers
- Add in-memory quota status cache using sync.Map for O(1) lookups
- Add comprehensive tests for quota selector and scoring strategy
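
As a rough illustration of the `sync.Map`-backed status cache mentioned in this commit, here is a minimal sketch. The `quotaCache` type, its integer channel-ID key, and the string status values are assumptions for the example, not the PR's actual types.

```go
package main

import (
	"fmt"
	"sync"
)

// quotaCache is a hypothetical in-memory cache: channel ID -> status string.
// sync.Map makes concurrent reads and writes safe without an explicit mutex.
type quotaCache struct {
	m sync.Map
}

// Set records the latest status reported for a channel.
func (c *quotaCache) Set(channelID int, status string) {
	c.m.Store(channelID, status)
}

// Get returns the cached status, or "unknown" if the channel was never checked.
func (c *quotaCache) Get(channelID int) string {
	v, ok := c.m.Load(channelID)
	if !ok {
		return "unknown"
	}
	return v.(string)
}

func main() {
	var c quotaCache
	c.Set(1, "exhausted")
	fmt.Println(c.Get(1), c.Get(2))
}
```
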
[R1] Only return QuotaExhaustedError when quota filtering caused empty
candidates (ProviderQuotaSelector.FilteredCount > 0), preventing wrong
HTTP 503 for non-quota failures like model-not-found.
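
The [R1] rule can be sketched as follows. The error type and `selectionError` helper are simplified stand-ins (assumptions for this example), showing only the decision: an empty candidate list becomes a quota error only when quota filtering actually removed something.

```go
package main

import (
	"errors"
	"fmt"
)

// QuotaExhaustedError is a simplified stand-in for the PR's error type.
type QuotaExhaustedError struct{ Model string }

func (e *QuotaExhaustedError) Error() string {
	return "all channels for model " + e.Model + " have exhausted quota"
}

// errNoCandidates represents non-quota failures such as model-not-found.
var errNoCandidates = errors.New("no candidates for model")

// selectionError maps the selection outcome to an error. filteredCount is the
// number of candidates removed by quota filtering (hypothetical parameter).
func selectionError(model string, candidates, filteredCount int) error {
	if candidates > 0 {
		return nil
	}
	if filteredCount > 0 {
		return &QuotaExhaustedError{Model: model}
	}
	return errNoCandidates
}

func main() {
	var qe *QuotaExhaustedError
	fmt.Println(errors.As(selectionError("some-model", 0, 2), &qe))
	fmt.Println(errors.Is(selectionError("some-model", 0, 0), errNoCandidates))
}
```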

[D1] Extract writeQuotaExhaustedResponse helper, replacing 5 identical
copy-pasted quota error blocks across chat.go, doubao.go, openai.go.
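
A helper of this kind might look like the sketch below. It uses plain `net/http` rather than whatever framework the handlers actually use, and the JSON body shape is an assumption; only the 503 status and the `quota_exhausted` code come from this PR's description.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// quotaExhaustedCode is the error code named in the PR description.
const quotaExhaustedCode = "quota_exhausted"

// errorBody is a hypothetical response shape, not the project's real schema.
type errorBody struct {
	Error struct {
		Code    string `json:"code"`
		Message string `json:"message"`
	} `json:"error"`
}

// writeQuotaExhausted writes a 503 with a structured quota error body.
func writeQuotaExhausted(w http.ResponseWriter, msg string) {
	var body errorBody
	body.Error.Code = quotaExhaustedCode
	body.Error.Message = msg
	w.Header().Set("Content-Type", "application/json")
	w.WriteHeader(http.StatusServiceUnavailable)
	_ = json.NewEncoder(w).Encode(body)
}

// render runs the helper against an in-memory recorder and returns the
// status code and body, which keeps the example easy to exercise.
func render(msg string) (int, string) {
	rec := httptest.NewRecorder()
	writeQuotaExhausted(rec, msg)
	return rec.Code, rec.Body.String()
}

func main() {
	code, body := render("all channels exhausted")
	fmt.Println(code, body)
}
```

Centralising the response in one function means every endpoint returns the same status and code, which is the point of the deduplication described in [D1].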

[D2] Remove unreachable quota-exhausted checks after ReadHTTPRequest
in doubao.go and openai.go (ReadHTTPRequest only returns io.ReadAll
errors, never QuotaExhaustedError).

[R5] Rewrite quota test to exercise actual production code
(writeQuotaExhaustedResponse) instead of manually reimplementing
handler logic that would pass even if handler broke.

[S1] Define QuotaEnforcementMode typed constants in backend
(ExhaustedOnly, DePrioritize), replacing 40+ raw string literals
across system.go, orchestrator, resolvers, and tests.
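
For illustration, typed mode constants with validation could be sketched like this. The constant names and the `IsValid` method are assumptions; the lowercase JSON values `exhausted_only` and `de_prioritize` are the ones mentioned elsewhere in this PR. The `UnmarshalJSON` method shows one way to reject invalid modes at decode time.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// QuotaEnforcementMode replaces raw string literals with a typed value.
type QuotaEnforcementMode string

const (
	ModeExhaustedOnly QuotaEnforcementMode = "exhausted_only"
	ModeDePrioritize  QuotaEnforcementMode = "de_prioritize"
)

// IsValid reports whether m is one of the known modes.
func (m QuotaEnforcementMode) IsValid() bool {
	switch m {
	case ModeExhaustedOnly, ModeDePrioritize:
		return true
	}
	return false
}

// UnmarshalJSON decodes a JSON string and rejects unknown modes instead of
// silently storing them.
func (m *QuotaEnforcementMode) UnmarshalJSON(data []byte) error {
	var s string
	if err := json.Unmarshal(data, &s); err != nil {
		return err
	}
	mode := QuotaEnforcementMode(s)
	if !mode.IsValid() {
		return fmt.Errorf("invalid quota enforcement mode %q", s)
	}
	*m = mode
	return nil
}

// settings is a hypothetical settings payload used only for this demo.
type settings struct {
	Enabled bool                 `json:"enabled"`
	Mode    QuotaEnforcementMode `json:"mode"`
}

func main() {
	var s settings
	err := json.Unmarshal([]byte(`{"enabled":true,"mode":"de_prioritize"}`), &s)
	fmt.Println(err, s.Mode)
	fmt.Println(json.Unmarshal([]byte(`{"mode":"bogus"}`), &s))
}
```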

[S1] Type QuotaChannelStatus.Status as providerquotastatus.Status
instead of bare string, reusing existing typed enum.

[S2] Extract hardcoded 0.8 warning usage ratio into named constant
(warningUsageRatio) in lb_strategy_quota.go.

[S1] Define QuotaEnforcementMode union type in frontend (TypeScript),
applied to QuotaEnforcementSettings and UpdateQuotaEnforcementSettingsInput.

[S3] Remove dead 'exhausted_only' fallback and 'value &&' guard in
quota-settings.tsx Select component.
Negate de-prioritize score so warning channels rank below available ones while staying above exhausted ones. Add validation rejecting invalid quota enforcement modes.

Fix playground error handling: return 503 for quota-exhausted errors
instead of falling through to generic HTTP error handling. Update
quota enforcement settings resolver to use the non-default variant
so errors surface instead of silently returning zero values.

Clean up stale .playwright-cli/ test artifacts and add to .gitignore.
Populate per-limit QuotaLimitStatus in all 7 checkers based on request modality
(req.Image != nil for image vs token types). Add NextCheckAt =
checkInterval/4 for warning channels. Propagate quota limit type via
context keys. Score and filter providers per-limit using per-limit
effective status evaluation.
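
Propagating the limit type through context keys could look roughly like the sketch below. The `Request`, `ImageParams`, and function names are hypothetical; only the `req.Image != nil` test and the idea of a context key come from the commit message. Defaulting to the token limit type when nothing is set is also an assumption.

```go
package main

import (
	"context"
	"fmt"
)

type LimitType string

const (
	LimitTypeToken LimitType = "token"
	LimitTypeImage LimitType = "image"
)

// limitTypeKey is an unexported key type, so no other package can collide
// with this context value.
type limitTypeKey struct{}

// ImageParams and Request are hypothetical request shapes for this demo.
type ImageParams struct{ Size string }

type Request struct {
	Model string
	Image *ImageParams
}

// withLimitType derives the limit type from the request and stores it.
func withLimitType(ctx context.Context, req Request) context.Context {
	t := LimitTypeToken
	if req.Image != nil {
		t = LimitTypeImage
	}
	return context.WithValue(ctx, limitTypeKey{}, t)
}

// limitTypeFrom reads the limit type, defaulting to token when absent.
func limitTypeFrom(ctx context.Context) LimitType {
	if t, ok := ctx.Value(limitTypeKey{}).(LimitType); ok {
		return t
	}
	return LimitTypeToken
}

// route simulates the pipeline: an early stage annotates the context and a
// later stage (selector or load balancer) reads the value back.
func route(req Request) LimitType {
	ctx := withLimitType(context.Background(), req)
	return limitTypeFrom(ctx)
}

func main() {
	fmt.Println(route(Request{Model: "m"}))
	fmt.Println(route(Request{Model: "m", Image: &ImageParams{Size: "1024x1024"}}))
}
```
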
QuotaEnforcementMode was missing MarshalGQL causing gqlgen runtime type
assertion error when serializing the mode field. Converted mode from
GraphQL String to proper enum type with MarshalGQL/UnmarshalGQL methods
and updated frontend to use SCREAMING_SNAKE_CASE values.
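
For reference, gqlgen serialises custom types through `MarshalGQL(w io.Writer)` and `UnmarshalGQL(v interface{}) error` methods. The sketch below shows one way a string-backed mode could implement them; the constant names are assumptions, and mapping between lowercase stored values and SCREAMING_SNAKE_CASE GraphQL values via `strings.ToUpper`/`strings.ToLower` is a simplification, not necessarily what the PR does.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"strconv"
	"strings"
)

type QuotaEnforcementMode string

const (
	ModeExhaustedOnly QuotaEnforcementMode = "exhausted_only"
	ModeDePrioritize  QuotaEnforcementMode = "de_prioritize"
)

// MarshalGQL writes the mode as a quoted upper-case enum value.
func (m QuotaEnforcementMode) MarshalGQL(w io.Writer) {
	fmt.Fprint(w, strconv.Quote(strings.ToUpper(string(m))))
}

// UnmarshalGQL accepts a string enum value and rejects anything unknown.
// Because of the ToLower call, lower-case input is accepted as well.
func (m *QuotaEnforcementMode) UnmarshalGQL(v interface{}) error {
	s, ok := v.(string)
	if !ok {
		return fmt.Errorf("quota enforcement mode must be a string, got %T", v)
	}
	mode := QuotaEnforcementMode(strings.ToLower(s))
	switch mode {
	case ModeExhaustedOnly, ModeDePrioritize:
		*m = mode
		return nil
	}
	return fmt.Errorf("%q is not a valid QuotaEnforcementMode", s)
}

// gqlString is a small demo helper that captures MarshalGQL output.
func gqlString(m QuotaEnforcementMode) string {
	var buf bytes.Buffer
	m.MarshalGQL(&buf)
	return buf.String()
}

func main() {
	fmt.Println(gqlString(ModeDePrioritize))
	var m QuotaEnforcementMode
	fmt.Println(m.UnmarshalGQL("EXHAUSTED_ONLY"), m)
	fmt.Println(m.UnmarshalGQL(42))
}
```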

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request introduces a system-wide quota enforcement mechanism, allowing AxonHub to manage channels with exhausted provider quotas through either strict filtering or de-prioritization. Key additions include a new "Quota" settings UI, modality-specific (image vs. token) quota tracking across various provider checkers, and an in-memory status cache to optimize routing decisions. Feedback focuses on refining the DE_PRIORITIZE mode to ensure exhausted channels remain available as a last resort, correcting status ranking logic, optimizing service startup by loading the quota cache asynchronously, and maintaining modality-specific limits during error scenarios.

Comment thread internal/server/orchestrator/candidates_quota.go Outdated
Comment thread internal/server/biz/provider_quota.go Outdated
Comment thread internal/server/biz/provider_quota.go Outdated
Comment thread internal/server/biz/provider_quota.go Outdated
@djdembeck djdembeck marked this pull request as draft April 28, 2026 00:58
@djdembeck djdembeck changed the title from "fix: resolve 14 correctness bugs in quota enforcement feature" to "feat(quota): add provider quota enforcement with per-limit status tracking" Apr 28, 2026
…istency

- Fix inverted warning penalty formula in QuotaAwareStrategy (1-usageRatio → usageRatio)
- Fix inverted UsageRatio in NeuralWatt checker (remaining → used fraction)
- Fix usageRatio=0.0 when exhausted in GitHub Copilot checker (default to 1.0)
- Fix PercentUsed≥1.0 mapping to warning instead of exhausted in NanoGPT checker
- Fix UsageRatio=0 when Codex limit exhausted without UsedPercent
- Fix saveQuotaError wiping per-limit cache on transient failures
- Wire warning_check_interval_ratio config through DI instead of hardcoding /4
- Fix EffectiveStatus returning Available for all-Unknown limits
- Return error from UnmarshalJSON on invalid QuotaEnforcementMode
- Replace string literals with typed status constants in orchestrator
- Remove dead IsNotFound check in QuotaEnforcementSettingsOrDefault
- Add per-limit exhausted/edge-case tests for NanoGPT
- Add multi-ratio coverage and penalty-increase test for LB strategy
- Add cache round-trip test for Limits and EffectiveStatus unknown test
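
The corrected penalty direction (first fix above) can be sketched as follows. The -10000 exhausted penalty comes from this PR; `warningScale` is a hypothetical stand-in for the real scale factor, and the sketch ignores the enforcement mode for brevity.

```go
package main

import "fmt"

type Status string

const (
	StatusUnknown   Status = "unknown"
	StatusAvailable Status = "available"
	StatusWarning   Status = "warning"
	StatusExhausted Status = "exhausted"
)

const (
	exhaustedPenalty = -10000.0 // value stated in the PR
	warningScale     = 100.0    // hypothetical scale, for illustration only
)

// clamp01 keeps a usage ratio inside [0, 1].
func clamp01(x float64) float64 {
	if x < 0 {
		return 0
	}
	if x > 1 {
		return 1
	}
	return x
}

// quotaScore returns a non-positive score adjustment. For warning channels
// the penalty grows with usageRatio (the fixed formula), rather than with
// 1-usageRatio (the inverted one).
func quotaScore(status Status, usageRatio float64) float64 {
	switch status {
	case StatusExhausted:
		return exhaustedPenalty
	case StatusWarning:
		return -warningScale * clamp01(usageRatio)
	default: // available and unknown are not penalised
		return 0
	}
}

func main() {
	fmt.Println(quotaScore(StatusAvailable, 0.10))
	fmt.Println(quotaScore(StatusWarning, 0.80))
	fmt.Println(quotaScore(StatusWarning, 0.95))
	fmt.Println(quotaScore(StatusExhausted, 1.00))
}
```

With the inverted formula, a channel at 95% usage would have been penalised less than one at 80%, which is the opposite of the intended ordering.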
…t EffectiveStatus init

- ProviderQuotaSelector: skip filtering in DePrioritize mode so
  QuotaAwareStrategy can penalize exhausted channels instead of
  removing them, preserving the soft enforcement semantic
- EffectiveStatus: initialize worstStatus to StatusUnknown and
  worstReady to false so that Unknown-ranked limits can be
  overridden by any known status instead of returning Available
  incorrectly
- Start: run loadQuotaCache async to avoid blocking startup on
  large quota datasets
- Update tests to expect all candidates in DePrioritize mode
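
The `EffectiveStatus` initialisation fix can be illustrated with a worst-status fold. This is a simplified sketch: the `Limit` type, the `rank` ordering (unknown < available < warning < exhausted), and the omission of the readiness flag are assumptions made for the example.

```go
package main

import "fmt"

type Status string
type LimitType string

const (
	StatusUnknown   Status = "unknown"
	StatusAvailable Status = "available"
	StatusWarning   Status = "warning"
	StatusExhausted Status = "exhausted"

	LimitTypeToken LimitType = "token"
	LimitTypeImage LimitType = "image"
)

// Limit is a hypothetical per-limit entry.
type Limit struct {
	Type   LimitType
	Status Status
}

// rank orders statuses from least to most severe; unknown ranks lowest so
// any known status overrides it.
func rank(s Status) int {
	switch s {
	case StatusAvailable:
		return 1
	case StatusWarning:
		return 2
	case StatusExhausted:
		return 3
	default:
		return 0
	}
}

// effectiveStatus returns the worst status among limits of type t. Starting
// from StatusUnknown (not StatusAvailable) means a channel whose limits are
// all unknown is reported as unknown.
func effectiveStatus(limits []Limit, t LimitType) Status {
	worst := StatusUnknown
	for _, l := range limits {
		if l.Type != t {
			continue
		}
		if rank(l.Status) > rank(worst) {
			worst = l.Status
		}
	}
	return worst
}

func main() {
	limits := []Limit{
		{Type: LimitTypeToken, Status: StatusAvailable},
		{Type: LimitTypeToken, Status: StatusWarning},
		{Type: LimitTypeImage, Status: StatusExhausted},
	}
	fmt.Println(effectiveStatus(limits, LimitTypeToken))
	fmt.Println(effectiveStatus(limits, LimitTypeImage))
	fmt.Println(effectiveStatus([]Limit{{Type: LimitTypeToken, Status: StatusUnknown}}, LimitTypeToken))
}
```
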
Reduce the blind spot between quota checks so that channels in warning
state are detected sooner. At 20m, a Codex channel at 90% usage on the
3h primary window could fully exhaust before the next check. At 5m,
the worst-case blind spot is under 30% of remaining quota even for the
shortest provider windows.

The adaptive warning interval (base/4) now checks every ~75s instead of
5m, catching rapid exhaustion quickly once warning is first detected.

Also enable refetchIntervalInBackground on the frontend quota query so
the UI stays current when the browser tab is backgrounded.
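
The interval figures in this message can be checked with a short worked example. `warningInterval` and `blindSpotShare` are hypothetical helpers; the blind-spot estimate assumes usage accrues uniformly at a rate that would consume the whole quota over the provider window, which is one plausible reading of the numbers rather than the author's stated model.

```go
package main

import (
	"fmt"
	"time"
)

// warningInterval divides the base check interval by the configured ratio.
// A non-positive ratio falls back to the base interval.
func warningInterval(base time.Duration, ratio int) time.Duration {
	if ratio <= 0 {
		return base
	}
	return base / time.Duration(ratio)
}

// blindSpotShare estimates the fraction of the remaining quota that could be
// consumed between two checks, under the uniform-usage assumption above.
func blindSpotShare(interval, window time.Duration, usedRatio float64) float64 {
	consumed := float64(interval) / float64(window)
	return consumed / (1 - usedRatio)
}

func main() {
	window := 3 * time.Hour

	// Adaptive warning interval: base/4.
	fmt.Println(warningInterval(20*time.Minute, 4)) // 5m0s
	fmt.Println(warningInterval(5*time.Minute, 4))  // 1m15s

	// Share of the remaining quota at 90% usage that one interval can burn.
	fmt.Printf("%.2f\n", blindSpotShare(20*time.Minute, window, 0.9))
	fmt.Printf("%.2f\n", blindSpotShare(5*time.Minute, window, 0.9))
}
```

Under that assumption, a 20m interval covers more than the remaining 10% of a 3h window (a share above 1), while a 5m interval covers under 30% of it.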
Add channel-level EffectiveStatus floor when base is exhausted
- Fallback to "unknown" for non-standard status values for consistency
- Lower synthetic warning boundary to prevent false 429 selector picks
- Refactor quotaSelector for correct composition order
- Add UnmarshalGQL for EffectiveStatus type consistency
- Add round-trip test for quota selection logic
- Add config documentation for quota warning interval ratio
Refactor API-specific error formatting for quota exhausted
errors (extract wrapQuotaExhaustedAsResponseError shared helper).

Fix provider_quota warning interval math (multiply vs divide ratio).
Add DE_PRIORITIZE exhaustion detection in candidate selection.
Propagate DB errors in UpdateQuotaEnforcementSettings resolver.
Add usage ratio to wafer quota limits.
Extend synthetic checker test coverage for limits array.
Fix comment: de_prioritize description.
Replace magic numbers (0.8, quota_exhausted) with named constants.
Extract NewTokenLimitStatus() and IsReadyStatus() helpers to deduplicate
quota data construction across 6 provider checkers. Tighten
nextCheckIntervalForStatus() to use providerquotastatus.Status instead
of string. Simplify quota selector wiring in selectCandidates().
Restore doc comments and inline field comments on QuotaChecker interface, QuotaData struct fields, and NanoGPT checker types/functions that existed on unstable but were dropped during the provider_quota package restructure.
Restore doc comments on buildNanoGPTQuotaURL and findEarliestResetAt, plus inline comments on warning state check, nextResetAt calculation, and grace period fallback that were lost during the provider_quota restructure.
@djdembeck djdembeck marked this pull request as ready for review April 28, 2026 06:05
@greptile-apps

greptile-apps Bot commented Apr 28, 2026

Contributor

Greptile Summary

This PR adds a comprehensive provider quota enforcement system that monitors per-limit quota status (token vs. image) across 7 provider checkers and integrates it into the request-routing pipeline via two enforcement modes: ExhaustedOnly (hard-filter exhausted channels) and DePrioritize (score penalty in the load balancer). The implementation is large but well-structured, with 1500+ lines of tests covering quota cache, candidate filtering, load balancer scoring, and API error responses.

Confidence Score: 5/5

Safe to merge; all remaining findings are P2 style/polish issues with no impact on correctness or data integrity.

The feature is well-architected and thoroughly tested (1500+ new test lines). All P0/P1 concerns raised in prior threads have been addressed. The two new findings are P2: a JSON tag/map-key inconsistency that creates no runtime bug under current usage, and an edge-case penalty floor in the warning-scoring path that is unlikely to matter in practice. No data loss, security, or routing correctness issues were found.

internal/server/biz/provider_quota/types.go (JSON tag alignment), internal/server/orchestrator/lb_strategy_quota.go (zero-ratio warning penalty)

Important Files Changed

| Filename | Overview |
| --- | --- |
| internal/server/biz/provider_quota.go | Core quota cache service with QuotaChannelStatus, EffectiveStatus, in-memory sync.Map cache, and roundtrip merge/extract helpers for per-limit data; startup cache load uses the lifecycle context (pre-existing concern noted in threads) |
| internal/server/orchestrator/candidates_quota.go | New ProviderQuotaSelector wraps the candidate chain and filters exhausted channels in ExhaustedOnly mode; correctly passes through all candidates in DePrioritize mode and exposes FilteredCount for the outer error check |
| internal/server/orchestrator/lb_strategy_quota.go | New QuotaAwareStrategy scores channels using EffectiveStatus and scaleScore; exhausted gets -10000 penalty, warning gets negative penalty proportional to usage ratio; correctly reads limit type from context |
| internal/server/orchestrator/select_candidates.go | Wires ProviderQuotaSelector into the candidate chain and adds post-selection checks for both ExhaustedOnly (FilteredCount) and DePrioritize (areAllChannelsExhausted) modes to emit QuotaExhaustedError |
| internal/server/biz/provider_quota/types.go | Adds QuotaLimitStatus, QuotaLimitType, RequestModality, and helper constants; JSON struct tags are snake_case but the DB serialisation path uses camelCase map keys (a minor inconsistency) |
| internal/server/api/chat.go | Adds wrapQuotaExhaustedAsResponseError for non-streaming paths and handles QuotaExhaustedError in FormatStreamError for streaming paths; 503 response code is correct |
| internal/server/biz/system.go | Adds QuotaEnforcementSettings with proper GQL and JSON bidirectional serialisation for both uppercase (GQL) and lowercase (JSON) mode values; default is disabled with ExhaustedOnly mode |
| internal/server/orchestrator/orchestrator.go | Threads quotaProvider into ChatCompletionOrchestrator and adds QuotaAwareStrategy to all three load balancers (adaptive, failover, circuit-breaker) |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Incoming Request] --> B[selectCandidates middleware]
    B --> C[Base selector: model/stream/policy filters]
    C --> D{ProviderQuotaSelector}
    D --> E{Enforcement enabled?}
    E -- No --> F[All candidates pass through]
    E -- Yes --> G{Mode?}
    G -- ExhaustedOnly --> H[Filter: EffectiveStatus == Exhausted for request limit type]
    H --> I{candidates empty?}
    I -- Yes + FilteredCount > 0 --> J[503 QuotaExhaustedError]
    I -- No --> K[LoadBalancedSelector]
    G -- DePrioritize --> L[All candidates pass through]
    L --> K
    K --> M[QuotaAwareStrategy scores each channel]
    M --> N{effectiveStatus?}
    N -- Available --> O[score: 0]
    N -- Unknown --> P[score: 0]
    N -- Warning + DePrioritize --> Q[score: -scaleScore x usageRatio]
    N -- Exhausted --> R[score: -10000]
    O & P & Q & R --> S[Combined sorted candidates]
    S --> T{DePrioritize mode? All channels exhausted?}
    T -- Yes --> J
    T -- No --> U[Store candidates, proceed to routing]

Reviews (3): Last reviewed commit: "fix: use request context instead of Back..."

Comment thread internal/server/orchestrator/select_candidates.go Outdated
Comment thread internal/server/orchestrator/select_candidates.go Outdated
Comment thread internal/server/biz/provider_quota/synthetic_checker.go
Comment thread internal/server/biz/provider_quota.go
Comment thread internal/server/biz/provider_quota/types.go
- Use context.Background() in Start() goroutine to prevent cancellation
- Extract weeklyTokenLimitStatus() for self-documenting non-exhaustion logic
- Replace dead Ready=exhausted check with Ready=true for weekly limits
- Hoist QuotaEnforcementSettingsOrDefault() outside conditional block
- Remove unnecessary quotaSelector nil guard
- Fix struct field alignment in QuotaLimitStatus and QuotaData
…dges

Rename EXHAUSTED_ONLY enforcement mode to Block Exhausted in UI and locale
files for clarity. Add enforcement effect badges (Blocked/Deprioritized)
to quota dialog rows when a channel's quota is exhausted.

Changes:
- QuotaRow now receives enforcementMode prop and renders Blocked or
  Deprioritized badges based on quota status and enforcement setting
- Update EXHAUSTED_ONLY label in en/zh-CN locale files
- Add quota.status.blocked and quota.status.deprioritized keys
- Minor formatting and import cleanup in quota-badges.tsx
@looplj looplj merged commit 1687c56 into looplj:unstable Apr 28, 2026
4 checks passed
@djdembeck djdembeck deleted the feature/quota-enforcement branch April 28, 2026 17:50
2 participants