RFC: model-gateway primitives (alloy + cascade + dispatcher) #13

Merged
bglusman merged 4 commits into main from rfc/model-gateway-primitives on Apr 23, 2026

@bglusman
Owner

Summary

Design doc proposing three clearly-scoped model-gateway primitives plus a shared TokenEstimator trait, in response to the architectural gap surfaced by trying to use alloys for kimi-plus-local hybrids.

Problem: alloys today assume constituents are interchangeable, meaning any constituent could serve any request. That breaks when constituents have wildly different context windows (e.g. local Qwen at 32K + Kimi K2.6 at 262K): a 100K request routed to Qwen is silently truncated.

Proposal:

  • Alloy (today, + safety fixes): blend equivalent models via sampling
  • Cascade (promoted to named primitive): try in order, fall through on error (with new size-awareness)
  • Dispatcher (new): pick by request shape; MVP rule type is max_input_tokens, extensible later
  • Shared TokenEstimator trait with configurable chars-per-token default (3.5) + safety margin (10%), per-model overrides, pluggable real-tokenizer backends as future feature flags
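
The estimator bullet above can be sketched in Rust as follows. This is a minimal illustration, not the RFC's implementation: `TokenEstimator`, `CharRatioEstimator`, and `estimate_text` are names that appear in this thread, while the `Default` impl and the field layout are assumptions. The 10% margin is written as a 1.10 multiplier, matching the `safety_margin = 1.10` config excerpts quoted in the review.

```rust
/// Estimates how many tokens a piece of text will consume.
pub trait TokenEstimator {
    fn estimate_text(&self, text: &str) -> usize;
}

/// Character-ratio heuristic: chars / chars_per_token, padded by a margin.
pub struct CharRatioEstimator {
    /// Average characters per token (3.5 is the proposed default).
    pub chars_per_token: f32,
    /// Multiplier >= 1.0 that biases toward over-estimation (1.10 = 10% pad).
    pub safety_margin: f32,
}

impl Default for CharRatioEstimator {
    fn default() -> Self {
        Self { chars_per_token: 3.5, safety_margin: 1.10 }
    }
}

impl TokenEstimator for CharRatioEstimator {
    fn estimate_text(&self, text: &str) -> usize {
        // Count Unicode scalar values rather than bytes, so multi-byte text
        // is not inflated before the ratio is applied.
        let chars = text.chars().count() as f32;
        // Round up: over-estimating is the safe direction for a fit check.
        (chars / self.chars_per_token * self.safety_margin).ceil() as usize
    }
}
```

Because the result is rounded up after the margin is applied, a 7-character string estimates to 3 tokens with the defaults (2 tokens by ratio, 2.2 after padding). Per-model overrides would construct the same struct with a different `chars_per_token`.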

Review asks

Specific questions called out in the doc under "Open questions for reviewers":

  1. Naming: dispatcher vs router/tier/fit/selector — I'm going with dispatcher but open to alternatives
  2. Cascade as a named primitive or keep implicit inside alloy fallbacks? My lean: promote
  3. Safety margin default: 10%. Too much, too little, or about right?
  4. Per-primitive tokenizer override: v1 or defer to v2?
  5. Re-evaluation strategy default: sticky vs per_turn vs worst_case for sessions growing past the chosen tier

Scope boundaries

In this PR: design only. No code changes.

Immediate follow-up (task #23): small safety assertion PR — adds context_window + min_context_window fields to alloy config with startup validation. Independent of this RFC's resolution; unblocks alloy safety now.

Subsequent follow-ups if RFC approved: TokenEstimator impl, cascade named primitive, dispatcher impl, README updates.

Test plan

  • No code. Doc review only.
  • Reviewer confirms the naming + open questions
  • Reviewer confirms scope boundaries

🤖 Generated with Claude Code

Proposes three explicitly-scoped primitives for model-gateway routing,
each with distinct semantics, and a shared TokenEstimator trait so they
can reason about context-window fit consistently.

Problem motivating this: alloys today assume constituents are
interchangeable, which breaks when mixing models with wildly different
context windows (e.g., local Qwen at 32K and Kimi K2.6 at 262K). Using
an alloy for size-dependent routing leads to silent truncation.

Proposal:
- Alloy (today, + safety) blends equivalent models via sampling. New:
  min_context_window assertion + runtime fit-check rejection.
- Cascade (promoted to named primitive) tries members in order on error.
  New: skip-on-size when a cascade step can't fit the request.
- Dispatcher (new) picks by request shape, MVP rule type is
  max_input_tokens; extensible to other matchers.

Shared TokenEstimator trait:
- Default: CharRatioEstimator (chars-per-token configurable, 3.5 default)
- Safety margin (default 10%) to bias toward over-estimation and avoid
  the silent-truncation footgun.
- Pluggable for real tokenizers (tiktoken-rs, sentencepiece) as future
  opt-in crates behind feature flags.
- Configurable globally, per-primitive, and per-model with clear
  precedence.

Includes naming discussion (going with "dispatcher"), migration plan
(no breaking changes), edge cases (session-context growth, tool-use
inflation, streaming output), and open questions for reviewers.

Implementation split into focused follow-up PRs. RFC is the long-form
design; PRs execute.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot AI left a comment

Pull request overview

Adds an RFC/design document to define model-gateway routing primitives and token fit-checking semantics for zeroclawed, motivated by hybrid deployments with widely differing context windows.

Changes:

  • Introduces three proposed gateway primitives: Alloy (equivalent-model blending), Cascade (ordered fallback), and Dispatcher (request-shape routing).
  • Proposes a shared TokenEstimator trait with a default char-ratio heuristic and future pluggable tokenizer backends.
  • Documents composition patterns, migration notes, and open questions for reviewers (naming, safety margin, re-evaluation strategy).

Comment on lines +80 to +83
id = "kimi-with-fallback"
# First success wins. Cascading ONLY on error (not on size).
# Caller's responsibility: ensure request fits the NARROWEST member, OR wrap cascade inside a dispatcher.
[[cascades.steps]]
Comment thread docs/rfcs/model-gateway-primitives.md Outdated
```toml
context_window = 262144 # tokens
# equivalently:
context_window = "262K" # parsed as 262 * 1024
```
Comment on lines +210 to +214
**Rationale for default values:**

- **3.5 chars/token** — common average for English prose with GPT-family tokenizers. Code and Chinese text are denser (~2.5 and ~1.5 respectively). Under-estimating for code/non-English is the *risk case*, hence the safety margin.
- **10% safety margin** — covers most under-estimates without wasting model context.

Comment thread docs/rfcs/model-gateway-primitives.md Outdated
Comment on lines +361 to +365
Output tokens don't count against the input-context budget. Our estimator is *input* estimator only.

### 4. Reasoning tokens (e.g., Kimi K2.6 reasoning mode)

Reasoning tokens add output cost but are usually *not* in the input context. Estimator doesn't need to account for them.
Comment on lines +245 to +270
```toml
[model_defaults]
chars_per_token = 3.5
safety_margin = 1.10

[[models]]
id = "qwen3.5-35b"
context_window = 32768
chars_per_token = 3.0 # Qwen tokenizer tends denser

[[models]]
id = "kimi-k2.6"
context_window = 262144
chars_per_token = 2.8 # Chinese-English mixed, code-heavy
```

### Config surface

**Global default** in top-level config:

```toml
[tokenizer]
kind = "char_ratio" # "char_ratio" | "tiktoken" | "sentencepiece"
chars_per_token = 3.5
safety_margin = 1.10
```
Comment thread docs/rfcs/model-gateway-primitives.md Outdated
- **Dispatcher**: rule targets are either sized (matchable) or "any" (catch-all, always matches last). Unsized rules only make sense as catch-alls.
- **Cascade**: unsized steps always considered eligible (no pre-skip).

Explicit `context_window = 0` means "unknown/unbounded" and skips size-checks for that member.
Comment thread docs/rfcs/model-gateway-primitives.md Outdated

**Purpose:** reliability. Try primary, on timeout/5xx/429 try secondary.

**Today's behavior** (implicit in alloy `fallbacks` array) → **promoted** to its own named primitive for clarity:
Captures reviewer resolutions:
- dispatcher confirmed as the size-routing primitive name
- cascade promoted to a named primitive
- safety margin split into two distinct knobs: estimator safety_margin
  (counting-accuracy pad) and per-model capacity_fraction (avoid the
  quality-degradation zone near a model's ceiling). Composition formula
  and rationale added.
- TiktokenEstimator included in v1 behind a feature flag; SentencePiece
  deferred to avoid C++ deps + per-model vocab plumbing for now
- dispatcher reevaluate default flipped to per_turn (task-completion
  flows benefit from auto-promotion over consistency); sticky and
  sticky_escalate documented as opt-ins
- dispatcher rule semantics simplified: default is "first target whose
  effective ceiling fits the request", computed per-target from
  context_window × capacity_fraction. Explicit rules remain for
  non-size routing.
- alloy context_window required on every constituent. No back-compat
  for missing fields — prototype phase, owned installations, worth the
  one-time config edit to eliminate silent truncation.
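
The default dispatcher rule ("first target whose effective ceiling fits the request") could look roughly like the sketch below. It is an illustration under assumptions: the `Target` struct, the `dispatch` function, and the 0.8 value used for `capacity_fraction` in the usage note are hypothetical. Only `context_window`, `capacity_fraction`, and the product of the two come from the commit message. The estimate passed in is assumed to be margin-padded already by the estimator, so nothing here multiplies by `safety_margin` again.

```rust
/// A dispatcher target with its declared size (hypothetical struct).
pub struct Target {
    pub id: &'static str,
    /// Declared context window, in tokens.
    pub context_window: u32,
    /// Fraction of the window treated as usable, e.g. 0.8.
    pub capacity_fraction: f64,
}

impl Target {
    /// Effective ceiling = context_window × capacity_fraction, rounded down.
    pub fn effective_ceiling(&self) -> u64 {
        (self.context_window as f64 * self.capacity_fraction).floor() as u64
    }
}

/// Returns the first target, in declaration order, whose effective ceiling
/// can hold the already-padded token estimate. `None` means nothing fits.
pub fn dispatch(targets: &[Target], estimated_tokens: u64) -> Option<&Target> {
    targets
        .iter()
        .find(|t| estimated_tokens <= t.effective_ceiling())
}
```

With a 32768-token local model and a 262144-token remote model, both at a hypothetical `capacity_fraction` of 0.8, a 10,000-token request lands on the local model, a 100,000-token request is promoted to the remote one, and a 250,000-token request matches nothing and should surface as an explicit error rather than a truncated call.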

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
bglusman added a commit that referenced this pull request Apr 23, 2026
Per RFC #13 review decision (no back-compat in prototype phase): every
alloy constituent must declare its context_window. Serde enforces the
field at config load, so misconfigured alloys fail with a clear parse
error instead of quietly accepting "unknown" sizes that could mask
routing bugs later.

## Config changes

AlloyConstituentConfig.context_window is now a required u32 (was an
optional field in the previous draft). Every existing config must be
updated to declare sizes; live config on .210 already patched in the
matching commit on infra.

AlloyConfig.min_context_window remains optional; when unset, it's
auto-computed as min(constituent.context_window). When set, validation
rejects any constituent whose declared size falls below it with a
clear error naming both the offender and the numbers.
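
A rough sketch of the resolution rule described above, assuming a trimmed-down `Constituent` struct and a free function named `resolve_min_context_window` (both hypothetical; the real types are `AlloyConstituentConfig` and `AlloyConfig`, whose full shape is not shown in this thread). The zero checks mirror the follow-up fix commit that rejects `context_window = 0` and `min_context_window = 0`.

```rust
/// Hypothetical, trimmed-down stand-in for an alloy constituent.
pub struct Constituent {
    pub model: String,
    pub context_window: u32,
}

/// Resolves the alloy's floor: the explicit value if set and satisfied by
/// every constituent, otherwise the minimum declared window.
pub fn resolve_min_context_window(
    constituents: &[Constituent],
    explicit_min: Option<u32>,
) -> Result<u32, String> {
    // A zero-sized window is treated as misconfiguration, not "unknown".
    if let Some(c) = constituents.iter().find(|c| c.context_window == 0) {
        return Err(format!("constituent '{}' declares context_window = 0", c.model));
    }
    match explicit_min {
        Some(0) => Err("min_context_window must be greater than 0".to_string()),
        Some(floor) => match constituents.iter().find(|c| c.context_window < floor) {
            // Name both the offender and the numbers in the error.
            Some(c) => Err(format!(
                "constituent '{}' has context_window {} below min_context_window {}",
                c.model, c.context_window, floor
            )),
            None => Ok(floor),
        },
        // Unset: auto-compute as the smallest declared window.
        None => constituents
            .iter()
            .map(|c| c.context_window)
            .min()
            .ok_or_else(|| "alloy has no constituents".to_string()),
    }
}
```

For the two sizes mentioned in this commit (262144 and 64000), leaving the floor unset resolves to 64000, while an explicit floor of 100000 is rejected with an error naming the smaller constituent.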

## Runtime exposure

AlloyProvider::min_context_window() returns u32 (no longer Option),
since size declaration is always present.

## Tests

7 tests: round_robin/weighted/stats unchanged; new tests cover
auto-compute from mixed-size constituents, shared-size parity, explicit
min priority, and explanatory error on constituent-below-floor. The
"no declared size" test from the previous draft is deleted — that
state is now unreachable.

## Live config migration

Deployed alongside this PR: .210's kimi-for-coding alloy gets
context_window declared on both constituents (262144 for Kimi,
64000 for DeepSeek V3). No other nodes use [[alloys]] today.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
bglusman added a commit that referenced this pull request Apr 23, 2026
- cargo fmt: collapse multi-line array in alloy.rs test (CI fmt gate)
- Reject context_window=0 on constituents (clear misconfig error)
- Reject min_context_window=0 at alloy build (same)
- Drop stale doc reference to docs/rfcs/model-gateway-primitives.md
  (lives in PR #13, not yet on main)
- Simplify min_context_window doc: context_window is required on
  constituents now, so "treated as unknown" qualifier is dead.

Addresses Copilot review feedback on #14.
- Cascade semantics: reconcile "errors only" with "pre-skip unfit steps".
  Now distinguishes ELIGIBILITY (size-based, checked before attempt) from
  RETRY TRIGGER (errors: timeout, 5xx, 429).
- Fix fabricated mechanism reference: there is no `fallbacks` field on
  AlloyConfig. Describe the real mechanism (`ordered_models` returned
  from `select_plan` and iterated in `route_with_fallback`).
- CharRatio safety margin: explicitly flag that 10% is insufficient for
  CJK-heavy prompts; suggest remediations (tune chars_per_token, raise
  safety_margin, or switch to Tiktoken).
- Correct "streaming output doesn't count" — it DOES count against the
  combined context. Describe the input + max_tokens check approach and
  the default output budget callers must supply.
- Fix K-suffix math: 262K = 262*1024 = 268288, not 262144. Add
  clarifying example using "256K" → 262144.
- Remove stale "context_window=0 = unknown" sentinel note; PR #14
  rejects 0 explicitly at alloy validation.
- Call out [tokenizer], [model_defaults], [[models]] as PROPOSED
  schema additions, not current.
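
The "input + max_tokens" check mentioned in the streaming-output item can be illustrated with a small helper. This is a sketch under assumptions: the function name, its parameters, and the idea of passing the default output budget as an argument are hypothetical, since the commit message only states that output counts against the combined context and that callers must supply a default budget.

```rust
/// Returns true when the estimated input plus the reserved output budget
/// fits inside the model's context window.
///
/// `max_tokens` is the caller's requested output cap, if any; when it is
/// absent, `default_output_budget` is reserved instead.
pub fn fits_with_output(
    estimated_input: u64,
    max_tokens: Option<u64>,
    default_output_budget: u64,
    context_window: u64,
) -> bool {
    let output = max_tokens.unwrap_or(default_output_budget);
    // saturating_add avoids overflow if a caller passes an absurd max_tokens.
    estimated_input.saturating_add(output) <= context_window
}
```

A 30,000-token input with `max_tokens = 4096` does not fit a 32768-token window (34,096 total), even though the input alone would.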
Copilot AI review requested due to automatic review settings April 23, 2026 18:21

Copilot AI left a comment

Pull request overview

Adds a new RFC documenting proposed “model-gateway primitives” (Alloy, Cascade, Dispatcher) and a shared TokenEstimator abstraction to make routing/fallback behavior context-window aware and prevent silent truncation.

Changes:

  • Introduces an RFC defining three routing/blending primitives (alloy, cascade, dispatcher) and how they compose.
  • Proposes a shared TokenEstimator trait (default char-ratio heuristic; future real-tokenizer backends).
  • Specifies safety mechanisms around context window sizing, output-token headroom, and migration strategy.

```toml
[[alloys]]
id = "fast-smart-blend"
Comment thread docs/rfcs/model-gateway-primitives.md Outdated
weight = 20
```

**Validation:** at `AlloyProvider::new()`, error if any constituent's declared `context_window < min_context_window`. Catches the "I didn't mean to put a 32K and a 262K in the same alloy" footgun at config-load time.
Comment thread docs/rfcs/model-gateway-primitives.md Outdated
Comment on lines +90 to +92
with every remaining constituent appended in deterministic order, and the proxy
iterates them in `route_with_fallback()` until one succeeds. That "all
constituents of the alloy are also fallbacks" pattern is what the cascade
Comment on lines +240 to +244
impl TokenEstimator for CharRatioEstimator {
    fn estimate_text(&self, text: &str) -> usize {
        let chars = text.chars().count() as f32;
        (chars / self.chars_per_token * self.safety_margin).ceil() as usize
    }
Comment thread docs/rfcs/model-gateway-primitives.md Outdated

Function definitions + tool results add tokens not present in the user's message. A `list_files` tool definition is ~100 tokens; a directory listing result can be thousands.

Addressed by the `estimate_chat` method taking tools explicitly, and by default safety margin. Power users with tool-heavy flows should bump `safety_margin` (0.15–0.20 reasonable).
Comment thread docs/rfcs/model-gateway-primitives.md Outdated
//
// Rejected if: estimate > ceiling
//
// e.g. raw estimate 180_000 tokens, safety_margin=1.10 applied inside

🛑 Gitleaks has detected a secret with rule-id generic-api-key in commit 0868abd.
If this secret is a true positive, please rotate the secret ASAP.

If this secret is a false positive, you can add the fingerprint below to your .gitleaksignore file and commit the change to this branch.

echo 0868abd89d54b5f01269233147ca0e2f86a66dc1:docs/rfcs/model-gateway-primitives.md:generic-api-key:293 >> .gitleaksignore

Copilot AI review requested due to automatic review settings April 23, 2026 18:55

Copilot AI left a comment

Pull request overview

This PR adds an RFC design document proposing a clearer set of “model-gateway primitives” (Alloy, Cascade, Dispatcher) plus a shared TokenEstimator abstraction to make routing and fallback decisions size-aware and avoid silent truncation in mixed-context deployments.

Changes:

  • Introduces a three-primitive mental model: blending (alloy), ordered fallback (cascade), and request-shape routing (dispatcher).
  • Defines a shared TokenEstimator trait with a default char-ratio heuristic and future pluggable “real tokenizer” implementations.
  • Specifies config-surface proposals, safety rules (fit checks, capacity derating), and migration notes for upcoming implementation PRs.

Comment on lines +491 to +502
### 5. Context-window units

All sizes are in **tokens**. Not characters, not bytes. Config authors can use `K` suffix for readability; `K` means `* 1024` (binary convention):

```toml
context_window = 262144 # tokens (2^18)
# equivalently:
context_window = "256K" # parsed as 256 * 1024 = 262144
```

A literal "262K" would parse as 262 * 1024 = 268288, not 262144 — use `"256K"` if you want the `262144` value.
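
A parser for this convention might look like the sketch below. The function name `parse_context_window` and its error strings are hypothetical, and accepting only an uppercase `K` is an assumption, since the excerpt does not say whether lowercase is allowed.

```rust
/// Parses a context-window size such as "262144" or "256K" into tokens.
/// `K` is binary (× 1024), as stated in the RFC excerpt.
pub fn parse_context_window(raw: &str) -> Result<u32, String> {
    let s = raw.trim();
    // Split off an optional trailing `K` and pick the matching multiplier.
    let (digits, multiplier) = match s.strip_suffix('K') {
        Some(prefix) => (prefix, 1024u32),
        None => (s, 1),
    };
    let n: u32 = digits
        .parse()
        .map_err(|e| format!("invalid context_window '{raw}': {e}"))?;
    // checked_mul turns an overflowing size into an error instead of wrapping.
    n.checked_mul(multiplier)
        .ok_or_else(|| format!("context_window '{raw}' overflows u32"))
}
```

Here `parse_context_window("256K")` yields 262144 and `parse_context_window("262K")` yields 268288, which is the distinction the note above is drawing.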


**Runtime behavior:** before trying step N, estimate request tokens; skip to step N+1 if request doesn't fit. Track which steps were attempted for telemetry. Fail with clear error if no step can serve.

**Discussion open:** should cascade skip-on-size be silent, or should each downgrade be logged? *Recommendation: log each skipped step at INFO level; return a final error if every step was skipped.*
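
The runtime behavior above, with size-based eligibility kept separate from the error-based retry trigger, can be sketched as follows. `Step`, `Trace`, and `run_cascade` are hypothetical names, and the `call` closure stands in for the real provider request; none of this is the project's actual code.

```rust
/// One cascade step (hypothetical shape).
pub struct Step {
    pub id: &'static str,
    /// Effective ceiling for this step, in tokens.
    pub ceiling: u64,
}

/// Telemetry describing what the cascade did.
#[derive(Debug, Default, PartialEq)]
pub struct Trace {
    pub skipped_for_size: Vec<&'static str>,
    pub attempted: Vec<&'static str>,
}

/// Walks steps in declaration order. `call` returns Err on a retryable
/// failure (timeout, 5xx, 429) and Ok with the response body otherwise.
pub fn run_cascade<F>(
    steps: &[Step],
    estimated_tokens: u64,
    mut call: F,
) -> (Result<String, String>, Trace)
where
    F: FnMut(&Step) -> Result<String, String>,
{
    let mut trace = Trace::default();
    let mut last_err = None;
    for step in steps {
        // Eligibility: size is checked before any attempt is made.
        if estimated_tokens > step.ceiling {
            trace.skipped_for_size.push(step.id);
            continue;
        }
        // Retry trigger: only an error moves on to the next step.
        trace.attempted.push(step.id);
        match call(step) {
            Ok(body) => return (Ok(body), trace),
            Err(e) => last_err = Some(e),
        }
    }
    let err = match last_err {
        Some(e) => format!("all eligible cascade steps failed; last error: {e}"),
        None => "no cascade step can fit the request".to_string(),
    };
    (Err(err), trace)
}
```

Returning the `Trace` alongside the result keeps the telemetry decision (silent versus logged) with the caller, which is where the open discussion point would be resolved.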
Comment on lines +506 to +507
- **Dispatcher**: rule targets are either sized (matchable) or "any" (catch-all, always matches last). Unsized rules only make sense as catch-alls.
- **Cascade**: unsized steps are always considered eligible (no pre-skip); the step is attempted and errors surface as normal cascade failures.

## Next steps if this RFC lands

1. Small safety PR — add `context_window` to `AlloyConstituentConfig`, `min_context_window` to `AlloyConfig`, validation in `AlloyProvider::new()`. No runtime behavior change beyond rejecting bad configs at startup. *(Already scoped as task #23.)*
- Include `name` field in alloy TOML example (AlloyConfig still requires it).
- Reference `AlloyProvider::from_config()` (actual) instead of
  `AlloyProvider::new()` (non-existent).
- Clarify ordered_models determinism: round_robin deterministic; weighted
  is random sampling without replacement. Cascade, unlike alloy, is
  always deterministic (declaration order).
- Resolve safety_margin double-apply ambiguity: estimator returns a
  margin-applied count; caller never multiplies again. Fit-check becomes
  estimate > context_window × capacity_fraction (one multiplication, one
  comparison). Reworked the worked example accordingly.
- Fix `safety_margin 0.15–0.20` typo — the value is a multiplier, so
  "bump it" means 1.15–1.20, not 0.15 (which would shrink the estimate).
bglusman force-pushed the rfc/model-gateway-primitives branch from d1f0c0c to c9c5845 on April 23, 2026 18:59
bglusman marked this pull request as ready for review on April 23, 2026 21:51
Copilot AI review requested due to automatic review settings April 23, 2026 21:51
bglusman merged commit f7858a0 into main on Apr 23, 2026
15 checks passed

Copilot AI left a comment

Pull request overview

Adds an RFC documenting a proposed “model gateway” architecture for zeroclawed to prevent silent truncation and support size-aware routing across heterogeneous models (e.g., local small-context + remote large-context).

Changes:

  • Introduces three primitives: alloy (blend equivalent models), cascade (ordered on-error fallback), and dispatcher (size-/rule-based selection).
  • Proposes a shared TokenEstimator trait plus default CharRatioEstimator, with optional real-tokenizer implementations behind feature flags.
  • Defines config-surface proposals including context_window, min_context_window, capacity_fraction, tokenizer overrides, and migration notes.

Comment on lines +450 to +451
**Rule**: cascades are *not* size-safe on their own — they must be used at a level where all members can serve the incoming request, OR wrapped in a dispatcher.

Comment on lines +3 to +4
Status: **draft — decisions incorporated from first review round.** Implementation PRs can start.


bglusman added a commit that referenced this pull request Apr 23, 2026
* feat(alloy): require context_window + min_context_window safety


* fix(alloy): rustfmt + reject zero context_window + stale doc


* test(config): assert missing context_window fails to deserialize

Guards against silently reintroducing a serde default (Option<u32> or
#[serde(default)]) on AlloyConstituentConfig::context_window. Addresses
Copilot review feedback on #14.

* fix(config): use expect_err per clippy

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
bglusman added a commit that referenced this pull request Apr 25, 2026
* docs(rfc): model-gateway primitives — alloy + cascade + dispatcher


* docs(rfc): incorporate first-round review decisions


* docs(rfc): address Copilot review on model-gateway-primitives


* docs(rfc): second round of Copilot review — new feedback on updated push


---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
bglusman added a commit that referenced this pull request Apr 25, 2026
bglusman deleted the rfc/model-gateway-primitives branch on May 1, 2026 17:21
2 participants