RFC: model-gateway primitives (alloy + cascade + dispatcher)#13
Conversation
Proposes three explicitly-scoped primitives for model-gateway routing, each with distinct semantics, and a shared TokenEstimator trait so they can reason about context-window fit consistently.

Problem motivating this: alloys today assume constituents are interchangeable, which breaks when mixing models with wildly different context windows (e.g., local Qwen at 32K and Kimi K2.6 at 262K). Using an alloy for size-dependent routing leads to silent truncation.

Proposal:
- Alloy (today, + safety) blends equivalent models via sampling. New: min_context_window assertion + runtime fit-check rejection.
- Cascade (promoted to named primitive) tries members in order on error. New: skip-on-size when a cascade step can't fit the request.
- Dispatcher (new) picks by request shape; MVP rule type is max_input_tokens, extensible to other matchers.

Shared TokenEstimator trait:
- Default: CharRatioEstimator (chars-per-token configurable, 3.5 default)
- Safety margin (default 10%) to bias toward over-estimation and avoid the silent-truncation footgun.
- Pluggable for real tokenizers (tiktoken-rs, sentencepiece) as future opt-in crates behind feature flags.
- Configurable globally, per-primitive, and per-model with clear precedence.

Includes naming discussion (going with "dispatcher"), migration plan (no breaking changes), edge cases (session-context growth, tool-use inflation, streaming output), and open questions for reviewers. Implementation split into focused follow-up PRs. RFC is the long-form design; PRs execute.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pull request overview
Adds an RFC/design document to define model-gateway routing primitives and token fit-checking semantics for zeroclawed, motivated by hybrid deployments with widely differing context windows.
Changes:
- Introduces three proposed gateway primitives: Alloy (equivalent-model blending), Cascade (ordered fallback), and Dispatcher (request-shape routing).
- Proposes a shared `TokenEstimator` trait with a default char-ratio heuristic and future pluggable tokenizer backends.
- Documents composition patterns, migration notes, and open questions for reviewers (naming, safety margin, re-evaluation strategy).
```toml
id = "kimi-with-fallback"
# First success wins. Cascading ONLY on error (not on size).
# Caller's responsibility: ensure the request fits the NARROWEST member,
# OR wrap the cascade inside a dispatcher.

[[cascades.steps]]
```
```toml
context_window = 262144  # tokens
# equivalently:
context_window = "256K"  # parsed as 256 * 1024 = 262144
```
**Rationale for default values:**

- **3.5 chars/token** — common average for English prose with GPT-family tokenizers. Code and Chinese text are denser (~2.5 and ~1.5 respectively). Under-estimating for code/non-English is the *risk case*, hence the safety margin.
- **10% safety margin** — covers most under-estimates without wasting model context.

The estimator is an *input* estimator only; output tokens don't enter the input estimate, but they do consume the shared context window and are budgeted separately via `max_tokens`.
### 4. Reasoning tokens (e.g., Kimi K2.6 reasoning mode)

Reasoning tokens add output cost but are usually *not* in the input context, so the estimator doesn't need to account for them.
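The output-budget point resolved later in review (output tokens do count against the combined context) reduces to a single check; `fits_with_output` and its parameters are illustrative names, not the actual API:

```rust
// Sketch (hypothetical names): the combined-context check — the input
// estimate plus the caller-supplied output budget (max_tokens) must
// together fit inside the model's context window.
fn fits_with_output(input_estimate: u32, max_tokens: u32, context_window: u32) -> bool {
    input_estimate.saturating_add(max_tokens) <= context_window
}

fn main() {
    // 30K input + 4K output budget overflows a 32K window...
    assert!(!fits_with_output(30_000, 4_096, 32_768));
    // ...while 28K input still fits.
    assert!(fits_with_output(28_000, 4_096, 32_768));
    println!("ok");
}
```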
```toml
[model_defaults]
chars_per_token = 3.5
safety_margin = 1.10

[[models]]
id = "qwen3.5-35b"
context_window = 32768
chars_per_token = 3.0  # Qwen tokenizer tends denser

[[models]]
id = "kimi-k2.6"
context_window = 262144
chars_per_token = 2.8  # Chinese-English mixed, code-heavy
```
### Config surface

**Global default** in top-level config:

```toml
[tokenizer]
kind = "char_ratio"  # "char_ratio" | "tiktoken" | "sentencepiece"
chars_per_token = 3.5
safety_margin = 1.10
```
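The proposed precedence (per-model over per-primitive over the global `[tokenizer]` default) can be sketched like this; the struct and function names are hypothetical, not the actual config-loader API:

```rust
// Sketch (hypothetical types): resolving the effective tokenizer settings.
// Proposed precedence: per-model > per-primitive > global [tokenizer] default.
#[derive(Clone, Copy, Debug)]
struct TokenizerCfg {
    chars_per_token: f32,
    safety_margin: f32,
}

fn resolve(
    global: TokenizerCfg,
    per_primitive: Option<TokenizerCfg>,
    per_model: Option<TokenizerCfg>,
) -> TokenizerCfg {
    // First Some wins, walking from most to least specific.
    per_model.or(per_primitive).unwrap_or(global)
}

fn main() {
    let global = TokenizerCfg { chars_per_token: 3.5, safety_margin: 1.10 };
    let qwen = TokenizerCfg { chars_per_token: 3.0, safety_margin: 1.10 };
    // Per-model override beats the global default.
    assert_eq!(resolve(global, None, Some(qwen)).chars_per_token, 3.0);
    // No overrides: fall back to [tokenizer].
    assert_eq!(resolve(global, None, None).chars_per_token, 3.5);
    println!("ok");
}
```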
- **Dispatcher**: rule targets are either sized (matchable) or "any" (catch-all, always matches last). Unsized rules only make sense as catch-alls.
- **Cascade**: unsized steps are always considered eligible (no pre-skip).

An explicit `context_window = 0` is rejected at validation — there is no "unknown/unbounded" sentinel.
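The default size-based selection from the review resolutions ("first target whose effective ceiling fits") can be sketched as below; the `Target` struct and the 0.9 `capacity_fraction` are assumptions for illustration:

```rust
// Sketch (hypothetical types; capacity_fraction 0.9 is an assumed value):
// default dispatcher rule — pick the first target whose effective ceiling
// (context_window × capacity_fraction) holds the margin-applied estimate.
struct Target {
    id: &'static str,
    context_window: u32,
    capacity_fraction: f32,
}

fn effective_ceiling(t: &Target) -> u32 {
    (t.context_window as f64 * t.capacity_fraction as f64) as u32
}

fn dispatch(targets: &[Target], estimate: u32) -> Option<&'static str> {
    targets.iter().find(|t| estimate <= effective_ceiling(t)).map(|t| t.id)
}

fn main() {
    let targets = [
        Target { id: "qwen3.5-35b", context_window: 32_768, capacity_fraction: 0.9 },
        Target { id: "kimi-k2.6", context_window: 262_144, capacity_fraction: 0.9 },
    ];
    // A 100K-token request skips the 32K model (ceiling ≈ 29_491) and lands on Kimi.
    assert_eq!(dispatch(&targets, 100_000), Some("kimi-k2.6"));
    // A small request matches the first (narrowest) target.
    assert_eq!(dispatch(&targets, 10_000), Some("qwen3.5-35b"));
    // Nothing fits → no target (a configured catch-all would absorb this).
    assert_eq!(dispatch(&targets, 300_000), None);
    println!("ok");
}
```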
**Purpose:** reliability. Try primary; on timeout/5xx/429 try secondary.

**Today's behavior** (implicit in the alloy's `ordered_models` fallback iteration inside `route_with_fallback`) → **promoted** to its own named primitive for clarity:
Captures reviewer resolutions:
- dispatcher confirmed as the size-routing primitive name
- cascade promoted to a named primitive
- safety margin split into two distinct knobs: estimator safety_margin (counting-accuracy pad) and per-model capacity_fraction (avoid the quality-degradation zone near a model's ceiling). Composition formula and rationale added.
- TiktokenEstimator included in v1 behind a feature flag; SentencePiece deferred to avoid C++ deps + per-model vocab plumbing for now
- dispatcher reevaluate default flipped to per_turn (task-completion flows benefit from auto-promotion over consistency); sticky and sticky_escalate documented as opt-ins
- dispatcher rule semantics simplified: default is "first target whose effective ceiling fits the request", computed per-target from context_window × capacity_fraction. Explicit rules remain for non-size routing.
- alloy context_window required on every constituent. No back-compat for missing fields — prototype phase, owned installations, worth the one-time config edit to eliminate silent truncation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per RFC #13 review decision (no back-compat in prototype phase): every alloy constituent must declare its context_window. Serde enforces the field at config load, so misconfigured alloys fail with a clear parse error instead of quietly accepting "unknown" sizes that could mask routing bugs later.

## Config changes

AlloyConstituentConfig.context_window is now a required u32 (was an optional field in the previous draft). Every existing config must be updated to declare sizes; live config on .210 already patched in the matching commit on infra.

AlloyConfig.min_context_window remains optional; when unset, it's auto-computed as min(constituent.context_window). When set, validation rejects any constituent whose declared size falls below it with a clear error naming both the offender and the numbers.

## Runtime exposure

AlloyProvider::min_context_window() returns u32 (no longer Option), since size declaration is always present.

## Tests

7 tests: round_robin/weighted/stats unchanged; new tests cover auto-compute from mixed-size constituents, shared-size parity, explicit min priority, and an explanatory error on constituent-below-floor. The "no declared size" test from the previous draft is deleted — that state is now unreachable.

## Live config migration

Deployed alongside this PR: .210's kimi-for-coding alloy gets context_window declared on both constituents (262144 for Kimi, 64000 for DeepSeek V3). No other nodes use [[alloys]] today.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- cargo fmt: collapse multi-line array in alloy.rs test (CI fmt gate)
- Reject context_window=0 on constituents (clear misconfig error)
- Reject min_context_window=0 at alloy build (same)
- Drop stale doc reference to docs/rfcs/model-gateway-primitives.md (lives in PR #13, not yet on main)
- Simplify min_context_window doc: context_window is required on constituents now, so the "treated as unknown" qualifier is dead.

Addresses Copilot review feedback on #14.
- Cascade semantics: reconcile "errors only" with "pre-skip unfit steps". Now distinguishes ELIGIBILITY (size-based, checked before attempt) from RETRY TRIGGER (errors: timeout, 5xx, 429).
- Fix fabricated mechanism reference: there is no `fallbacks` field on AlloyConfig. Describe the real mechanism (`ordered_models` returned from `select_plan` and iterated in `route_with_fallback`).
- CharRatio safety margin: explicitly flag that 10% is insufficient for CJK-heavy prompts; suggest remediations (tune chars_per_token, raise safety_margin, or switch to Tiktoken).
- Correct "streaming output doesn't count" — it DOES count against the combined context. Describe the input + max_tokens check approach and the default output budget callers must supply.
- Fix K-suffix math: 262K = 262 * 1024 = 268288, not 262144. Add clarifying example using "256K" → 262144.
- Remove stale "context_window=0 = unknown" sentinel note; PR #14 rejects 0 explicitly at alloy validation.
- Call out [tokenizer], [model_defaults], [[models]] as PROPOSED schema additions, not current.
Pull request overview
Adds a new RFC documenting proposed “model-gateway primitives” (Alloy, Cascade, Dispatcher) and a shared TokenEstimator abstraction to make routing/fallback behavior context-window aware and prevent silent truncation.
Changes:
- Introduces an RFC defining three routing/blending primitives (alloy, cascade, dispatcher) and how they compose.
- Proposes a shared `TokenEstimator` trait (default char-ratio heuristic; future real-tokenizer backends).
- Specifies safety mechanisms around context window sizing, output-token headroom, and migration strategy.
```toml
[[alloys]]
id = "fast-smart-blend"
```
```toml
weight = 20
```

**Validation:** at `AlloyProvider::from_config()`, error if any constituent's declared `context_window < min_context_window`. Catches the "I didn't mean to put a 32K and a 262K in the same alloy" footgun at config-load time.
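A minimal sketch of that validation, assuming simplified stand-in types (the real code lives behind serde-backed config structs):

```rust
// Sketch (simplified types): alloy-build validation per the RFC decisions —
// context_window required and nonzero; min_context_window auto-computed
// when unset; constituents below the floor rejected with an error that
// names both the offender and the numbers.
struct Constituent {
    id: String,
    context_window: u32,
}

fn validate_alloy(
    constituents: &[Constituent],
    min_context_window: Option<u32>,
) -> Result<u32, String> {
    if let Some(c) = constituents.iter().find(|c| c.context_window == 0) {
        return Err(format!("constituent '{}' declares context_window = 0", c.id));
    }
    let auto_min = constituents
        .iter()
        .map(|c| c.context_window)
        .min()
        .ok_or_else(|| "alloy has no constituents".to_string())?;
    let floor = min_context_window.unwrap_or(auto_min);
    if let Some(c) = constituents.iter().find(|c| c.context_window < floor) {
        return Err(format!(
            "constituent '{}' ({} tokens) is below min_context_window ({})",
            c.id, c.context_window, floor
        ));
    }
    Ok(floor)
}

fn main() {
    let mixed = vec![
        Constituent { id: "kimi-k2.6".into(), context_window: 262_144 },
        Constituent { id: "deepseek-v3".into(), context_window: 64_000 },
    ];
    // Auto-computed floor is the narrowest constituent.
    assert_eq!(validate_alloy(&mixed, None), Ok(64_000));
    // An explicit floor above a constituent's size is a config error.
    assert!(validate_alloy(&mixed, Some(100_000)).is_err());
    println!("ok");
}
```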
with every remaining constituent appended in deterministic order, and the proxy iterates them in `route_with_fallback()` until one succeeds. That "all constituents of the alloy are also fallbacks" pattern is what the cascade
```rust
impl TokenEstimator for CharRatioEstimator {
    fn estimate_text(&self, text: &str) -> usize {
        let chars = text.chars().count() as f32;
        (chars / self.chars_per_token * self.safety_margin).ceil() as usize
    }
}
```
Function definitions + tool results add tokens not present in the user's message. A `list_files` tool definition is ~100 tokens; a directory listing result can be thousands.

Addressed by the `estimate_chat` method taking tools explicitly, and by the default safety margin. Power users with tool-heavy flows should bump `safety_margin` (1.15–1.20 is reasonable — the value is a multiplier).
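A sketch of how `estimate_chat` might fold tool definitions into the count; the signature and the inherent-impl shape here are assumptions, not the final trait:

```rust
// Sketch (assumed signature): estimate_chat counts message text plus the
// serialized tool definitions, so tool-heavy requests can't slip under
// the fit check on message size alone.
struct CharRatioEstimator {
    chars_per_token: f32,
    safety_margin: f32,
}

impl CharRatioEstimator {
    fn estimate_text(&self, text: &str) -> usize {
        let chars = text.chars().count() as f32;
        (chars / self.chars_per_token * self.safety_margin).ceil() as usize
    }

    fn estimate_chat(&self, messages: &[&str], tool_defs: &[&str]) -> usize {
        messages
            .iter()
            .chain(tool_defs.iter())
            .map(|s| self.estimate_text(s))
            .sum()
    }
}

fn main() {
    let est = CharRatioEstimator { chars_per_token: 3.5, safety_margin: 1.0 };
    let msg = "a".repeat(35); // 35 chars / 3.5 = 10 tokens (margin 1.0 here)
    let m = msg.as_str();
    assert_eq!(est.estimate_text(m), 10);
    // Two messages plus one tool definition of the same size: 30 tokens.
    assert_eq!(est.estimate_chat(&[m, m], &[m]), 30);
    println!("ok");
}
```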
```rust
//
// Rejected if: estimate > ceiling
//
// e.g. raw estimate 180_000 tokens, safety_margin=1.10 applied inside
```
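That worked example, under the single-multiplication rule settled in review (the estimate already carries safety_margin; the caller applies only capacity_fraction — the 0.90 fraction below is an assumed value):

```rust
// Sketch: the fit check after the double-apply ambiguity was resolved —
// the estimate already includes safety_margin, so the caller multiplies
// exactly once, by the model's capacity_fraction.
fn fits(margin_applied_estimate: u32, context_window: u32, capacity_fraction: f32) -> bool {
    (margin_applied_estimate as f64) <= (context_window as f64) * (capacity_fraction as f64)
}

fn main() {
    // Raw estimate 180_000 × 1.10 margin = 198_000 tokens (applied inside the estimator).
    let estimate = 198_000;
    // Kimi K2.6: 262_144-token window × 0.90 ≈ 235_929 effective ceiling — fits.
    assert!(fits(estimate, 262_144, 0.90));
    // Qwen at 32_768 tokens × 0.90 ≈ 29_491 — rejected.
    assert!(!fits(estimate, 32_768, 0.90));
    println!("ok");
}
```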
🛑 Gitleaks has detected a secret with rule-id generic-api-key in commit 0868abd.
If this secret is a true positive, please rotate the secret ASAP.
If this secret is a false positive, you can add the fingerprint below to your .gitleaksignore file and commit the change to this branch.
```shell
echo 0868abd89d54b5f01269233147ca0e2f86a66dc1:docs/rfcs/model-gateway-primitives.md:generic-api-key:293 >> .gitleaksignore
```
Pull request overview
This PR adds an RFC design document proposing a clearer set of “model-gateway primitives” (Alloy, Cascade, Dispatcher) plus a shared TokenEstimator abstraction to make routing and fallback decisions size-aware and avoid silent truncation in mixed-context deployments.
Changes:
- Introduces a three-primitive mental model: blending (`alloy`), ordered fallback (`cascade`), and request-shape routing (`dispatcher`).
- Defines a shared `TokenEstimator` trait with a default char-ratio heuristic and future pluggable "real tokenizer" implementations.
- Specifies config-surface proposals, safety rules (fit checks, capacity derating), and migration notes for upcoming implementation PRs.
### 5. Context-window units

All sizes are in **tokens**. Not characters, not bytes. Config authors can use a `K` suffix for readability; `K` means `* 1024` (binary convention):

```toml
context_window = 262144  # tokens (2^18)
# equivalently:
context_window = "256K"  # parsed as 256 * 1024 = 262144
```

A literal "262K" would parse as 262 * 1024 = 268288, not 262144 — use `"256K"` if you want the `262144` value.
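A sketch of a parser enforcing those semantics; the function name and error handling are illustrative, not the proposed config-loader API:

```rust
// Sketch (illustrative): parse either a bare token count or an "<n>K"
// string, where K means × 1024 (binary convention).
fn parse_context_window(raw: &str) -> Result<u32, String> {
    let raw = raw.trim();
    if let Some(num) = raw.strip_suffix('K').or_else(|| raw.strip_suffix('k')) {
        let n: u32 = num
            .trim()
            .parse()
            .map_err(|e| format!("bad K-suffixed value '{raw}': {e}"))?;
        Ok(n * 1024)
    } else {
        raw.parse().map_err(|e| format!("bad token count '{raw}': {e}"))
    }
}

fn main() {
    assert_eq!(parse_context_window("256K"), Ok(262_144));
    // The footgun from the text: "262K" is 268_288 tokens, not 262_144.
    assert_eq!(parse_context_window("262K"), Ok(268_288));
    assert_eq!(parse_context_window("32768"), Ok(32_768));
    println!("ok");
}
```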
**Runtime behavior:** before trying step N, estimate request tokens; skip to step N+1 if the request doesn't fit. Track which steps were attempted for telemetry. Fail with a clear error if no step can serve.

**Discussion open:** should cascade skip-on-size be silent, or emit a warning per downgrade? *Recommendation: warning log at INFO level per skipped step; final error if everything skipped.*
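That runtime behavior can be sketched as an eligibility-only walk; the `Step` struct is a simplification, and a real cascade would attempt each eligible step and fall through on timeout/5xx/429:

```rust
// Sketch (simplified): cascade step selection — size decides ELIGIBILITY
// before any attempt; errors (timeout/5xx/429) are the retry trigger and
// are out of scope for this eligibility-only walk.
struct Step {
    id: &'static str,
    effective_ceiling: u32, // context_window × capacity_fraction, precomputed
}

fn cascade_select(
    steps: &[Step],
    estimate: u32,
) -> Result<(&'static str, Vec<&'static str>), String> {
    let mut skipped = Vec::new();
    for step in steps {
        if estimate > step.effective_ceiling {
            // Real implementation: INFO-level log per skipped step.
            skipped.push(step.id);
            continue;
        }
        return Ok((step.id, skipped));
    }
    Err(format!("no cascade step can hold {estimate} tokens"))
}

fn main() {
    let steps = [
        Step { id: "qwen3.5-35b", effective_ceiling: 29_491 },
        Step { id: "kimi-k2.6", effective_ceiling: 235_929 },
    ];
    // A 100K-token request pre-skips the narrow step, recording it for telemetry.
    assert_eq!(
        cascade_select(&steps, 100_000),
        Ok(("kimi-k2.6", vec!["qwen3.5-35b"]))
    );
    // Nothing fits: clear final error.
    assert!(cascade_select(&steps, 300_000).is_err());
    println!("ok");
}
```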
- **Dispatcher**: rule targets are either sized (matchable) or "any" (catch-all, always matches last). Unsized rules only make sense as catch-alls.
- **Cascade**: unsized steps are always considered eligible (no pre-skip); the step is attempted and errors surface as normal cascade failures.
## Next steps if this RFC lands

1. Small safety PR — add `context_window` to `AlloyConstituentConfig`, `min_context_window` to `AlloyConfig`, validation in `AlloyProvider::from_config()`. No runtime behavior change beyond rejecting bad configs at startup. *(Already scoped as task #23.)*
- Include `name` field in the alloy TOML example (AlloyConfig still requires it).
- Reference `AlloyProvider::from_config()` (actual) instead of `AlloyProvider::new()` (non-existent).
- Clarify ordered_models determinism: round_robin is deterministic; weighted is random sampling without replacement. Cascade, unlike alloy, is always deterministic (declaration order).
- Resolve safety_margin double-apply ambiguity: the estimator returns a margin-applied count; the caller never multiplies again. The fit-check becomes estimate > context_window × capacity_fraction (one multiplication, one comparison). Reworked the worked example accordingly.
- Fix `safety_margin 0.15–0.20` typo — the value is a multiplier, so "bump it" means 1.15–1.20, not 0.15 (which would shrink the estimate).
d1f0c0c to c9c5845
Pull request overview
Adds an RFC documenting a proposed “model gateway” architecture for zeroclawed to prevent silent truncation and support size-aware routing across heterogeneous models (e.g., local small-context + remote large-context).
Changes:
- Introduces three primitives: `alloy` (blend equivalent models), `cascade` (ordered on-error fallback), and `dispatcher` (size-/rule-based selection).
- Proposes a shared `TokenEstimator` trait plus a default `CharRatioEstimator`, with optional real-tokenizer implementations behind feature flags.
- Defines config-surface proposals including `context_window`, `min_context_window`, `capacity_fraction`, tokenizer overrides, and migration notes.
**Rule**: cascades are *not* size-safe on their own — they must be used at a level where all members can serve the incoming request, OR wrapped in a dispatcher.
Status: **draft — decisions incorporated from first review round.** Implementation PRs can start.
* feat(alloy): require context_window + min_context_window safety
* fix(alloy): rustfmt + reject zero context_window + stale doc
* test(config): assert missing context_window fails to deserialize — guards against silently reintroducing a serde default (Option&lt;u32&gt; or #[serde(default)]) on AlloyConstituentConfig::context_window. Addresses Copilot review feedback on #14.
* fix(config): use expect_err per clippy

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* docs(rfc): model-gateway primitives — alloy + cascade + dispatcher
* docs(rfc): incorporate first-round review decisions
* docs(rfc): address Copilot review on model-gateway-primitives
* docs(rfc): second round of Copilot review — new feedback on updated push

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Design doc proposing three clearly-scoped model-gateway primitives plus a shared TokenEstimator trait, in response to the architectural gap surfaced by trying to use alloys for kimi-plus-local hybrids.
Problem: alloys today assume constituents are interchangeable — any could serve any request. That breaks when constituents have wildly different context windows (e.g. local Qwen at 32K + Kimi K2.6 at 262K). A 100K request routed to Qwen silently truncates.
Proposal:

- `max_input_tokens` MVP rule, extensible later
- `TokenEstimator` trait with configurable chars-per-token default (3.5) + safety margin (10%), per-model overrides, pluggable real-tokenizer backends as future feature flags

Review asks
Specific questions called out in the doc under "Open questions for reviewers":
- `dispatcher` vs `router`/`tier`/`fit`/`selector` — I'm going with `dispatcher` but open to alternatives

Scope boundaries
In this PR: design only. No code changes.
Immediate follow-up (task #23): small safety assertion PR — adds `context_window` + `min_context_window` fields to alloy config with startup validation. Independent of this RFC's resolution; unblocks alloy safety now.

Subsequent follow-ups if RFC approved: TokenEstimator impl, cascade named primitive, dispatcher impl, README updates.
Test plan
🤖 Generated with Claude Code