[Feature]: Typed Config-Runtime Contract — eliminate silent config/state/hook binding gaps #28984

@zccyman

Feature Request: Typed Config-Runtime Contract

Supersedes: This FR has been expanded from the original "Typed Plugin Hook Protocol" scope. Hook calls are one sub-pattern; this FR now covers the full contract gap across config→runtime, state→path, and interface→caller boundaries.


Value Score: 44/60 (73%) — HIGH VALUE

| Dimension | Score | Rationale |
| --- | --- | --- |
| Impact Breadth | 9/10 | 17/100 open bugs (17%) share this root cause; #1 recurring pattern |
| Impact Severity | 7/10 | Silent failures, hard to debug, user sees wrong behavior with no error |
| Fix Leverage | 9/10 | One structural fix prevents entire bug class; individual ~30-line fixes don't |
| Implementation Feasibility | 5/10 | Touches core modules, but can be rolled out incrementally |
| Upstream Receptiveness | 6/10 | Architectural FRs are harder to accept, but 17-bug evidence is compelling |
| Proof of Concept | 8/10 | 5 fixes already demonstrate the pattern; get_custom_provider_model_field() is a working micro-example |

No existing FR covers this scope. Related:


Problem

Hermes has no contract layer between config/state producers and consumers. When a new field is added to config.yaml, a hook signature, or a state object, there is no mechanism to ensure every downstream consumer picks it up. The result: silent failures that only surface when users hit the un-updated code path.

This is not a theory — it is the #1 recurring bug pattern in the open issue tracker.

Evidence: 17 bugs, one root cause, four sub-patterns

We analyzed 100 open bug issues and found 17 that follow the exact same structural defect. They break into four sub-patterns:

Sub-pattern 1: Config field declared, no consumer (4 bugs)

A new config.yaml key is added and documented, but the bridge code that maps it to a runtime variable or env var is never updated.

| Bug | Symptom |
| --- | --- |
| #28046 | custom_providers[].models.*.max_tokens ignored, always defaults to 4096 |
| #28863 | terminal.docker_extra_args silently dropped; missing from _terminal_env_map |
| #28651 | web_tools hardcodes provider list, ignores configured providers |
| #28034 | /model --global doesn't persist when using the visual model picker |
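
To make the failure mode concrete, here is a minimal sketch of how an explicit key-to-env-var map can drop a documented field without any error. It is not Hermes code: `TERMINAL_ENV_MAP`, `bridge_terminal_config`, `unmapped_keys`, and the `docker_image` key are hypothetical stand-ins chosen for illustration.

```python
# Hypothetical bridge: only keys listed in the map ever reach the runtime.
TERMINAL_ENV_MAP = {
    "docker_image": "TERMINAL_DOCKER_IMAGE",
    # "docker_extra_args" was never added here, so it is dropped silently below.
}


def bridge_terminal_config(terminal_cfg: dict) -> dict:
    """Translate config keys to env vars, ignoring anything not in the map."""
    return {
        env_name: str(terminal_cfg[key])
        for key, env_name in TERMINAL_ENV_MAP.items()
        if key in terminal_cfg
    }


def unmapped_keys(terminal_cfg: dict) -> set:
    """Keys the user set that the bridge will never forward."""
    return set(terminal_cfg) - set(TERMINAL_ENV_MAP)


cfg = {"docker_image": "python:3.12", "docker_extra_args": ["--gpus", "all"]}
env = bridge_terminal_config(cfg)  # docker_extra_args is absent, no error raised
```

A check like `unmapped_keys` is the smallest possible version of the registry idea proposed below: it turns the silent drop into something that can be logged.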

Sub-pattern 2: Path A works, path B broken (5 bugs)

Feature works correctly on one execution path (e.g., startup) but is broken on another (e.g., /model switch, gateway restart, fallback activation). The logic was copy-pasted or reimplemented instead of shared.

| Bug | Symptom |
| --- | --- |
| #28753 | TUI doesn't propagate fallback_model/fallback_providers; gateway path works |
| #28825 | OpenAI-compatible API doesn't honour tools param; Anthropic path works |
| #28746 | session:end event not emitted from idle-expiry/auto-reset path; normal close works |
| #28637 | Per-model token usage lost during /model switch; init path records it |
| #28023 | Credential pool strategy not honoured on fallback; startup path reads it |

Sub-pattern 3: Interface expanded, caller missed (3 bugs)

A hook signature, callback, or protocol method was extended with new parameters, but not all call sites were updated.

| Bug | Symptom |
| --- | --- |
| #28961 | pre_tool_call hook missing session_id/tool_call_id in 2 of 3 call sites |
| #28296 | OpenViking missing on_session_switch(); interface method added, implementation missed |
| #28662 | hermes cron list crashes on MCP-created jobs; schedule field type assumption differs |

Sub-pattern 4: State added, not propagated (3 bugs)

A new field was added to a state dict/object, but not all paths that serialize/deserialize/transform that state preserve the new field.

| Bug | Symptom |
| --- | --- |
| #28841 | Message timestamp lost during fork/compress/branch; always DB write-time |
| #28632 | Gateway restart leaves launchd service unloaded; stop path cleans up, restart doesn't |
| #28489 | Gateway persists invalid /model status override and keeps reusing it |

(✅ = already fixed, listed here as evidence the pattern is real and fixable; the fixed issues are the ones enumerated under "Fixes (as evidence)" near the end of this issue.)
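
Sub-pattern 4 can be illustrated with a small sketch. The `Message` class and both fork helpers are hypothetical, and the lost value falls back to a dataclass default here rather than a DB write-time as in #28841; the point is only that a field-by-field rebuild forgets fields added later, while a whole-object copy does not.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Message:
    role: str
    content: str
    timestamp: float = 0.0  # field added after the fork helper was written


def fork_lossy(msg: Message) -> Message:
    """Rebuilds the message field by field; timestamp falls back to its default."""
    return Message(role=msg.role, content=msg.content)


def fork_preserving(msg: Message) -> Message:
    """Copies the whole object, so fields added in the future survive too."""
    return replace(msg)


original = Message(role="user", content="hi", timestamp=1700000000.0)
lossy = fork_lossy(original)
kept = fork_preserving(original)
```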

Also related (same structural pattern, different domain)

| Bug | Domain | Same root cause |
| --- | --- | --- |
| #28598 | Display | build_tool_preview() hardcoded if-elif chain: new tool → forgotten entry → #28621 |
| #28663 | Gateway | Exec quick commands blocked during drain: one path checked, other didn't |

Why individual fixes don't scale

Each of the 5 bugs we fixed required ~30 lines of targeted code. But fixing #28046 (max_tokens) did nothing to prevent #28863 (docker_extra_args) — they're the same structural defect manifesting in different config keys. Every new config field or code path is a ticking time bomb until someone reports it.

The pattern will keep recurring as long as producer→consumer bindings remain implicit.

Proposed Solution

A lightweight contract layer that makes bindings declarative and verifiable — not a big-bang rewrite, but an incremental rollout:

Phase 1: Config Field Registry + Startup Validator

Define a registry mapping config keys to their expected consumers:

CONFIG_CONTRACTS = {
    "terminal.docker_extra_args": {
        "env_var": "TERMINAL_DOCKER_EXTRA_ARGS",
        "consumer": "gateway.run._terminal_env_map",
        "type": list[str],
    },
    "custom_providers.*.models.*.max_tokens": {
        "consumer": "agent.agent_init",
        "fallback": 4096,
        "type": int,
    },
}

On startup, validate every declared field has at least one active consumer. Emit warnings (not errors) for:

  • Orphan fields (declared in config, no consumer registered)
  • Stale bindings (consumer references a field that no longer exists)

Zero behavior change — purely additive observability.
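
A minimal sketch of what the startup validator could look like, assuming the set of declared config fields is available as dotted key patterns and the registry maps each pattern to a consumer name. `DECLARED_FIELDS`, `CONTRACTS`, `validate_contracts`, and the `terminal.legacy_shell` entry are hypothetical; only the two checks (orphan fields, stale bindings) come from the proposal above.

```python
import logging

log = logging.getLogger("config_contracts")

# Assumption: fields the config schema declares, as dotted key patterns.
DECLARED_FIELDS = {
    "terminal.docker_extra_args",
    "terminal.gpu_layers",  # new field, nobody consumes it yet
    "custom_providers.*.models.*.max_tokens",
}

# Assumption: simplified registry, config key pattern -> consumer name.
CONTRACTS = {
    "terminal.docker_extra_args": "gateway.run._terminal_env_map",
    "custom_providers.*.models.*.max_tokens": "agent.agent_init",
    "terminal.legacy_shell": "gateway.run._terminal_env_map",  # field was removed
}


def validate_contracts(declared: set, contracts: dict) -> tuple:
    """Return (orphans, stale) and log a warning for each entry; never raises."""
    bound = set(contracts)
    orphans = sorted(declared - bound)  # declared, but no consumer registered
    stale = sorted(bound - declared)    # consumer bound to a missing field
    for key in orphans:
        log.warning("orphan config field (no consumer registered): %s", key)
    for key in stale:
        log.warning("stale binding: %s references missing field %s",
                    contracts[key], key)
    return orphans, stale


orphans, stale = validate_contracts(DECLARED_FIELDS, CONTRACTS)
```

Because the function only logs and returns, wiring it into startup would not change behavior, which matches the "purely additive" goal.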

Phase 2: Typed Hook Payloads

Replace ad-hoc kwargs in plugin hook invocations with typed payloads:

from dataclasses import dataclass

@dataclass
class PreToolCallPayload:
    session_id: str
    tool_call_id: str
    tool_name: str
    tool_input: dict
    # Future fields added here automatically propagate

# Single entry point constructs the payload, so all call sites get every field

This is the original scope of this FR (#28961, #25204, #7344). A typed payload, or an equivalent Protocol-based interface, makes any new field added to the payload structurally visible to all consumers.
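
The "single entry point" idea might be sketched as follows. The registry list, `register_pre_tool_call`, and `fire_pre_tool_call` are hypothetical names, not the existing plugin API; the sketch only shows that keyword-only arguments make a call site that omits a field fail loudly instead of silently.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class PreToolCallPayload:
    session_id: str
    tool_call_id: str
    tool_name: str
    tool_input: dict


# Hypothetical in-process hook registry.
_pre_tool_call_hooks: list[Callable[[PreToolCallPayload], None]] = []


def register_pre_tool_call(hook):
    """Register a hook that receives the full typed payload."""
    _pre_tool_call_hooks.append(hook)
    return hook


def fire_pre_tool_call(*, session_id: str, tool_call_id: str,
                       tool_name: str, tool_input: dict) -> PreToolCallPayload:
    """The only place the payload is built; every call site goes through here."""
    payload = PreToolCallPayload(session_id, tool_call_id, tool_name, tool_input)
    for hook in list(_pre_tool_call_hooks):
        hook(payload)
    return payload
```

With this shape, a call site that forgets `session_id` raises `TypeError` the first time it runs, rather than handing plugins an incomplete set of kwargs.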

Phase 3: Path Parity Tests

A test helper that verifies: if path A (startup) reads/configures field X, then path B (/model switch, gateway restart, fallback) must also read X.

def assert_config_path_parity(field: str, paths: list[str]):
    """Fail CI if any declared path doesn't consume the field."""

This turns "path divergence" bugs into CI failures before they reach users.
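
One possible way to implement such a helper, sketched under assumptions: each path is represented as a callable that takes a flat config mapping, and "consumes the field" means "reads the key at least once". The signature differs from the stub above (it takes a name-to-callable dict and the config), and `RecordingConfig`, `startup`, and `model_switch` are hypothetical.

```python
class RecordingConfig(dict):
    """dict that remembers which keys were read through [] or .get()."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.reads = set()

    def __getitem__(self, key):
        self.reads.add(key)
        return super().__getitem__(key)

    def get(self, key, default=None):
        self.reads.add(key)
        return super().get(key, default)


def assert_config_path_parity(field: str, paths: dict, config: dict) -> None:
    """Fail if any named path runs without reading `field` from the config."""
    missing = []
    for name, run_path in paths.items():
        cfg = RecordingConfig(config)  # fresh recorder per path
        run_path(cfg)
        if field not in cfg.reads:
            missing.append(name)
    if missing:
        raise AssertionError(f"{field!r} not consumed by: {', '.join(missing)}")


# Hypothetical paths: startup honours the fallback, the switch path forgot it.
def startup(cfg):
    return cfg["model"], cfg["fallback_model"]


def model_switch(cfg):
    return cfg["model"]
```

Calling `assert_config_path_parity("fallback_model", {"startup": startup, "model_switch": model_switch}, config)` would raise and name `model_switch`, which is the kind of divergence behind #28753 and #28023.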


What this prevents

| Future scenario | Without contract | With contract |
| --- | --- | --- |
| New config field terminal.gpu_layers added | Developer forgets to add to env_map → silent drop until user reports | Registry shows orphan field at startup → caught immediately |
| New hook param retry_count added to pre_tool_call | Works in sequential path, forgotten in concurrent path | Typed payload means all paths get all fields |
| Model switch adds new state field | Init preserves it, switch loses it | Path parity test fails in CI |
| New tool added to plugin | build_tool_preview() shows generic output (#28598) | Declarative preview registered at tool definition time (#28621) |

Existing proof of concept

The generic get_custom_provider_model_field() function introduced in PR #28988 (fixing #28046) is a working micro-example of this approach: instead of separate get_custom_provider_context_length() and get_custom_provider_max_tokens() functions with duplicated logic, we extracted a single generic lookup that any new per-model config field can use without writing new bridge code.
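
For readers who have not seen the PR, the shape of such a generic lookup might resemble the sketch below. This is not the function from PR #28988: `lookup_model_field`, its signature, and the assumed config layout (a `custom_providers` list whose entries have a `name` and a `models` mapping) are illustrative guesses.

```python
from typing import Any


def lookup_model_field(config: dict, provider: str, model: str,
                       field: str, default: Any = None) -> Any:
    """Return custom_providers[<provider>].models[<model>][<field>], or default."""
    for entry in config.get("custom_providers", []):
        if entry.get("name") != provider:
            continue
        model_cfg = entry.get("models", {}).get(model, {})
        if field in model_cfg:
            return model_cfg[field]
    return default


config = {
    "custom_providers": [
        {"name": "local", "models": {"llama": {"max_tokens": 8192}}},
    ]
}
max_tokens = lookup_model_field(config, "local", "llama", "max_tokens", 4096)
context_len = lookup_model_field(config, "local", "llama", "context_length")
```

A new per-model field then needs no dedicated getter: callers pass the field name and a fallback.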


Fixes (as evidence): #28046, #28961, #28841, #28663, #28296
Would prevent: #28863, #28662, #28034, #28753, #28651, #28489, #28023, #28746, #28637, #28825, #28632, #28055
Related: #27342 (complementary), #28621 (same pattern, different domain)
