Skip to content

feat(delegate_tool): per-task model/provider override#43134

Closed
rafaumeu wants to merge 1 commit into
NousResearch:mainfrom
rafaumeu:feat/per-task-model-override-18591
Closed

feat(delegate_tool): per-task model/provider override#43134
rafaumeu wants to merge 1 commit into
NousResearch:mainfrom
rafaumeu:feat/per-task-model-override-18591

Conversation

@rafaumeu

@rafaumeu rafaumeu commented Jun 9, 2026

Copy link
Copy Markdown

Summary

Allows each task in a delegate_task batch to override the model and/or provider independently. Addresses issue #18591.

Problem

When delegating multiple tasks via delegate_task, all children share the same model/provider resolved from the top-level delegation config. There is no way to route individual tasks to different providers — for example, sending research tasks to a fast/cheap model while keeping code review on a stronger one.

Solution

Add two optional fields to each task item:

Field Type Description
model string Model name override (e.g. gemini-2.5-flash). Falls back to delegation config, then parent.
provider string Provider name override (e.g. groq, openrouter). Triggers independent credential resolution.

Fallback chain

per-task field  >>>  delegation config  >>>  parent agent model

If neither field is specified, behavior is identical to before — fully backward compatible.

Implementation

_resolve_task_credentials (new helper):

  • Fast path: same provider or no provider — reuses delegation credentials with model swapped in.
  • Slow path: different provider — resolves a fresh credential bundle via _resolve_delegation_credentials.
  • Returns an error string on failure instead of raising.

Dispatch loop change: before _build_child_agent, checks for per-task fields and resolves credentials when present. Falls through to delegation credentials otherwise.

Files changed

File Change
tools/delegate_tool.py Schema: model/provider properties on task items. Dispatch: per-task credential resolution. New _resolve_task_credentials helper.
tests/tools/test_delegate_task_model_override.py 14 new tests.

Tests

Suite Result
New tests 14 passed
Existing test_delegate.py 140 passed
Total 154 passed, 0 failed

Example

{
  "tasks": [
    {
      "goal": "Research topic X",
      "model": "gemini-2.5-flash",
      "provider": "google"
    },
    {
      "goal": "Write summary",
      "model": "glm-5-turbo"
    },
    {
      "goal": "Review code"
    }
  ]
}

Task 1 routes to Google Gemini. Task 2 overrides the model but keeps the current provider. Task 3 inherits everything as usual.

Closes #18591

Add optional model and provider fields to individual tasks in delegate_tool batch mode. Fallback chain: per-task field > delegation config > parent model.

Closes NousResearch#18591
@alt-glitch alt-glitch added type/feature New feature or request P3 Low — cosmetic, nice to have tool/delegate Subagent delegation labels Jun 9, 2026
@ricardocamiloconsir

Copy link
Copy Markdown

Multi-Provider Code Review — PR #43134

PR: feat(delegate_tool): per-task model/provider override
Author: rafaumeu | +89/-5 (prod), +534 (tests)
Reviewers: 8 independent reviews (DeepSeek, xAI/Grok, Groq/Llama, NVIDIA/Qwen, GitHub/GPT-4.1, Google/Gemini, Anthropic/Claude, Mistral)
Consolidated by: GLM-5.1 via ZAI provider


Verdict: APPROVED WITH WARNINGS

The implementation is well-structured, backward-compatible, and follows existing codebase patterns. All 8 reviewers converge on the same core assessment: the feature is sound but has a confirmed bug and several design-level concerns that warrant attention. No security blockers. One correctness bug (stale reference) and one design concern (batch abort semantics) should be addressed before or shortly after merge.


Consensus Items (all 8 reviewers agree)

  • Credential handling is correct and safe. Shallow copy via dict(delegation_creds) is sufficient because the credential dict is architecturally flat (all string values). No cross-task mutation risk. — confidence: HIGH
  • Broad except Exception at line 119 needs narrowing or logging. All 8 reviewers flag this. It catches TypeError, AttributeError, ImportError, etc. and silently converts them to user-facing error strings, masking genuine bugs. At minimum, add logger.exception() before returning the error string; ideally narrow to (ValueError, KeyError, LookupError). — confidence: HIGH
  • dict | str return type is a code smell. All 8 reviewers note that _resolve_task_credentials returning a credential dict on success or an error string on failure, with the caller discriminating via isinstance(task_creds, str), is fragile and not Pythonic. 6/8 reviewers specifically suggest replacing it with a custom exception (TaskCredentialError(ValueError)) to align with the existing _resolve_delegation_credentials pattern. — confidence: HIGH
  • Fail-fast batch abort needs documentation or redesign. All 8 reviewers flag that a credential resolution failure for task N in a batch immediately returns tool_error(...) and aborts the entire batch, orphaning already-spawned children from tasks 0..N-1. This should either be documented explicitly in the docstring or restructured (pre-validate all tasks, or collect errors and continue). — confidence: HIGH
  • Naming is clear and consistent. _resolve_task_credentials, effective_model, task_provider, task_creds all follow codebase conventions and are self-documenting. — confidence: HIGH
  • Architecture fit is clean. The fallback chain (per-task → delegation config → parent agent) resolves each field independently, reuses _resolve_delegation_credentials for the slow path, and adds zero overhead to the common case (no override). — confidence: HIGH
  • No security blockers. API keys are never logged or exposed in error messages. Thread safety is acceptable (each task constructs its own credential dict). — confidence: HIGH
  • Good unit test coverage (14 tests, 534 lines) covering fast path, slow path, error path, and fallback chain combinations. — confidence: HIGH

Warnings (majority flag)

  • Broad except Exception swallows unexpected errors — 8/8 reviewers flag this. DeepSeek rates it BLOCKING; the rest rate it MEDIUM/HIGH. Recommended fix: narrow to (ValueError, KeyError) or add logger.exception() as a safety net.
  • Fail-fast aborts entire batch on single-task error — 8/8 reviewers flag this. GitHub/GPT-4.1 calls it a SIGNIFICANT FINDING. Google/Gemini suggests pre-validating all task credentials before spawning any children as the simplest fix.
  • dict | str return type is an anti-pattern — 8/8 reviewers flag this. Anthropic/Claude calls it "the most significant design concern." Google/Gemini calls it "the most significant smell." Consensus suggestion: raise TaskCredentialError(ValueError) instead of returning error strings.
  • Missing integration tests — 6/8 reviewers note the 14 tests are all mock-based with no end-to-end validation of credential routing. A single integration test verifying the slow path with a stub provider would close this gap.
  • Schema changes not included in review diff — 3/8 reviewers (xAI, NVIDIA, Mistral) flag that _build_dynamic_schema_overrides changes are mentioned in the PR description but not in the diff. These should be verified to ensure model and provider are properly declared as optional string fields with descriptions.
  • Input validation gap — 3/8 reviewers (Google/Gemini, Anthropic/Claude, Mistral) flag that task_model and task_provider are taken directly from user-supplied task dicts with no sanitization. Whitespace-only strings like " " pass the truthiness check and enter credential resolution. Suggested fix: add .strip() or basic format validation.
  • Test boilerplate — 2/8 reviewers (xAI, Anthropic) note ~38 lines per test with repeated mock setup. Extracting shared fixtures (make_delegation_creds(), make_mock_parent()) would cut test code by ~40%.

Discordances (reviewers disagree)

  • Stale creds reference bug (unique to xAI/Grok). Reviewer 2 identified that override_acp_command still references creds.get("command") instead of task_creds.get("command") after per-task credential resolution. This means when a task overrides the provider, the ACP command falls back to the OLD provider's command, not the per-task one. No other reviewer caught this bug. If the stale reference is confirmed in the actual code, this is a correctness issue that should be fixed. — 1/8 identified; needs human verification of the actual codebase
  • Severity of except Exception differs. DeepSeek rates it BLOCKING (merge must wait); 7 others rate it MEDIUM or SHOULD FIX (merge with follow-up). The broader consensus leans toward SHOULD FIX, not a hard blocker.
  • Thread safety nuance. Google/Gemini raises a theoretical concern: if _build_child_agent mutates the credential dict in-place, concurrent tasks on the same provider could interfere. The other 7 reviewers assert thread safety is fine. This is a theoretical concern, not a confirmed bug — but worth a defensive .copy() comment.
  • Sentinel value: "" vs None. Anthropic/Claude suggests using None instead of "" as the sentinel for "not specified" (lines 39-40), arguing None is the canonical Python sentinel. The other 7 reviewers accept the or "" pattern with a brief comment. This is a style preference, not a correctness issue.
  • Pre-validation vs skip-on-error for batch mode. Google/Gemini suggests pre-validating all task credentials before spawning any children (two-pass approach). GitHub/GPT-4.1 and DeepSeek suggest collecting errors and continuing with remaining tasks. xAI/Grok suggests documenting fail-fast as intentional. No consensus on the preferred fix — this needs a team decision.

Action Items (prioritized)

  1. [BLOCKER] Verify and fix the stale creds reference in override_acp_command (xAI/Grok finding). Change creds.get("command")task_creds.get("command") if confirmed. Add a regression test for per-task provider override with different ACP commands.

  2. [SHOULD FIX] Narrow the except Exception at line 119. Either:

    • Catch only (ValueError, KeyError, LookupError) and let unexpected errors propagate, OR
    • Add logger.exception("Unexpected error resolving per-task provider %s", task_provider) before the return to preserve full tracebacks in logs.
      (7/8 reviewers would accept merge with a follow-up issue; 1/8 considers it blocking.)
  3. [SHOULD FIX] Document the fail-fast batch abort semantics. Add a comment near line 52 and a note in the docstring explaining that a per-task credential resolution failure aborts the entire batch. If the team decides to change this behavior, file a follow-up issue for pre-validation or partial-failure handling.

  4. [SHOULD FIX] Add logging to the broad exception handler. Even if except Exception is kept temporarily, logger.exception() ensures infrastructure errors are captured in observability tooling rather than silently lost.

  5. [NICE TO HAVE] Refactor _resolve_task_credentials to raise a TaskCredentialError(ValueError) instead of returning error strings. This aligns with the existing _resolve_delegation_credentials pattern and eliminates the isinstance(task_creds, str) check. 6/8 reviewers recommend this as a follow-up.

  6. [NICE TO HAVE] Add .strip() to model/provider inputs (lines 39-40) to handle whitespace-only strings: task_model = (t.get("model") or "").strip() or "".

  7. [NICE TO HAVE] Add 1-2 integration tests or contract tests that verify the full path from task dict → credential resolution → child agent construction. Verify the slow path calls _resolve_delegation_credentials with correct arguments.

  8. [NICE TO HAVE] Extract shared test fixtures to reduce boilerplate. A make_delegation_creds() and make_mock_parent() fixture would cut ~40% of test code.

  9. [NICE TO HAVE] Verify that _build_dynamic_schema_overrides properly declares model and provider as optional string fields with descriptions. This was not included in the review diff.


Positive Observations

  • Excellent fallback chain design. The per-task → delegation config → parent agent resolution is intuitive, composable, and resolves each field independently. All 8 reviewers praise this.
  • Zero overhead for existing callers. When neither task_model nor task_provider is set, the code falls through to task_creds = creds — a single else-branch. No performance regression for the common case.
  • Smart fast-path optimization. When the provider matches, the code reuses the existing credential dict with a shallow copy and model swap, avoiding unnecessary provider re-resolution.
  • Thorough docstrings. The fallback chain is documented inline with numbered priority levels. The return type contract is stated explicitly.
  • Strong DRY compliance. The slow path delegates to _resolve_delegation_credentials rather than duplicating provider resolution logic. The helper pattern is a good template for future per-task overrides.
  • Healthy test-to-code ratio. 534 lines of test for 89 lines of production code (6:1 ratio) demonstrates strong testing discipline.
  • Clean backward compatibility. Both fields are optional; omitting them preserves existing behavior exactly. No breaking changes to _build_child_agent's signature.

Reviewer Scorecard

Reviewer Verdict Unique Finding
DeepSeek WARNING Rates except Exception as BLOCKING; flags model name validation
xAI/Grok WARNING Identifies stale creds reference bug in override_acp_command
Groq/Llama WARNING Notes the pattern is reusable for future per-task overrides
NVIDIA/Qwen WARNING Suggests type alias TaskCredentialResult = dict | str as minimal fix
GitHub/GPT-4.1 WARNING Flags orphaned child agents from already-spawned tasks
Google/Gemini WARNING Suggests pre-validation two-pass approach; flags whitespace strings
Anthropic/Claude WARNING Suggests None sentinel instead of ""; most detailed exception refactor
Mistral WARNING Rates 7.5/10; notes schema diff gap

Consolidated by GLM-5.1 (ZAI provider) from 8 independent reviews on 2026-06-09.

@rafaumeu

Copy link
Copy Markdown
Author

Review Feedback Addressed

Thanks for the thorough multi-provider review! All actionable items have been fixed in PR #43185:

Priority Item Status
BLOCKER Stale credstask_creds reference Fixed + regression test
SHOULD FIX Narrow except Exception + logger.exception() Fixed
SHOULD FIX Fail-fast batch abort documented Fixed
NICE TO HAVE .strip() on model/provider inputs Fixed + test

16/16 tests passing. Supersedes this PR.

Follow-up items (not included, can be addressed post-merge):

  • Refactor _resolve_task_credentials to raise TaskCredentialError(ValueError) instead of returning error strings (6/8 reviewers recommended)
  • Extract shared test fixtures to reduce boilerplate
  • Add integration tests for the full credential routing path

@yimisunrise

Copy link
Copy Markdown

+1 This feature will help enable cost-effective multi-agent workflows by allowing task-appropriate model selection. Hope to see this land soon.

@rafaumeu

Copy link
Copy Markdown
Author

Closing in favor of #43185 which includes the original feature plus all review fixes (stale creds reference, narrowed exception handling, .strip() validation, documented fail-fast semantics). 16/16 tests passing.

@rafaumeu rafaumeu closed this Jun 12, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

P3 Low — cosmetic, nice to have tool/delegate Subagent delegation type/feature New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Feature: Per-task model override for delegate_task subagents

4 participants