
v0.41.22.1 feat: brainstorm/lsd judge fixes (closes #1540 end-to-end)#1562

Merged
garrytan merged 7 commits into master from garrytan/brainstorm-judge-fixes
May 27, 2026

@garrytan

Owner

Summary

Closes the brainstorm/lsd judge failure that has been silently returning judge_failed: true since the calibration cold-start landed. Productionizes PR #1540 (closed as superseded), expands scope from 2 patched sites to 5, and extends the gateway resolver so the fix actually works end-to-end.

Per-commit:

  • feat(core) — new splitProviderModelId centralizer for pricing-side parsing
  • feat(gateway) — model-resolver.ts:parseModelId extended to accept slash form (closes the load-bearing gap Codex caught: pricing fix alone left gateway throwing on slash-form mid-judge)
  • refactor — 5 pricing/config sites routed through the centralizer
  • feat(brainstorm) — judge maxTokens scales with idea count + respects each model's actual output cap via ANTHROPIC_OUTPUT_CAPS (closes the headline truncation bug)
  • chore — VERSION + CHANGELOG + TODOS
  • docs — README + CLAUDE.md + llms-full.txt sync

Test Coverage

[+] src/core/model-id.ts (NEW splitProviderModelId)        16 cases
[+] src/core/ai/model-resolver.ts (slash extension)        10 cases (new file)
[+] src/core/anthropic-pricing.ts (estimateMaxCostUsd)      7 cases (new file)
[+] src/core/budget/budget-tracker.ts (lookupPricing)       2 cases (extended)
[+] src/core/eval-contradictions/cost-tracker.ts            6 cases (new file)
[+] src/core/minions/batch-projection.ts                    2 cases (extended)
[+] src/core/model-config.ts:isAnthropicProvider            2 cases (extended)
[+] src/core/brainstorm/judges.ts (computeJudgeMaxTokens)  16 cases (new file)
                                                           ─────────
                                                            61 new test cases

Targeted suite: 190/190 pass. Full unit suite: pre-existing parallel-cross-contamination flakes only (schema-cli + 2 autopilot-cycle-handler tests, all pass standalone, present on master).

Pre-Landing Review

No issues found in code review. The Codex adversarial pass caught 4 substantive issues during /ship; all were absorbed into the shipped version:

  1. Original 32K maxTokens cap was unsafe for legacy 3.5 models (8K cap) → fixed via ANTHROPIC_OUTPUT_CAPS per-model map
  2. splitProviderModelId defensive contract was test-only → moved into the type signature
  3. Pricing fix alone would let BudgetTracker pass but gateway would still throw → extended model-resolver.ts:parseModelId to also accept slash form
  4. computeJudgeMaxTokens cap looked at modelOverride not the actual chat model → routes through getChatModel() when override is undefined

Plan Completion

All 13 plan-eng-review decisions (D1-D13) implemented. 3 follow-up TODOs filed in TODOS.md for the explicitly-deferred items (config-write normalization, non-Anthropic pricing tables, eval-contradictions duplicate pricing table consolidation).

Plan file: ~/.claude/plans/system-instruction-you-are-working-joyful-river.md

Verification Results

# Pre-fix this would silently exit with judge_failed:
gbrain brainstorm "what should I work on next" --max-cost 1
# Should now show: scored idea list with "passing N/M" > 0

# Pre-fix this would refuse to start with BudgetExhausted no_pricing,
# OR pass pricing then throw AIConfigError at gateway:
gbrain brainstorm "topic" --judge-model anthropic/claude-sonnet-4-6 --max-cost 1
# Should now run end-to-end to completion

E2E suite NOT required (no DB-shape changes; the unit suite covers the judge integration via the chatFn DI seam, and the new gateway resolver tests pin the resolveRecipe round-trip).

Documentation

  • README.md — added a v0.41.21.0 user-facing callout for the brainstorm judge_failed: true + slash-form --judge-model fixes, mirroring the v0.41.19.0 callout shape.
  • CLAUDE.md — added v0.41.21.0 annotations to two existing key-files entries (src/core/model-config.ts:isAnthropicProvider now routes through splitProviderModelId; src/core/brainstorm/{...,judges}.ts now scales maxTokens per-model via ANTHROPIC_OUTPUT_CAPS). Added NEW key-files entries for src/core/model-id.ts and the src/core/ai/model-resolver.ts:parseModelId slash-form extension.
  • llms-full.txt — regenerated to absorb the CLAUDE.md + README changes (passes the test/build-llms.test.ts drift guard).
  • CHANGELOG.md — v0.41.21.0 entry follows the ELI10 voice rules (lead with what users can now do, table comparing pre/post behavior, "what we caught and fixed before merging" section).
  • TODOS.md — 3 v0.41.21.0 follow-ups filed during ship.

TODOS

3 new P2/P3 items filed under "v0.41.21.0 brainstorm judge fix-wave follow-ups":

  • Config-write normalization (canonicalize provider IDs to : form on config write)
  • Non-Anthropic pricing tables (OpenAI / Gemini / OpenRouter pricing surface)
  • Eval-contradictions duplicate ANTHROPIC_PRICING consolidation

These were explicitly deferred per the plan's Step 0 scope decision (Option A: ship the centralizer + 5 sites cleanly, defer the full pricing-system DRY).

Credit

Thanks to @garrytan-agents whose original bug report and first-pass diff in PR #1540 drove the whole investigation. Their pricing-side patch is incorporated in spirit (centralized + extended to 3 more sites); commits route through the new centralizer so their literal lines aren't in this PR's history, but credit lives in the CHANGELOG entry and this PR body per the attribution discussion in plan-eng-review D10.

Test plan

  • bun run verify (28 checks green)
  • bun run typecheck clean
  • Targeted suite 190/190 pass
  • Full unit suite: only pre-existing master flakes fail (schema-cli + autopilot-cycle-handler)
  • Hand-verify after merge: gbrain brainstorm "topic" --judge-model anthropic/claude-sonnet-4-6 --max-cost 1
  • Close PR #1540 ("fix: brainstorm/lsd judge truncation + slash-prefix pricing lookup") with a link to this superseding PR

🤖 Generated with Claude Code

garrytan and others added 6 commits May 27, 2026 07:00
…sing

New pure helper in src/core/model-id.ts that splits provider:model,
provider/model, and bare model strings into a {provider, model} pair.
Defensive contract: null/undefined/empty/whitespace returns
{provider: null, model: ''}.

Will be wired into the 5 pricing/budget sites in the next commit.
Named splitProviderModelId (not parseModelId) to avoid the in-project
collision with the gateway-side src/core/ai/model-resolver.ts:parseModelId
which has a different bare-name contract.

Pinned by 16 cases in test/model-id.test.ts covering all separator
forms plus defensive + edge inputs.
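The contract described above can be sketched as follows. This is a minimal illustration, not the shipped src/core/model-id.ts: the SplitModelId type name, the colon-before-slash precedence on the pricing side, and the handling of a leading separator are assumptions made for the example.

```typescript
// Sketch only; the real helper may differ in naming and edge-case handling.
interface SplitModelId {
  provider: string | null;
  model: string;
}

function splitProviderModelId(id: string | null | undefined): SplitModelId {
  // Defensive contract from the commit message:
  // null / undefined / empty / whitespace -> {provider: null, model: ''}.
  const trimmed = (id ?? "").trim();
  if (trimmed === "") return { provider: null, model: "" };

  // Assumption: the first colon wins over the first slash, so nested ids
  // such as openrouter:anthropic/claude-sonnet-4.6 keep the slash in the model.
  const colon = trimmed.indexOf(":");
  const sep = colon !== -1 ? colon : trimmed.indexOf("/");

  // No separator (or a leading one, which is an assumption here): bare model.
  if (sep <= 0) return { provider: null, model: trimmed };

  return { provider: trimmed.slice(0, sep), model: trimmed.slice(sep + 1) };
}
```

Under these assumptions, anthropic:claude-sonnet-4-6 and anthropic/claude-sonnet-4-6 both split into provider "anthropic" and model "claude-sonnet-4-6", while a bare claude-sonnet-4-6 yields a null provider.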

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

src/core/ai/model-resolver.ts:parseModelId now accepts both
provider:model (colon) and provider/model (slash) forms. Colon wins
when both separators present so OpenRouter nested ids like
openrouter:anthropic/claude-sonnet-4.6 route as
{providerId: 'openrouter', modelId: 'anthropic/claude-sonnet-4.6'}.

Pre-fix: every gateway entry point (chat / embed / rerank) threw
AIConfigError 'missing a provider prefix' on slash form ids. That
meant CLI users running

  gbrain brainstorm --judge-model anthropic/claude-sonnet-4-6

would still fail mid-judge with AIConfigError even after pricing
was relaxed to accept slash form. Closes the end-to-end bug class.

Bare names without ANY separator still throw — gateway routing
always needs an explicit provider. Existing tests pinning that
throw (test/ai/capabilities.test.ts:43) stay green.

Pinned by 10 cases in test/ai/model-resolver-slash.test.ts
including a resolveRecipe round-trip that slash and colon forms
land on the same recipe.
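The gateway-side rule described in this commit (colon wins, slash accepted, bare names throw) could look roughly like the sketch below. The ParsedModelId type, the local AIConfigError class, the exact error wording, and the rejection of a trailing separator are assumptions for illustration; the real model-resolver.ts may differ.

```typescript
// Stand-in error class; the project's AIConfigError is assumed to be similar.
class AIConfigError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "AIConfigError";
  }
}

interface ParsedModelId {
  providerId: string;
  modelId: string;
}

function parseModelId(id: string): ParsedModelId {
  // Colon takes precedence when both separators are present.
  const colon = id.indexOf(":");
  const sep = colon !== -1 ? colon : id.indexOf("/");

  // Gateway routing always needs an explicit provider, so a bare name throws.
  // Rejecting an empty provider or empty model is an assumption of this sketch.
  if (sep <= 0 || sep === id.length - 1) {
    throw new AIConfigError(`model id "${id}" is missing a provider prefix`);
  }
  return { providerId: id.slice(0, sep), modelId: id.slice(sep + 1) };
}
```

With this shape, anthropic/claude-sonnet-4-6 and anthropic:claude-sonnet-4-6 parse to the same pair, which is what lets both forms land on the same recipe.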

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Five sites had inline ':'-only provider-prefix splits that silently
missed slash-form ids. Centralizing through splitProviderModelId
closes the bug class:

  - src/core/anthropic-pricing.ts:estimateMaxCostUsd
  - src/core/budget/budget-tracker.ts:lookupPricing (closes the
    headline BudgetExhausted no_pricing failure on --max-cost +
    slash-form --judge-model)
  - src/core/eval-contradictions/cost-tracker.ts:pricingFor
    (legacy silent-Haiku fallback preserved per plan D9)
  - src/core/minions/batch-projection.ts (deleted bareModel inline
    helper; inlined splitProviderModelId at 2 call sites)
  - src/core/model-config.ts:isAnthropicProvider (silently fixed
    v0.31.12 subagent-guard bypass for slash-form Anthropic ids)

Test gates land together so any bisect step is green:

  - NEW test/anthropic-pricing.test.ts (7 cases including structural
    regression guard: every ANTHROPIC_PRICING key reachable via all
    three forms)
  - NEW test/eval-contradictions/cost-tracker-slash.test.ts (6 cases
    including legacy-Haiku-fallback pin)
  - EXTENDED test/batch-projection.test.ts (slash + double-separator
    cases)
  - EXTENDED test/model-config.serial.test.ts (2 slash-form
    isAnthropicProvider cases)
  - EXTENDED test/core/budget/budget-tracker.test.ts (2 slash + colon
    reserve() cases)

Behavior changes for slash-prefix ids only; bare and colon ids
unchanged.
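The structural regression guard mentioned above (every pricing key reachable via all three forms) might be sketched like this. The PRICING table, its keys and prices, and the bareModel stand-in for splitProviderModelId are placeholders invented for the example, not the project's ANTHROPIC_PRICING data.

```typescript
interface Price {
  inputPerMTok: number;
  outputPerMTok: number;
}

// Placeholder table: keys and numbers are illustrative only.
const PRICING: Record<string, Price> = {
  "claude-sonnet-4-6": { inputPerMTok: 3, outputPerMTok: 15 },
  "claude-haiku-4-5": { inputPerMTok: 1, outputPerMTok: 5 },
};

// Compact stand-in for the centralizer: strip an optional provider prefix,
// preferring the colon when both separators appear.
function bareModel(id: string): string {
  const trimmed = id.trim();
  const colon = trimmed.indexOf(":");
  const sep = colon !== -1 ? colon : trimmed.indexOf("/");
  return sep === -1 ? trimmed : trimmed.slice(sep + 1);
}

// Returns undefined for unknown models instead of guessing a price.
function lookupPricing(id: string): Price | undefined {
  return PRICING[bareModel(id)];
}

// Lists every bare / colon / slash form that fails to reach its table entry.
function unreachableForms(): string[] {
  const missing: string[] = [];
  for (const key of Object.keys(PRICING)) {
    for (const form of [key, `anthropic:${key}`, `anthropic/${key}`]) {
      if (lookupPricing(form) !== PRICING[key]) missing.push(form);
    }
  }
  return missing;
}
```

A test in this style asserts that unreachableForms() is empty, so adding a new table key cannot silently reintroduce a colon-only lookup.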

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Replace the hard-coded maxTokens: 4000 with computeJudgeMaxTokens
that scales with idea count and respects each model's actual output
cap.

Pre-fix: any judge call with 36+ ideas produced ~100 tokens/idea of
JSON that got truncated mid-output. parseJudgeJSON threw, orchestrator
surfaced judge_failed: true, all ideas saved unscored. Verified
failure mode on 72-idea fixture: 0/72 passing before, 39/72 after.

Formula: min(modelCap, max(LEGACY_MIN_MAX_TOKENS, ideaCount*150+500))

Named constants extracted at top of judges.ts:

  - TOKEN_BUDGET_PER_IDEA = 150 (1.5x headroom over observed ~100/idea)
  - TOKEN_BUDGET_ENVELOPE = 500 (JSON wrapper)
  - LEGACY_MIN_MAX_TOKENS = 4000 (pre-fix floor preserved for 1-idea)
  - MAX_OUTPUT_TOKENS_CEIL = 32_000 (fallback when model unknown)
  - ANTHROPIC_OUTPUT_CAPS (per-model: Opus 4.7 = 32K, Sonnet 4.6 /
    Haiku 4.5 = 64K, legacy 3.5 = 8K)

When the caller passes no modelOverride, the cap routes through the
gateway's actual configured chat model via getChatModel() so the
formula matches what chat() will use, not whatever the override
hints at. Pre-fix the undefined-override case fell back to 32K even
if the configured default was a legacy 8K model.

Pinned by 16 cases in test/brainstorm/judges-maxtokens.test.ts:
formula at 1/10/36/96/200/300 ideas, per-model cap binding (Haiku 3.5
8K, Opus 4.7 32K, Sonnet 4.6 64K), and integration via runJudge with
a stubbed chatFn that captures ChatOpts.maxTokens.
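As a worked illustration of the formula and constants above, here is a minimal sketch. The constant names and values come from this commit message; the model keys in the cap map, the exact cap numbers (8K is taken as 8,192 here), and the injected getChatModel parameter are assumptions, and the shipped judges.ts signature may differ.

```typescript
const TOKEN_BUDGET_PER_IDEA = 150;     // 1.5x headroom over ~100 tokens/idea
const TOKEN_BUDGET_ENVELOPE = 500;     // JSON wrapper
const LEGACY_MIN_MAX_TOKENS = 4000;    // pre-fix floor
const MAX_OUTPUT_TOKENS_CEIL = 32_000; // fallback when the model is unknown

// Assumption: illustrative keys and approximate caps, not the real map.
const ANTHROPIC_OUTPUT_CAPS: Record<string, number> = {
  "claude-opus-4-7": 32_000,
  "claude-sonnet-4-6": 64_000,
  "claude-haiku-4-5": 64_000,
  "claude-3-5-haiku": 8_192,
};

function computeJudgeMaxTokens(
  ideaCount: number,
  modelOverride: string | undefined,
  getChatModel: () => string, // hypothetical stand-in for the gateway lookup
): number {
  // With no override, cap against the model chat() will actually use.
  const model = modelOverride ?? getChatModel();
  const modelCap = ANTHROPIC_OUTPUT_CAPS[model] ?? MAX_OUTPUT_TOKENS_CEIL;
  const wanted = ideaCount * TOKEN_BUDGET_PER_IDEA + TOKEN_BUDGET_ENVELOPE;
  // min(modelCap, max(LEGACY_MIN_MAX_TOKENS, ideaCount*150+500))
  return Math.min(modelCap, Math.max(LEGACY_MIN_MAX_TOKENS, wanted));
}
```

Under these assumed caps, 36 ideas request 36*150+500 = 5,900 tokens, 96 ideas request 14,900 (clamped to 8,192 on a legacy 3.5 model), and 300 ideas request 45,500 (clamped to 32,000 on Opus 4.7, unclamped on Sonnet 4.6).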

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Brainstorm judge fix-wave: closes #1540 end-to-end. splitProviderModelId
centralizer + gateway resolver slash-form acceptance + per-model
maxTokens cap.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

CLAUDE.md: add v0.41.21.0 annotations to brainstorm/judges + model-config
entries; add new key-files entry for src/core/model-id.ts (the shared
splitProviderModelId centralizer) and src/core/ai/model-resolver.ts
slash-form extension.

README.md: add user-facing callout for the brainstorm judge_failed +
slash-form pricing fix, mirroring the v0.41.19.0 callout shape.

llms-full.txt: regenerated to absorb the CLAUDE.md + README changes
(passes test/build-llms.test.ts drift guard).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Master claimed both v0.41.21.0 (ops-fix-wave) and v0.41.22.0
(type-unification cathedral) while this branch was shipping.
Rebumped from v0.41.21.0 → v0.41.22.1 across VERSION, package.json,
CHANGELOG header + in-body refs, and TODOS section header + body refs.
No code changes — only the version metadata.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

@garrytan garrytan changed the title from "v0.41.21.0 feat: brainstorm/lsd judge fixes (closes #1540 end-to-end)" to "v0.41.22.1 feat: brainstorm/lsd judge fixes (closes #1540 end-to-end)" May 27, 2026
@garrytan garrytan merged commit 127842e into master May 27, 2026
21 checks passed
mgunnin added a commit to mgunnin/gbrain that referenced this pull request May 28, 2026
* upstream/master:
  v0.41.26.1 fix: lock-renewal cathedral — closes ~39 worker crashes/day (supersedes garrytan#1567) (garrytan#1572)
  v0.41.26.0 fix: dream --source + ingest junk titles + emoji-crash (supersedes garrytan#1559, garrytan#1561) (garrytan#1571)
  v0.41.25.0 perf(sync): batched deletes + global page-generation clock (supersedes garrytan#1538) (garrytan#1566)
  v0.41.24.0 fix(conversation-parser): threshold gates + bold-paren-time pattern — 20,167 Circleback messages unblocked (closes garrytan#1533) (garrytan#1543)
  v0.41.23.0 feat: extract operator surfaces + pack-driven extractables (garrytan#1541)
  v0.41.22.1 feat: brainstorm/lsd judge fixes (closes garrytan#1540 end-to-end) (garrytan#1562)
  v0.41.22.0 feat: type-unification cathedral — 94 types → 15 canonical (closes garrytan#1479) (garrytan#1542)
  v0.41.21.0 feat(ops): 5 daily-driver pains fixed in one wave (garrytan#1545)
  v0.41.20.0 feat: gbrain status + doctor --scope=brain (fix wave 2: items garrytan#6 + garrytan#7) (garrytan#1544)
  feat: v0.41.19.0 Supavisor Retry Cathedral (garrytan#1537)
  v0.41.18.0: gbrain onboard — the activation surface gbrain didn't have before (garrytan#1521)
  v0.41.17.0 feat: --workers N on every bulk command + facts dim doctor parity (garrytan#1519)
  v0.41.16.0 feat: conversation parser cathedral + progressive-batch primitive (closes garrytan#1461) (garrytan#1510)
  v0.41.15.0 feat(sync): --timeout + --max-age + partial status (closes garrytan#1472 RFC) (garrytan#1506)