Skip to content

fix(visualize): unblock Gemini 2.5+ and harden Visualize pipeline#490

Merged
pancacake merged 1 commit into
HKUDS:devfrom
skinred78:fix/gemini-2.5-thinking-tokens
May 28, 2026
Merged

fix(visualize): unblock Gemini 2.5+ and harden Visualize pipeline#490
pancacake merged 1 commit into
HKUDS:devfrom
skinred78:fix/gemini-2.5-thinking-tokens

Conversation

@skinred78

Copy link
Copy Markdown

Fixes #489.

Summary

Six bugs that combined to break the Visualize capability on Gemini 2.5 Flash (and similar thinking-by-default models). Each is independently useful, but the user-visible symptom — Visualize output randomly truncated at ~370 chars — needs all of them.

1. Gemini 2.5/3.x reasoning-tokens default (root cause)

Gemini 2.5+ models burn most of max_tokens on internal "thinking" tokens by default. With max_tokens=4096, ~3900 tokens went to reasoning and ~160 came out as visible content, causing finish_reason=length on every multi-step pipeline (Visualize codegen + review, Deep Solve writing, etc.). See #489 for the reproducer with curl directly against Gemini's OpenAI-compat endpoint.

Default reasoning_effort="none" for Gemini 2.5/3.x when the caller doesn't specify, in all three execution paths:

  • provider_core/openai_compat_provider.py:_build_kwargs (live path)
  • executors.py:sdk_complete / sdk_stream (SDK path)
  • cloud_provider.py:_openai_complete / _openai_stream (aiohttp fallback)

Callers that want thinking can still opt in via explicit reasoning_effort.

2. visualize capability had no agents.yaml entry

get_agent_params("visualize") silently fell through to the 4096-token default because there was no entry in DEFAULT_AGENTS_SETTINGS or loader.py:section_map. Added both, with a 16384-token budget appropriate for full HTML pages.

3. Review stage crashed hard on JSON parse failure

ReviewAgent.process does ReviewResult.model_validate(extract_json_object(response)). When the model returned prose instead of JSON (common with large SVGs that don't escape cleanly into a JSON string field), the parse raised and killed the entire Visualize turn — the user saw the streamed SVG draft and then a traceback. Wrapped pipeline.run_review() in try/except so a parse failure falls back to the unreviewed draft and the user still gets a rendered result.

4. Codegen output not trimmed to the root tag

Models often wrap SVG/HTML in prose ("Here you go: … Enjoy!") or emit a closing code fence on the same line as </html>, which extract_code_block's regex (requiring a leading \n before the fence) doesn't strip. Added defensive root-tag trimming for render_type=="svg" and render_type=="html" so the renderer always sees a clean payload.

Test plan

Verified end-to-end on Gemini 2.5 Flash via the CLI and via headless Playwright:

  • Long-division HTML test prompt: 22 KB self-contained interactive page (was: 0 bytes / crash on dev).

  • All step prompts read correctly (e.g. "How many whole times does 6 go into 7?" for the first digit).

  • Walking the full 7852 ÷ 6 algorithm step by step terminates with Quotient: 1308, Remainder: 4.

  • Wrong-answer retry preserves prior progress; 3 wrong attempts reveals the answer and auto-advances (graceful-fallback path exercised).

  • Simple SVG (e.g. "draw 8 cookies") still renders unchanged — no regression on the happy path.

  • The defensive root-tag trim is a no-op when codegen already returns a clean tag.

  • Maintainer: please run pre-commit run --all-files and CI as a sanity check; my local environment doesn't have all the lint deps configured.

Six bugs that combined to break the Visualize capability on Gemini 2.5
Flash (and similar thinking-by-default models). Each is independently
useful, but the user-visible symptom — Visualize "kind of works but
output is randomly truncated at ~370 chars" — needs all of them.

1. Gemini 2.5/3.x reasoning-tokens default (root cause)

   Gemini 2.5+ models burn most of `max_tokens` on internal "thinking"
   tokens by default. With `max_tokens=4096`, ~3900 went to reasoning
   and only ~160 came out as actual content, causing finish_reason=length
   on every multi-step pipeline (Visualize codegen + review, Deep Solve,
   anything that asks for a structured output beyond a sentence).

   Default `reasoning_effort="none"` for Gemini 2.5/3.x models when the
   caller doesn't specify, in all three execution paths:
   - provider_core/openai_compat_provider.py:_build_kwargs (live path)
   - executors.py:sdk_complete / sdk_stream (legacy SDK path)
   - cloud_provider.py:_openai_complete / _openai_stream (aiohttp fallback)

2. visualize capability had no agents.yaml entry

   `get_agent_params("visualize")` silently fell through to the 4096
   default because there was no section_map entry and no
   DEFAULT_AGENTS_SETTINGS entry. Added both, with a 16384-token budget
   appropriate for full HTML pages.

3. Review stage crashed hard on JSON parse failure

   `ReviewAgent.process` does `ReviewResult.model_validate(extract_json_object(response))`.
   When the model returned prose instead of JSON (common with large SVGs
   that the model can't escape into a JSON string), the parse raised and
   killed the entire turn. Wrapped pipeline.run_review() in try/except
   so review failure falls back to the unreviewed draft and the user
   still gets a rendered result.

4. Codegen output not trimmed to the root tag

   Models often wrap SVG/HTML in prose ("Here you go: <svg>…</svg>
   Enjoy!") or emit a closing code fence on the same line as `</html>`,
   which `extract_code_block`'s regex (requiring a leading \n before
   the fence) doesn't strip. Added defensive root-tag trimming for
   render_type=="svg" and render_type=="html".

Verified end-to-end on Gemini 2.5 Flash via the CLI and headless
Playwright: full 22 KB long-division HTML page, no truncation, all
interactive elements present, multi-step walkthrough completes
correctly (7852 ÷ 6 → 1308 R 4).
@pancacake

Copy link
Copy Markdown
Collaborator

Thanks for your contribution!

@pancacake pancacake merged commit 44564b0 into HKUDS:dev May 28, 2026
pancacake added a commit that referenced this pull request May 28, 2026
Follow-up to #490. The PR inlined the "disable thinking for Gemini 2.5/3"
gate in 5 places across 3 files. This commit collapses them to one
registry and three thin call sites.

Changes:

* `services/llm/reasoning_params.py`: new `_PROVIDER_DEFAULT_OFF_PATTERNS`
  registry + `default_reasoning_effort_for(provider, model)` public helper.
  `build_openai_compatible_reasoning_kwargs` now consults the registry, so
  the openai-compat path (which lost its inline gate during the #490 merge
  conflict resolution) is restored via the helper.

* `services/llm/executors.py` (sdk_complete + sdk_stream) and
  `services/llm/cloud_provider.py` (_openai_complete + _openai_stream) now
  call `default_reasoning_effort_for(...)` instead of inlining the
  ('gemini-2.5', 'gemini-3') startswith check.

* Use substring (not startswith) match so `models/gemini-2.5-flash` is
  also covered — some OpenAI-compat clients prefix model ids with `models/`.

* `services/config/loader.py:get_agent_params`: when a module's section is
  missing from the user's stale `agents.yaml`, fall back through
  `DEFAULT_AGENTS_SETTINGS` before the global `(0.5, 4096)` default. This
  lets the `capabilities.visualize` default (`max_tokens=16384` from #490)
  reach existing installs, not just fresh ones.

* `capabilities/visualize.py`: hoist the duplicated lazy
  `from deeptutor.agents.visualize.models import ReviewResult` import
  from two branches into the top of `run()`.

Tests:

* `tests/services/llm/test_reasoning_params.py` — 17 new cases covering
  Gemini 2.5/3 + `models/` prefix + case-insensitivity + the legacy
  Gemini 1.5/2.0 / other-provider untouched paths + the explicit-override
  takes-precedence rule.
* All 107 tests in `tests/services/llm/` still pass; 5 pre-existing
  `test_chat_params_config` failures (8192 vs 8000 drift) are unrelated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants