fix(visualize): unblock Gemini 2.5+ and harden Visualize pipeline by skinred78 · Pull Request #490 · HKUDS/DeepTutor

skinred78 · 2026-05-17T07:37:17Z

Fixes #489.

Summary

Six bugs that combined to break the Visualize capability on Gemini 2.5 Flash (and similar thinking-by-default models). Each is independently useful, but the user-visible symptom — Visualize output randomly truncated at ~370 chars — needs all of them.

1. Gemini 2.5/3.x reasoning-tokens default (root cause)

Gemini 2.5+ models burn most of max_tokens on internal "thinking" tokens by default. With max_tokens=4096, ~3900 tokens went to reasoning and ~160 came out as visible content, causing finish_reason=length on every multi-step pipeline (Visualize codegen + review, Deep Solve writing, etc.). See #489 for the reproducer with curl directly against Gemini's OpenAI-compat endpoint.

Default reasoning_effort="none" for Gemini 2.5/3.x when the caller doesn't specify, in all three execution paths:

provider_core/openai_compat_provider.py:_build_kwargs (live path)
executors.py:sdk_complete / sdk_stream (SDK path)
cloud_provider.py:_openai_complete / _openai_stream (aiohttp fallback)

Callers that want thinking can still opt in via explicit reasoning_effort.

2. `visualize` capability had no `agents.yaml` entry

get_agent_params("visualize") silently fell through to the 4096-token default because there was no entry in DEFAULT_AGENTS_SETTINGS or loader.py:section_map. Added both, with a 16384-token budget appropriate for full HTML pages.

3. Review stage crashed hard on JSON parse failure

ReviewAgent.process does ReviewResult.model_validate(extract_json_object(response)). When the model returned prose instead of JSON (common with large SVGs that don't escape cleanly into a JSON string field), the parse raised and killed the entire Visualize turn — the user saw the streamed SVG draft and then a traceback. Wrapped pipeline.run_review() in try/except so a parse failure falls back to the unreviewed draft and the user still gets a rendered result.

4. Codegen output not trimmed to the root tag

Models often wrap SVG/HTML in prose ("Here you go: … Enjoy!") or emit a closing code fence on the same line as </html>, which extract_code_block's regex (requiring a leading \n before the fence) doesn't strip. Added defensive root-tag trimming for render_type=="svg" and render_type=="html" so the renderer always sees a clean payload.

Test plan

Verified end-to-end on Gemini 2.5 Flash via the CLI and via headless Playwright:

Long-division HTML test prompt: 22 KB self-contained interactive page (was: 0 bytes / crash on dev).
All step prompts read correctly (e.g. "How many whole times does 6 go into 7?" for the first digit).
Walking the full 7852 ÷ 6 algorithm step by step terminates with Quotient: 1308, Remainder: 4.
Wrong-answer retry preserves prior progress; 3 wrong attempts reveals the answer and auto-advances (graceful-fallback path exercised).
Simple SVG (e.g. "draw 8 cookies") still renders unchanged — no regression on the happy path.
The defensive root-tag trim is a no-op when codegen already returns a clean tag.
Maintainer: please run pre-commit run --all-files and CI as a sanity check; my local environment doesn't have all the lint deps configured.

Six bugs that combined to break the Visualize capability on Gemini 2.5 Flash (and similar thinking-by-default models). Each is independently useful, but the user-visible symptom — Visualize "kind of works but output is randomly truncated at ~370 chars" — needs all of them. 1. Gemini 2.5/3.x reasoning-tokens default (root cause) Gemini 2.5+ models burn most of `max_tokens` on internal "thinking" tokens by default. With `max_tokens=4096`, ~3900 went to reasoning and only ~160 came out as actual content, causing finish_reason=length on every multi-step pipeline (Visualize codegen + review, Deep Solve, anything that asks for a structured output beyond a sentence). Default `reasoning_effort="none"` for Gemini 2.5/3.x models when the caller doesn't specify, in all three execution paths: - provider_core/openai_compat_provider.py:_build_kwargs (live path) - executors.py:sdk_complete / sdk_stream (legacy SDK path) - cloud_provider.py:_openai_complete / _openai_stream (aiohttp fallback) 2. visualize capability had no agents.yaml entry `get_agent_params("visualize")` silently fell through to the 4096 default because there was no section_map entry and no DEFAULT_AGENTS_SETTINGS entry. Added both, with a 16384-token budget appropriate for full HTML pages. 3. Review stage crashed hard on JSON parse failure `ReviewAgent.process` does `ReviewResult.model_validate(extract_json_object(response))`. When the model returned prose instead of JSON (common with large SVGs that the model can't escape into a JSON string), the parse raised and killed the entire turn. Wrapped pipeline.run_review() in try/except so review failure falls back to the unreviewed draft and the user still gets a rendered result. 4. Codegen output not trimmed to the root tag Models often wrap SVG/HTML in prose ("Here you go: <svg>…</svg> Enjoy!") or emit a closing code fence on the same line as `</html>`, which `extract_code_block`'s regex (requiring a leading \n before the fence) doesn't strip. Added defensive root-tag trimming for render_type=="svg" and render_type=="html". Verified end-to-end on Gemini 2.5 Flash via the CLI and headless Playwright: full 22 KB long-division HTML page, no truncation, all interactive elements present, multi-step walkthrough completes correctly (7852 ÷ 6 → 1308 R 4).

pancacake · 2026-05-28T09:45:17Z

Thanks for your contribution!

Follow-up to #490. The PR inlined the "disable thinking for Gemini 2.5/3" gate in 5 places across 3 files. This commit collapses them to one registry and three thin call sites. Changes: * `services/llm/reasoning_params.py`: new `_PROVIDER_DEFAULT_OFF_PATTERNS` registry + `default_reasoning_effort_for(provider, model)` public helper. `build_openai_compatible_reasoning_kwargs` now consults the registry, so the openai-compat path (which lost its inline gate during the #490 merge conflict resolution) is restored via the helper. * `services/llm/executors.py` (sdk_complete + sdk_stream) and `services/llm/cloud_provider.py` (_openai_complete + _openai_stream) now call `default_reasoning_effort_for(...)` instead of inlining the ('gemini-2.5', 'gemini-3') startswith check. * Use substring (not startswith) match so `models/gemini-2.5-flash` is also covered — some OpenAI-compat clients prefix model ids with `models/`. * `services/config/loader.py:get_agent_params`: when a module's section is missing from the user's stale `agents.yaml`, fall back through `DEFAULT_AGENTS_SETTINGS` before the global `(0.5, 4096)` default. This lets the `capabilities.visualize` default (`max_tokens=16384` from #490) reach existing installs, not just fresh ones. * `capabilities/visualize.py`: hoist the duplicated lazy `from deeptutor.agents.visualize.models import ReviewResult` import from two branches into the top of `run()`. Tests: * `tests/services/llm/test_reasoning_params.py` — 17 new cases covering Gemini 2.5/3 + `models/` prefix + case-insensitivity + the legacy Gemini 1.5/2.0 / other-provider untouched paths + the explicit-override takes-precedence rule. * All 107 tests in `tests/services/llm/` still pass; 5 pre-existing `test_chat_params_config` failures (8192 vs 8000 drift) are unrelated. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

pancacake merged commit 44564b0 into HKUDS:dev May 28, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(visualize): unblock Gemini 2.5+ and harden Visualize pipeline#490

fix(visualize): unblock Gemini 2.5+ and harden Visualize pipeline#490
pancacake merged 1 commit into
HKUDS:devfrom
skinred78:fix/gemini-2.5-thinking-tokens

skinred78 commented May 17, 2026

Uh oh!

pancacake commented May 28, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

skinred78 commented May 17, 2026

Summary

1. Gemini 2.5/3.x reasoning-tokens default (root cause)

2. visualize capability had no agents.yaml entry

3. Review stage crashed hard on JSON parse failure

4. Codegen output not trimmed to the root tag

Test plan

Uh oh!

pancacake commented May 28, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

2. `visualize` capability had no `agents.yaml` entry