Skip to content

[Bug]: Visualize (and other capabilities) silently truncate on Gemini 2.5 / 3.x due to default thinking tokens #489

@skinred78

Description

@skinred78

Summary

Gemini 2.5 / 3.x models ship with thinking enabled by default. When DeepTutor sends max_tokens=4096 (the Visualize default) or even larger budgets, Gemini spends most of that budget on internal reasoning tokens and the model returns truncated output with finish_reason=length — often before any visible content has streamed. Pipelines that need structured output (Visualize codegen + review, Deep Solve writing, etc.) appear to half-produce results and either crash downstream parsers or render unusable artifacts.

Repro

  1. Configure DeepTutor with LLM_BINDING=gemini, LLM_MODEL=gemini-2.5-flash (any Gemini 2.5/3.x flash-tier model reproduces).
  2. In the chat UI, switch the mode to Visualize and pick any non-trivial prompt — e.g.

    Build an interactive long-division tutor for Year 7 that teaches the algorithm one digit at a time. Use the problem 7852 / 6. Walk me through divide / multiply / subtract / bring down for each digit with input boxes and a Check button.

  3. Observe: stream halts mid-output (HTML cut off mid-CSS or mid-script, SVG cut off mid-tag). Codegen returns ~300-2000 chars and then stops.

Reproducible against the Gemini OpenAI-compat endpoint independently:

curl -s 'https://generativelanguage.googleapis.com/v1beta/openai/chat/completions' \
  -H \"Authorization: Bearer \$KEY\" -H 'Content-Type: application/json' \
  -d '{\"model\":\"gemini-2.5-flash\",\"stream\":false,\"max_tokens\":4096,
       \"messages\":[{\"role\":\"user\",
       \"content\":\"Write a 50-line SVG of cookies, just the SVG.\"}]}' \
  | jq '{finish:.choices[0].finish_reason, len:(.choices[0].message.content|length), usage}'

Result with thinking enabled (default):
```
{ "finish": "length", "len": 360, "usage": {"prompt_tokens":24, "completion_tokens":164, "total_tokens":4116} }
```

Result with reasoning_effort: \"none\" added to the request:
```
{ "finish": "stop", "len": 2594, "usage": {"prompt_tokens":24, "completion_tokens":1426, "total_tokens":1450} }
```

total_tokens - prompt - completion in the first call (~3900 tokens) is exactly the missing reasoning-token budget that ate the response.

Why DeepTutor doesn't currently mitigate this

OpenAICompatProvider._build_kwargs (the live path) only auto-injects reasoning_effort=\"high\" when a model matches spec.reasoning_model_patterns. The Gemini ProviderSpec doesn't set those patterns. There's no default-down behavior for thinking-by-default models, so they silently consume the budget.

Also contributing (already filed as related smaller bugs but worth noting):

  • The Visualize capability has no entry in agents.yaml / DEFAULT_AGENTS_SETTINGS / loader.py:section_map, so it silently uses the 4096-token default instead of a higher budget appropriate for HTML pages.
  • ReviewAgent.process raises a JSONDecodeError that propagates up and kills the whole Visualize turn when the model returns prose instead of strict JSON (downstream consequence of the truncation above).

Proposed fix

Disable thinking by default for Gemini 2.5/3.x in the three execution paths (openai_compat_provider, executors, cloud_provider) — caller can still opt in via explicit reasoning_effort. Plus the three smaller adjacent fixes (agents.yaml entry, review-stage graceful fallback, codegen tag-trim).

PR coming.

Environment

  • DeepTutor: current dev branch (cade789)
  • LLM_BINDING=gemini, LLM_MODEL=gemini-2.5-flash
  • Setup Tour install on macOS

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions