Summary
Gemini 2.5 / 3.x models ship with thinking enabled by default. When DeepTutor sends max_tokens=4096 (the Visualize default) or even larger budgets, Gemini spends most of that budget on internal reasoning tokens and the model returns truncated output with finish_reason=length — often before any visible content has streamed. Pipelines that need structured output (Visualize codegen + review, Deep Solve writing, etc.) appear to half-produce results and either crash downstream parsers or render unusable artifacts.
Repro
- Configure DeepTutor with
LLM_BINDING=gemini, LLM_MODEL=gemini-2.5-flash (any Gemini 2.5/3.x flash-tier model reproduces).
- In the chat UI, switch the mode to Visualize and pick any non-trivial prompt — e.g.
Build an interactive long-division tutor for Year 7 that teaches the algorithm one digit at a time. Use the problem 7852 / 6. Walk me through divide / multiply / subtract / bring down for each digit with input boxes and a Check button.
- Observe: stream halts mid-output (HTML cut off mid-CSS or mid-script, SVG cut off mid-tag). Codegen returns ~300-2000 chars and then stops.
Reproducible against the Gemini OpenAI-compat endpoint independently:
curl -s 'https://generativelanguage.googleapis.com/v1beta/openai/chat/completions' \
-H \"Authorization: Bearer \$KEY\" -H 'Content-Type: application/json' \
-d '{\"model\":\"gemini-2.5-flash\",\"stream\":false,\"max_tokens\":4096,
\"messages\":[{\"role\":\"user\",
\"content\":\"Write a 50-line SVG of cookies, just the SVG.\"}]}' \
| jq '{finish:.choices[0].finish_reason, len:(.choices[0].message.content|length), usage}'
Result with thinking enabled (default):
```
{ "finish": "length", "len": 360, "usage": {"prompt_tokens":24, "completion_tokens":164, "total_tokens":4116} }
```
Result with reasoning_effort: \"none\" added to the request:
```
{ "finish": "stop", "len": 2594, "usage": {"prompt_tokens":24, "completion_tokens":1426, "total_tokens":1450} }
```
total_tokens - prompt - completion in the first call (~3900 tokens) is exactly the missing reasoning-token budget that ate the response.
Why DeepTutor doesn't currently mitigate this
OpenAICompatProvider._build_kwargs (the live path) only auto-injects reasoning_effort=\"high\" when a model matches spec.reasoning_model_patterns. The Gemini ProviderSpec doesn't set those patterns. There's no default-down behavior for thinking-by-default models, so they silently consume the budget.
Also contributing (already filed as related smaller bugs but worth noting):
- The Visualize capability has no entry in
agents.yaml / DEFAULT_AGENTS_SETTINGS / loader.py:section_map, so it silently uses the 4096-token default instead of a higher budget appropriate for HTML pages.
ReviewAgent.process raises a JSONDecodeError that propagates up and kills the whole Visualize turn when the model returns prose instead of strict JSON (downstream consequence of the truncation above).
Proposed fix
Disable thinking by default for Gemini 2.5/3.x in the three execution paths (openai_compat_provider, executors, cloud_provider) — caller can still opt in via explicit reasoning_effort. Plus the three smaller adjacent fixes (agents.yaml entry, review-stage graceful fallback, codegen tag-trim).
PR coming.
Environment
- DeepTutor: current
dev branch (cade789)
- LLM_BINDING=gemini, LLM_MODEL=gemini-2.5-flash
- Setup Tour install on macOS
Summary
Gemini 2.5 / 3.x models ship with thinking enabled by default. When DeepTutor sends
max_tokens=4096(the Visualize default) or even larger budgets, Gemini spends most of that budget on internal reasoning tokens and the model returns truncated output withfinish_reason=length— often before any visible content has streamed. Pipelines that need structured output (Visualize codegen + review, Deep Solve writing, etc.) appear to half-produce results and either crash downstream parsers or render unusable artifacts.Repro
LLM_BINDING=gemini,LLM_MODEL=gemini-2.5-flash(any Gemini 2.5/3.x flash-tier model reproduces).Reproducible against the Gemini OpenAI-compat endpoint independently:
Result with thinking enabled (default):
```
{ "finish": "length", "len": 360, "usage": {"prompt_tokens":24, "completion_tokens":164, "total_tokens":4116} }
```
Result with
reasoning_effort: \"none\"added to the request:```
{ "finish": "stop", "len": 2594, "usage": {"prompt_tokens":24, "completion_tokens":1426, "total_tokens":1450} }
```
total_tokens - prompt - completionin the first call (~3900 tokens) is exactly the missing reasoning-token budget that ate the response.Why DeepTutor doesn't currently mitigate this
OpenAICompatProvider._build_kwargs(the live path) only auto-injectsreasoning_effort=\"high\"when a model matchesspec.reasoning_model_patterns. The GeminiProviderSpecdoesn't set those patterns. There's no default-down behavior for thinking-by-default models, so they silently consume the budget.Also contributing (already filed as related smaller bugs but worth noting):
agents.yaml/DEFAULT_AGENTS_SETTINGS/loader.py:section_map, so it silently uses the 4096-token default instead of a higher budget appropriate for HTML pages.ReviewAgent.processraises a JSONDecodeError that propagates up and kills the whole Visualize turn when the model returns prose instead of strict JSON (downstream consequence of the truncation above).Proposed fix
Disable thinking by default for Gemini 2.5/3.x in the three execution paths (
openai_compat_provider,executors,cloud_provider) — caller can still opt in via explicitreasoning_effort. Plus the three smaller adjacent fixes (agents.yaml entry, review-stage graceful fallback, codegen tag-trim).PR coming.
Environment
devbranch (cade789)