Problem or Use Case
The GLM 5 series (and to a lesser degree GLM 4.7) has been trained with a 'preserved thinking' mode for agentic use.
In that mode, the interleaved reasoning traces between tool calls are retained and serve as short-term memory, so that intent persists across long tool-call chains. It is supposed to significantly improve long-session capabilities. It does increase (cached) input tokens and context use, but if it reduces the number of tool calls a task needs, it's likely worthwhile.
While this feature is intrinsic to the model, I'd only activate it on inference providers that publicize support for it; on providers that don't, it may hurt the prefix-cache hit rate.
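To make the difference concrete, here is a minimal sketch of how the next request's history could be built with and without preserved thinking. The `reasoning_content` field name, the message shape, and the `build_history` helper are assumptions for illustration, not a confirmed provider or client API.

```python
from typing import Any

# Assumed field name for the reasoning trace on assistant messages.
REASONING_KEY = "reasoning_content"


def build_history(
    messages: list[dict[str, Any]], preserve_thinking: bool
) -> list[dict[str, Any]]:
    """Return the messages to send on the next turn (hypothetical helper).

    With preserve_thinking=True, reasoning traces between tool calls are
    kept as-is. Otherwise they are stripped from every message.
    """
    if preserve_thinking:
        # Shallow-copy each message so the caller's list is not mutated later.
        return [dict(m) for m in messages]
    return [
        {k: v for k, v in m.items() if k != REASONING_KEY} for m in messages
    ]
```

With preservation on, earlier messages are resent unchanged, which is why the extra input tokens should mostly be cached on a provider that supports the mode.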
Proposed Solution
Enable preserved thinking by default on supported models when they are served directly by z.ai.
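A rough sketch of what that default could look like follows. The provider key `"zai"`, the model ID prefixes, and the function name are placeholders I made up for illustration; the real identifiers depend on how the project names providers and models.

```python
# Hypothetical identifiers; adjust to the project's actual provider/model naming.
PRESERVED_THINKING_PROVIDERS = {"zai"}
PRESERVED_THINKING_MODEL_PREFIXES = ("glm-5", "glm-4.7")


def default_preserve_thinking(provider: str, model: str) -> bool:
    """Default to preserved thinking only for supported models on z.ai itself."""
    return (
        provider.lower() in PRESERVED_THINKING_PROVIDERS
        and model.lower().startswith(PRESERVED_THINKING_MODEL_PREFIXES)
    )
```

Any other provider falls back to the current behavior, so the prefix-cache concern above does not apply unless a provider is explicitly added to the allowlist.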
Alternatives Considered
No response
Feature Type
Performance / reliability
Scope
Small (single file, < 50 lines)
Contribution
Debug Report (optional)