Logging cleanup#715
Conversation
There was a problem hiding this comment.
Code Review
This pull request disables the KV cache across all configuration objects, including nested sub-configs, when gradient checkpointing is enabled. The review feedback suggests removing the redundant try-except block around the import of PretrainedConfig, as transformers is a required dependency, and simplifying the configuration traversal.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| try: | ||
| from transformers import PretrainedConfig | ||
| except Exception: | ||
| PretrainedConfig = None | ||
| _seen = set() | ||
| _stack = [model.config] | ||
| while _stack: | ||
| _cfg = _stack.pop() | ||
| if _cfg is None or id(_cfg) in _seen: | ||
| continue | ||
| _seen.add(id(_cfg)) | ||
| if getattr(_cfg, "use_cache", None): | ||
| _cfg.use_cache = False | ||
| if PretrainedConfig is not None: | ||
| for _sub in vars(_cfg).values(): | ||
| if isinstance(_sub, PretrainedConfig): | ||
| _stack.append(_sub) |
There was a problem hiding this comment.
Since transformers is a required dependency of the project, wrapping its import in a try...except block is redundant. We can import PretrainedConfig directly. Additionally, we can simplify the traversal and use hasattr to safely disable use_cache.
| try: | |
| from transformers import PretrainedConfig | |
| except Exception: | |
| PretrainedConfig = None | |
| _seen = set() | |
| _stack = [model.config] | |
| while _stack: | |
| _cfg = _stack.pop() | |
| if _cfg is None or id(_cfg) in _seen: | |
| continue | |
| _seen.add(id(_cfg)) | |
| if getattr(_cfg, "use_cache", None): | |
| _cfg.use_cache = False | |
| if PretrainedConfig is not None: | |
| for _sub in vars(_cfg).values(): | |
| if isinstance(_sub, PretrainedConfig): | |
| _stack.append(_sub) | |
| from transformers import PretrainedConfig | |
| _seen = set() | |
| _stack = [model.config] | |
| while _stack: | |
| _cfg = _stack.pop() | |
| if _cfg is None or id(_cfg) in _seen: | |
| continue | |
| _seen.add(id(_cfg)) | |
| if hasattr(_cfg, "use_cache"): | |
| _cfg.use_cache = False | |
| for _sub in vars(_cfg).values(): | |
| if isinstance(_sub, PretrainedConfig): | |
| _stack.append(_sub) |
References
- Avoid using try...except ImportError for libraries that are required dependencies of the project, as the check is redundant.
There was a problem hiding this comment.
Switched the import to the PreTrainedConfig first pattern used elsewhere in the repo in 85a07af, which also removes the redundant guard. Kept the truthy getattr check rather than hasattr on purpose: hasattr would coerce use_cache=None to False, and None means defer to the model default, so flipping it would be a semantic change. This is pinned by tests/test_training_utils_use_cache.py::test_falsy_use_cache_preserved.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 7b23cbf0f9
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| # KV cache is unused under gradient checkpointing; disable it on every config. | ||
| if use_gradient_checkpointing in (True, "unsloth") and getattr(model, "config", None) is not None: | ||
| try: | ||
| from transformers import PretrainedConfig |
There was a problem hiding this comment.
Fall back to PreTrainedConfig for nested cache disabling
In Transformers 4.57+/5.x environments supported by this repo, the config class may be exported as PreTrainedConfig rather than PretrainedConfig (the existing helpers already use that fallback). With only this legacy import, the except path sets PretrainedConfig = None, so the loop never descends into composite configs such as text_config; those nested decoder configs can keep use_cache=True under gradient checkpointing, leaving the warning/cache behavior this block is meant to suppress for VLM/composite models.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Fixed in 85a07af. The block now tries PreTrainedConfig first and falls back to the legacy PretrainedConfig name, matching hf_utils.py, patching_utils.py and empty_model.py, and the None fallback that silently skipped nested traversal is gone. Added tests/test_training_utils_use_cache.py which covers the nested composite config case (Gemma3Config text_config) on CPU.
Align the gradient checkpointing use_cache block with the repo's PreTrainedConfig-first import convention (hf_utils, patching_utils, empty_model) and drop the dead None fallback that skipped nested config traversal. Add CPU tests pinning the contract: top-level and nested composite configs flip to False under gradient checkpointing, None/False values are preserved, non-config attachments are ignored, and self-referencing config graphs terminate.
|
Validation summary for this PR (now includes 85a07af: PreTrainedConfig-first import plus CPU tests). What was checked
Loss parity, 61 steps, LoRA 4 bit, bsz 2 x ga 3, seed 3407, B200
Gemma losses and grad norms are bit identical across all 61 steps; even the nan grad norm log positions match (steps 3, 7, 10, 13, 18, 39 on both, a pre-existing logging artifact in 4 bit gemma, not from this PR). The Qwen deltas (max 0.004) are within same-install run to run noise (two identical main runs already differ by 0.001 at step 3). Config snapshots confirm Tests
Known trade-off The flipped flag persists after training: |
| from types import SimpleNamespace | ||
|
|
||
| import pytest | ||
| import torch |
Factor the gradient checkpointing use_cache walk into disable_use_cache and remember which configs were flipped on the model the first time, so restore_use_cache can put the original values back when switching to inference. Falsy values (None/False) are never recorded, so restore cannot invent True on configs that never had it. The record survives restore so for_training can re-disable without re-recording. Both helpers are exported for unsloth's for_inference / for_training to call; older unsloth versions simply never call them, keeping behavior unchanged.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 517249691d
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if use_gradient_checkpointing in (True, "unsloth"): | ||
| disable_use_cache(model) |
There was a problem hiding this comment.
Restore cache when checkpointing is disabled
When the same model is first prepared with gradient checkpointing and later passed through prepare_model_for_training(..., use_gradient_checkpointing=False), the false path disables the module checkpointing flags above but this guard skips both disable_use_cache and restore_use_cache, leaving every previously touched config at use_cache=False. That makes subsequent non-checkpointed training or inference/generation on the same model run without KV cache even though checkpointing was explicitly turned off; call restore_use_cache(model) in the non-checkpointing path to undo the mutation recorded by the earlier prepare call.
Useful? React with 👍 / 👎.
|
Update: the inference trade-off noted above is now resolved instead of documented. 5172496 factors the walk into disable_use_cache / restore_use_cache (both exported). The first disable records the original truthy values on the model; restore_use_cache puts them back, and the record survives so a later disable flips them again without re-recording. None/False values are never recorded, so a restore cannot invent True on configs that never had it. Test coverage grew to 14 CPU cases including the full disable -> restore -> disable cycle, nested composite restore, and double-prepare semantics. unslothai/unsloth#6137 wires the unsloth side: for_inference restores, for_training re-disables when a record exists, both behind lazy imports so any version pairing of unsloth and unsloth_zoo degrades to current behavior. GPU round trip on a B200 (zoo at this PR merged with main, unsloth at #6137): Qwen2.5-0.5B kbit path and gemma-3-4b-it vision path both cycle False -> True -> False -> True across for_inference / for_training, generate works after restore, and the nested text_config follows correctly. |
| if record and originals: | ||
| try: | ||
| model._unsloth_use_cache_originals = originals | ||
| except Exception: |
…gate (#755) test_mlx_finetune_last_n_layers was born broken in #669 and stayed invisible until #739 because no CI job executed it: the version matrix only collects, the macOS MLX job runs the shim smoke test alone, and the zoo-specific CPU list does not include it. Add a small hard-gate step in repo-tests-cpu running it together with test_training_utils_use_cache (the use_cache disable/restore contract from #715). Both files are CPU-pure and run in under a second, and the job already installs the deps they need.
Logging cleanup