[Studio][Optimization]Add vision detection cache to is_vision_model() by danielhanchen · Pull Request #15 · unslothai/unsloth-staging-1

danielhanchen · 2026-04-06T11:54:58Z

Staging mirror of unslothai#4853

Original PR: unslothai#4853
Author: rolandtannous

This is a staging copy for review and editing. Once finalized, changes will be pushed back to the original PR.

Original description

Summary

Adds a module-level _vision_detection_cache dict to is_vision_model() following the exact same pattern as the existing _audio_detection_cache used by detect_audio_type()
Refactors is_vision_model() into a thin cache wrapper around _is_vision_model_uncached() — detection logic is completely unchanged
Eliminates ~12s of wasted time per training run from redundant subprocess spawns and HuggingFace API calls (applies to all models, not just transformers 5.x)

Test plan

Added 11 pytest tests in test_vision_cache.py covering cache hits, misses, False caching, subprocess path, exception handling, direct detection path, audio exclusion, and token handling
All 11 new tests pass
All existing backend tests unaffected (276 passed, 4 skipped, 2 pre-existing failures unrelated to this change)
Run training on a transformers 5.x model (e.g. unsloth/Qwen3.5-2B) — verify "checking vision via subprocess" appears once per process instead of 2+
Run training on a non-transformers-5 model — verify direct path still works and caches correctly

…bprocess spawns is_vision_model() is called 4-5 times per training run for the same model with zero caching. For transformers 5.x models, each call spawns a full subprocess (~6s each). This adds a module-level _vision_detection_cache dict following the same pattern as the existing _audio_detection_cache used by detect_audio_type(). The function is refactored into a thin cache wrapper around _is_vision_model_uncached(), saving ~12s per training run.

for more information, see https://pre-commit.ci

Cache key is now (model_name, hf_token) instead of just model_name. This prevents stale False results when an unauthenticated probe for a gated model is followed by an authenticated call.

…ilures - Normalize model names in cache key using resolve_cached_repo_id_case() to avoid duplicate entries for different casings of the same HF repo (aligns with case normalization from unslothai#4822) - Return None instead of False on transient failures (network errors, subprocess timeouts, HF API issues) so the cache layer can distinguish "definitely not a vision model" from "failed to check" - Only cache definitive True/False results; transient failures are retried on the next call instead of being permanently locked in as False

…ation - Subprocess non-zero exit, JSON errors, and general exceptions return False (deterministic, cached) instead of None (retryable). Only subprocess.TimeoutExpired returns None since timeouts are transient. - Wrap cache key normalization in try/except so resolve_cached_repo_id_case or normalize_path failures fall back to raw model_name instead of crashing callers.

danielhanchen · 2026-04-06T11:56:16Z

/gemini review

danielhanchen · 2026-04-06T11:56:17Z

@codex review

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a1d7fbf646

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-04-06T11:58:29Z

+    if result is not None:
+        _vision_detection_cache[cache_key] = result


Avoid caching failed subprocess detections as non-vision

This commit caches every non-None result, but _is_vision_model_subprocess() still converts many subprocess failures into False (e.g., transient HuggingFace/network errors that cause a non-zero exit). With the new cache, a temporary failure on the first lookup can now persistently mark that (model_name, hf_token) as non-vision for the entire process, so later calls never retry and cannot recover; before caching, subsequent calls could succeed.

Useful? React with 👍 / 👎.

gemini-code-assist

Code Review

This pull request introduces a caching mechanism for vision model detection to prevent redundant, expensive subprocess spawns. It refactors is_vision_model to include model name normalization and updates error handling to return None for transient failures, ensuring only definitive results are cached. Review feedback suggests implementing thread-safety for the new cache using a locking pattern to avoid race conditions in concurrent environments. There is also a recommendation to remove a redundant try-except block where internal error handling is already sufficient.

gemini-code-assist · 2026-04-06T11:59:52Z

+    if cache_key in _vision_detection_cache:
+        return _vision_detection_cache[cache_key]
+
+    result = _is_vision_model_uncached(model_name, hf_token)
+    # Only cache definitive results; None means a transient failure occurred
+    # and we should retry on the next call instead of locking in a wrong answer.
+    if result is not None:
+        _vision_detection_cache[cache_key] = result
+        return result
+    return False


The current caching implementation is not thread-safe, similar to the existing _audio_detection_cache. In a concurrent environment, multiple threads could call is_vision_model for the same uncached model, leading to a race condition where the expensive _is_vision_model_uncached function is executed multiple times. This undermines the optimization.

To address this, I recommend using a lock with a double-checked locking pattern. This ensures that for a given model, the detection logic runs only once, even with concurrent requests. You'll need to define a threading.Lock at the module level (e.g., _vision_cache_lock = threading.Lock()) and import threading.

While this serializes cache misses for different models, it's preferable to running the 12-second subprocess multiple times for the same model.

# Check cache (lock-free for performance) if cache_key in _vision_detection_cache: return _vision_detection_cache[cache_key] with _vision_cache_lock: # Double-check inside lock if cache_key in _vision_detection_cache: return _vision_detection_cache[cache_key] result = _is_vision_model_uncached(model_name, hf_token) if result is not None: _vision_detection_cache[cache_key] = result return result # For transient errors (result is None), don't cache and return False. return False

gemini-code-assist · 2026-04-06T11:59:52Z

+    try:
+        if is_local_path(model_name):
+            normalized = normalize_path(model_name)
+        else:
+            normalized = resolve_cached_repo_id_case(model_name)
+    except Exception:
+        normalized = model_name


This try...except Exception block appears to be unnecessary. The functions it wraps, is_local_path and resolve_cached_repo_id_case, already contain their own internal error handling and provide safe fallbacks. As they are not expected to raise exceptions here, this broad exception handler is redundant and can be removed for cleaner code.

if is_local_path(model_name): normalized = normalize_path(model_name) else: normalized = resolve_cached_repo_id_case(model_name)

danielhanchen · 2026-04-06T12:08:12Z

Consolidated Review (12 reviewers)

Thank you for the PR! The goal of this PR is to cache is_vision_model() results to avoid redundant subprocess spawns and HuggingFace API calls during a Studio session. As a summary, this PR adds a module-level _vision_detection_cache dict, splits is_vision_model() into a cached wrapper and _is_vision_model_uncached(), normalizes cache keys, and changes subprocess timeouts to return None (uncacheable) instead of False.

Reviewers	Severity	Finding
9/12	High	Subprocess failures (non-zero exit, error in output, generic exceptions) in `_is_vision_model_subprocess` are still returned as `False` and cached permanently. A transient HF/auth/network error on first lookup locks in a wrong `False` for the entire process lifetime. All failure paths should return `None`.
5/12	Medium	Raw `hf_token` string stored as dict key at L651 -- secret persists in process memory for lifetime. The audio cache does not store tokens in keys at all. Use a SHA256 hash/fingerprint instead.
3/12	Medium	`_is_vision_model_uncached` final `except Exception` at L732 now returns `None` for ALL exceptions including permanent ones (`RepositoryNotFoundError`, `GatedRepoError`). Nonexistent models retry `AutoConfig.from_pretrained` on every call. Should distinguish permanent vs transient failures.
3/12	Medium	Thread safety -- concurrent requests for same uncached model all miss cache and each spawn a 60s subprocess. A `threading.Lock` with double-checked locking prevents the thundering herd.
3/12	Medium	Pre-existing: callers at `routes/models.py:621` and `trainer.py:193,561` omit `hf_token` even though it is available in scope.
2/12	Low	Normalization exception at L649 is silently swallowed -- should log at debug level.
1/12	Low	Normalized key used for cache but original `model_name` passed to detection -- case-variant cache poisoning possible.

Committing fixes for findings 1-6 directly.

… safety, token security - All subprocess failure paths now return None (transient) instead of False, preventing permanent misclassification of VLMs after temporary HF/auth/network errors - Use SHA256 fingerprint for hf_token in cache key instead of raw bearer token - Add threading.Lock with double-checked locking to prevent thundering herd of concurrent subprocess spawns for the same uncached model - Distinguish permanent failures (RepositoryNotFoundError, GatedRepoError, ValueError) from transient ones in _is_vision_model_uncached - Pass resolved/normalized model name to detection (not just cache key) - Log normalization fallback at debug level instead of silent swallow - Thread hf_token through callers in routes/models.py and trainer.py that previously omitted it

danielhanchen · 2026-04-06T12:15:57Z

/gemini review

danielhanchen · 2026-04-06T12:15:57Z

@codex review

gemini-code-assist

Code Review

This pull request introduces a caching mechanism for vision model detection to reduce redundant subprocess spawns and API calls, utilizing a thread-safe dictionary and token fingerprinting. The review highlights two potential issues: the current implementation treats all subprocess failures as transient (preventing caching of permanent errors like repository access issues), and the use of a global lock during potentially long-running subprocess operations may cause performance bottlenecks. I recommend addressing these by distinguishing between error types in the subprocess and adopting a more granular locking strategy.

gemini-code-assist · 2026-04-06T12:18:17Z

                stderr or result.stdout.strip(),
            )
-            return False
+            return None


Returning None for all subprocess failures prevents the result from being cached in is_vision_model. While this is appropriate for transient errors (like timeouts), permanent errors such as "Repository Not Found" or "Gated Repo Error" (which the subprocess script catches and returns as an error field in the JSON output) should ideally be cached as False to avoid redundant subprocess spawns on subsequent calls for the same invalid or inaccessible model. This creates an inconsistency with the non-subprocess path, which correctly caches these permanent failures. Consider parsing the error message from the subprocess to distinguish between transient and permanent failures.

gemini-code-assist · 2026-04-06T12:18:17Z

+    with _vision_cache_lock:
+        # Double-check inside lock to prevent thundering herd.
+        if cache_key in _vision_detection_cache:
+            return _vision_detection_cache[cache_key]
+
+        result = _is_vision_model_uncached(resolved_name, hf_token)


Holding the global _vision_cache_lock while executing _is_vision_model_uncached is problematic because the latter can perform long-running operations, such as network requests or spawning a subprocess (which has a 60-second timeout). This serializes all vision model checks across the entire application, potentially blocking other threads for several seconds even if they are checking different models. Consider using a more granular locking strategy (e.g., a lock per model name) or releasing the lock during the computation and only re-acquiring it to update the cache, while accepting that a race condition might occasionally lead to redundant checks for the same model.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 06da5aeaa1

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-04-06T12:20:06Z

+        if cache_key in _vision_detection_cache:
+            return _vision_detection_cache[cache_key]
+
+        result = _is_vision_model_uncached(resolved_name, hf_token)


Avoid serializing all vision checks behind global lock

is_vision_model() acquires _vision_cache_lock and then executes _is_vision_model_uncached(), which may perform remote Hugging Face calls or a subprocess with a 60s timeout. This means concurrent cache misses for different models are forced to run one-at-a-time, so a single slow lookup can stall unrelated requests (e.g., multiple users hitting model config/training simultaneously), creating a new throughput/latency regression compared with the previous lock-free detection path.

Useful? React with 👍 / 👎.

- Move detection computation outside the lock to avoid serializing long-running subprocess spawns (60s timeout) and HF API calls across all concurrent model checks. Lock is now only held for cache writes. - Use full SHA256 digest for token fingerprint instead of truncated 16-char prefix to eliminate collision risk.

- Add fallback import path for RepositoryNotFoundError/GatedRepoError from huggingface_hub.utils (older hub versions) when .errors is not available - Use sentinel-based dict.get() for single atomic cache read instead of two-step in/[] pattern (future-proof for no-GIL runtimes)

danielhanchen · 2026-04-06T12:30:50Z

/gemini review

danielhanchen · 2026-04-06T12:30:51Z

@codex review

gemini-code-assist

Code Review

This pull request introduces a caching mechanism for vision model detection to reduce overhead from subprocess spawns and HuggingFace API calls. It implements token fingerprinting for secure cache keys and refactors error handling to differentiate between transient and permanent failures. A review comment suggests simplifying the exception handling logic by moving HuggingFace error imports to a higher level to avoid nested try-except blocks.

gemini-code-assist · 2026-04-06T12:33:32Z

+        if RepositoryNotFoundError is not None and isinstance(
+            e, (RepositoryNotFoundError, GatedRepoError)
+        ):


The logic for checking RepositoryNotFoundError and GatedRepoError can be simplified by importing them at the module level, avoiding the nested try-except blocks inside the exception handler.

from huggingface_hub.errors import RepositoryNotFoundError, GatedRepoError if isinstance(e, (RepositoryNotFoundError, GatedRepoError)):

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6fcf2c4f6d

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-04-06T12:36:23Z

+            exc,
+        )
+        resolved_name = model_name
+    cache_key = (resolved_name, _token_fingerprint(hf_token))


Include ambient auth state in vision cache key

is_vision_model() builds its cache key from hf_token only, but _is_vision_model_uncached() can still authenticate via default Hugging Face credentials when hf_token is None (load_model_config(..., use_auth=True) falls back to ambient auth). This means a result cached before login/token update can be reused after auth state changes in the same process (for example after HF_TOKEN is set later), returning a stale vision/non-vision answer for gated models until restart. Consider incorporating effective default auth state into the key, or avoid caching when hf_token is omitted.

Useful? React with 👍 / 👎.

danielhanchen · 2026-04-06T12:57:42Z

Fixes pushed to original PR unslothai#4853. Closing staging copy.

Twelve P1 findings from round 26 reviewer aggregate, plus the CI revert of round 25 P1 #5 to a less invasive location. 1. requirements/studio.txt + requirements/single-env/constraints.txt: revert the round 25 huggingface-hub bump (broke Studio Update CI, Mac Studio Update CI, Mac Studio UI CI, Studio UI CI all with ResolutionImpossible against transformers==4.57.6 which requires hub<1.0). Standard install path stays on the well-tested 4.57.6 + 0.36.2 + trl 0.23.1 trio. 2. requirements/no-torch-runtime.txt + pyproject.toml [huggingfacenotorch]: bump huggingface_hub floor from >=0.34.0 to >=1.3.0,<2.0 -- this is where the actual transformers 5.x + hub 0.36.2 broken combo can land because the file installs --no-deps. transformers 5.x calls hub.is_offline_mode which only exists in hub 1.x. 3. utils/datasets/llm_assist.py: revert round 25 P1 #4 (helper/advisor sharing the global llama backend) which introduced three regressions: a chat-evict load race after the busy precheck, a finally-block that could unload a user chat model, and an identifier mismatch the delete guard could not canonicalize. Go back to PRIVATE LlamaCppBackend instances and expose the active helper/advisor repos through a new thread-safe registry (helper_advisor_owns_repo / _register_helper_advisor_repo / _unregister_helper_advisor_repo) so DELETE /api/models/delete-cached can still block the rmtree. 4. routes/models.py delete_cached_model: check the new helper/advisor registry up front and 409 if a helper/advisor still owns the target repo. Closes round 26 P1 #13 and #14 (helper/advisor identifiers were prefixed and would never equal the raw repo id). 5. routes/models.py get_lora_base_model: validate lora_path with _validate_logged_identifier before it is reflected in 404 detail and error logs (round 26 P1 #12). 6. routes/inference.py /unload: round 21 P1 #3 added a "or not is_loaded" fallback that let an unload of owner/B cancel a pending llama load of owner/A. Replace it with a narrow llama_is_starting_without_identifier branch that only fires when llama-server is mid-startup with neither identifier set (round 26 P1 #5). 7. routes/inference.py /unload: poll loading_model_identifier for up to 5 s after asyncio.to_thread(unload_model) so a legitimate pending-load cancel does not 503 because the load thread has not yet observed _cancel_event in its finally (round 26 P2 #15). 8. models/training.py TrainingStartRequest: extend identifier hardening to hf_dataset, subset, train_split, eval_split. Round 22 only guarded model_name (round 26 P1 #10). 9. models/data_recipe.py SeedInspectRequest: add _no_control_chars + _reject_embedded_hf_token field_validators on dataset_name (round 26 P1 #11). Tests: 105 targeted (diffusion + cached_gguf + llama_cpp_cache + inference_model_validation + models_get_model_config) and 1768 broader backend tests pass locally. Pre-existing test_desktop_auth.py, test_studio_api.py, and test_training_worker_flash_attn.py failures reproduce on HEAD without these changes.

Twelve actionable P1/P2 findings from round 28 reviewer aggregate. Skipped #3 (studio.txt huggingface-hub bump) because the empirical CI evidence in round 26 contradicts that suggestion: bumping the pin there breaks installs that apply constraints.txt (transformers==4.57.6 requires hub<1.0). The actual broken combo only happens via the --no-deps no-torch path which is already bumped in no-torch-runtime.txt and pyproject.toml huggingfacenotorch. 1. utils/datasets/llm_assist.py: split _HELPER_ADVISOR_REFCOUNT into CACHE vs GPU counters. helper_advisor_owns_repo (used by delete-cache) reads CACHE; helper_advisor_busy (used by public handoffs) reads GPU. precache_helper_gguf now registers with gpu_owner=False so a background pre-cache download does not 503 every chat / training / export / diffusion load. 2. utils/datasets/llm_assist.py: introduce _HELPER_ADVISOR_START_LOCK and wrap the busy precheck + register pair in _run_with_helper and _run_multi_pass_advisor. Two concurrent helper / advisor invocations could both pass _gpu_workload_busy_for_helper before either registered, then OOM each other. 3. utils/datasets/llm_assist.py: _gpu_workload_busy_for_helper now also returns True when another helper/advisor already holds the private LlamaCppBackend. 4. routes/inference.py: add _raise_if_helper_advisor_busy(workload) that 503s when AI Assist owns the GPU. Wire it into both chat load branches (GGUF + safetensors) BEFORE the existing _release_export_for / _release_diffusion_for calls so we do not first tear down an idle export / diffusion just to fail on the helper check. 5. routes/training.py + routes/export.py + diffusion.load_model: call the helper-busy check FIRST before any release helper fires. Mirrors the chat-load ordering. 6. routes/inference.py _release_llama_for: poll loading_model_identifier for up to 5 s after unload_model() so a cancelled pending GGUF download has time to clear its identifier. Mirrors the same wait round 26 added to the explicit /api/inference/unload route. 7. core/inference/diffusion.py _release_chat_backend_for_diffusion: same 5 s settling wait for cancelled pending GGUF downloads. 8. models/inference.py LoadRequest: validate every llama_extra_args entry through _no_control_chars + _reject_embedded_hf_token. The list was forwarded verbatim to a logged llama-server command line, so a smuggled control char or hf_... token would land in logs and subprocess args. 9. routes/models.py /gguf-download-progress: apply _validate_logged_identifier to repo_id and variant, matching the round 24 hardening on the adjacent generic /download-progress. 10. routes/inference.py diffusion-load RuntimeError classifier: treat "AI Assist ..." messages as retryable 503 instead of 400 (round 28 P2 #15). Mirrors the round 18/19 markers for chat unload failures. Tests: 105 targeted + 1768 broader backend tests pass locally.

rolandtannous and others added 6 commits April 5, 2026 02:09

[pre-commit.ci] auto fixes from pre-commit.com hooks

a00536c

for more information, see https://pre-commit.ci

Include hf_token in vision cache key for gated model correctness

0478d78

Cache key is now (model_name, hf_token) instead of just model_name. This prevents stale False results when an unauthenticated probe for a gated model is followed by an authenticated call.

Remove test file from main PR - will be submitted separately

efacb85

chatgpt-codex-connector Bot reviewed Apr 6, 2026

View reviewed changes

gemini-code-assist Bot reviewed Apr 6, 2026

View reviewed changes

github-code-quality Bot found potential problems Apr 6, 2026

View reviewed changes

Comment thread studio/backend/utils/models/model_config.py Fixed

gemini-code-assist Bot reviewed Apr 6, 2026

View reviewed changes

chatgpt-codex-connector Bot reviewed Apr 6, 2026

View reviewed changes

github-code-quality Bot found potential problems Apr 6, 2026

View reviewed changes

Comment thread studio/backend/utils/models/model_config.py Fixed

gemini-code-assist Bot reviewed Apr 6, 2026

View reviewed changes

chatgpt-codex-connector Bot reviewed Apr 6, 2026

View reviewed changes

danielhanchen closed this Apr 6, 2026

		if result is not None:
		_vision_detection_cache[cache_key] = result

Conversation

danielhanchen commented Apr 6, 2026

Staging mirror of unslothai#4853

Original description

Summary

Test plan

Uh oh!

danielhanchen commented Apr 6, 2026

Uh oh!

danielhanchen commented Apr 6, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Apr 6, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Apr 6, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Apr 6, 2026

Choose a reason for hiding this comment

Uh oh!

danielhanchen commented Apr 6, 2026

Consolidated Review (12 reviewers)

Uh oh!

Uh oh!

danielhanchen commented Apr 6, 2026

Uh oh!

danielhanchen commented Apr 6, 2026

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Apr 6, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Apr 6, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Apr 6, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

danielhanchen commented Apr 6, 2026

Uh oh!

danielhanchen commented Apr 6, 2026

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Apr 6, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Apr 6, 2026

Choose a reason for hiding this comment

Uh oh!

danielhanchen commented Apr 6, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development