[Studio][Optimization]Add vision detection cache to is_vision_model()#15
[Studio][Optimization]Add vision detection cache to is_vision_model()#15danielhanchen wants to merge 9 commits into
Conversation
…bprocess spawns is_vision_model() is called 4-5 times per training run for the same model with zero caching. For transformers 5.x models, each call spawns a full subprocess (~6s each). This adds a module-level _vision_detection_cache dict following the same pattern as the existing _audio_detection_cache used by detect_audio_type(). The function is refactored into a thin cache wrapper around _is_vision_model_uncached(), saving ~12s per training run.
for more information, see https://pre-commit.ci
Cache key is now (model_name, hf_token) instead of just model_name. This prevents stale False results when an unauthenticated probe for a gated model is followed by an authenticated call.
…ilures - Normalize model names in cache key using resolve_cached_repo_id_case() to avoid duplicate entries for different casings of the same HF repo (aligns with case normalization from unslothai#4822) - Return None instead of False on transient failures (network errors, subprocess timeouts, HF API issues) so the cache layer can distinguish "definitely not a vision model" from "failed to check" - Only cache definitive True/False results; transient failures are retried on the next call instead of being permanently locked in as False
…ation - Subprocess non-zero exit, JSON errors, and general exceptions return False (deterministic, cached) instead of None (retryable). Only subprocess.TimeoutExpired returns None since timeouts are transient. - Wrap cache key normalization in try/except so resolve_cached_repo_id_case or normalize_path failures fall back to raw model_name instead of crashing callers.
|
/gemini review |
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a1d7fbf646
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if result is not None: | ||
| _vision_detection_cache[cache_key] = result |
There was a problem hiding this comment.
Avoid caching failed subprocess detections as non-vision
This commit caches every non-None result, but _is_vision_model_subprocess() still converts many subprocess failures into False (e.g., transient HuggingFace/network errors that cause a non-zero exit). With the new cache, a temporary failure on the first lookup can now persistently mark that (model_name, hf_token) as non-vision for the entire process, so later calls never retry and cannot recover; before caching, subsequent calls could succeed.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Code Review
This pull request introduces a caching mechanism for vision model detection to prevent redundant, expensive subprocess spawns. It refactors is_vision_model to include model name normalization and updates error handling to return None for transient failures, ensuring only definitive results are cached. Review feedback suggests implementing thread-safety for the new cache using a locking pattern to avoid race conditions in concurrent environments. There is also a recommendation to remove a redundant try-except block where internal error handling is already sufficient.
| if cache_key in _vision_detection_cache: | ||
| return _vision_detection_cache[cache_key] | ||
|
|
||
| result = _is_vision_model_uncached(model_name, hf_token) | ||
| # Only cache definitive results; None means a transient failure occurred | ||
| # and we should retry on the next call instead of locking in a wrong answer. | ||
| if result is not None: | ||
| _vision_detection_cache[cache_key] = result | ||
| return result | ||
| return False |
There was a problem hiding this comment.
The current caching implementation is not thread-safe, similar to the existing _audio_detection_cache. In a concurrent environment, multiple threads could call is_vision_model for the same uncached model, leading to a race condition where the expensive _is_vision_model_uncached function is executed multiple times. This undermines the optimization.
To address this, I recommend using a lock with a double-checked locking pattern. This ensures that for a given model, the detection logic runs only once, even with concurrent requests. You'll need to define a threading.Lock at the module level (e.g., _vision_cache_lock = threading.Lock()) and import threading.
While this serializes cache misses for different models, it's preferable to running the 12-second subprocess multiple times for the same model.
# Check cache (lock-free for performance)
if cache_key in _vision_detection_cache:
return _vision_detection_cache[cache_key]
with _vision_cache_lock:
# Double-check inside lock
if cache_key in _vision_detection_cache:
return _vision_detection_cache[cache_key]
result = _is_vision_model_uncached(model_name, hf_token)
if result is not None:
_vision_detection_cache[cache_key] = result
return result
# For transient errors (result is None), don't cache and return False.
return False| try: | ||
| if is_local_path(model_name): | ||
| normalized = normalize_path(model_name) | ||
| else: | ||
| normalized = resolve_cached_repo_id_case(model_name) | ||
| except Exception: | ||
| normalized = model_name |
There was a problem hiding this comment.
This try...except Exception block appears to be unnecessary. The functions it wraps, is_local_path and resolve_cached_repo_id_case, already contain their own internal error handling and provide safe fallbacks. As they are not expected to raise exceptions here, this broad exception handler is redundant and can be removed for cleaner code.
if is_local_path(model_name):
normalized = normalize_path(model_name)
else:
normalized = resolve_cached_repo_id_case(model_name)
Consolidated Review (12 reviewers)Thank you for the PR! The goal of this PR is to cache
Committing fixes for findings 1-6 directly. |
… safety, token security - All subprocess failure paths now return None (transient) instead of False, preventing permanent misclassification of VLMs after temporary HF/auth/network errors - Use SHA256 fingerprint for hf_token in cache key instead of raw bearer token - Add threading.Lock with double-checked locking to prevent thundering herd of concurrent subprocess spawns for the same uncached model - Distinguish permanent failures (RepositoryNotFoundError, GatedRepoError, ValueError) from transient ones in _is_vision_model_uncached - Pass resolved/normalized model name to detection (not just cache key) - Log normalization fallback at debug level instead of silent swallow - Thread hf_token through callers in routes/models.py and trainer.py that previously omitted it
|
/gemini review |
|
@codex review |
There was a problem hiding this comment.
Code Review
This pull request introduces a caching mechanism for vision model detection to reduce redundant subprocess spawns and API calls, utilizing a thread-safe dictionary and token fingerprinting. The review highlights two potential issues: the current implementation treats all subprocess failures as transient (preventing caching of permanent errors like repository access issues), and the use of a global lock during potentially long-running subprocess operations may cause performance bottlenecks. I recommend addressing these by distinguishing between error types in the subprocess and adopting a more granular locking strategy.
| stderr or result.stdout.strip(), | ||
| ) | ||
| return False | ||
| return None |
There was a problem hiding this comment.
Returning None for all subprocess failures prevents the result from being cached in is_vision_model. While this is appropriate for transient errors (like timeouts), permanent errors such as "Repository Not Found" or "Gated Repo Error" (which the subprocess script catches and returns as an error field in the JSON output) should ideally be cached as False to avoid redundant subprocess spawns on subsequent calls for the same invalid or inaccessible model. This creates an inconsistency with the non-subprocess path, which correctly caches these permanent failures. Consider parsing the error message from the subprocess to distinguish between transient and permanent failures.
| with _vision_cache_lock: | ||
| # Double-check inside lock to prevent thundering herd. | ||
| if cache_key in _vision_detection_cache: | ||
| return _vision_detection_cache[cache_key] | ||
|
|
||
| result = _is_vision_model_uncached(resolved_name, hf_token) |
There was a problem hiding this comment.
Holding the global _vision_cache_lock while executing _is_vision_model_uncached is problematic because the latter can perform long-running operations, such as network requests or spawning a subprocess (which has a 60-second timeout). This serializes all vision model checks across the entire application, potentially blocking other threads for several seconds even if they are checking different models. Consider using a more granular locking strategy (e.g., a lock per model name) or releasing the lock during the computation and only re-acquiring it to update the cache, while accepting that a race condition might occasionally lead to redundant checks for the same model.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 06da5aeaa1
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if cache_key in _vision_detection_cache: | ||
| return _vision_detection_cache[cache_key] | ||
|
|
||
| result = _is_vision_model_uncached(resolved_name, hf_token) |
There was a problem hiding this comment.
Avoid serializing all vision checks behind global lock
is_vision_model() acquires _vision_cache_lock and then executes _is_vision_model_uncached(), which may perform remote Hugging Face calls or a subprocess with a 60s timeout. This means concurrent cache misses for different models are forced to run one-at-a-time, so a single slow lookup can stall unrelated requests (e.g., multiple users hitting model config/training simultaneously), creating a new throughput/latency regression compared with the previous lock-free detection path.
Useful? React with 👍 / 👎.
- Move detection computation outside the lock to avoid serializing long-running subprocess spawns (60s timeout) and HF API calls across all concurrent model checks. Lock is now only held for cache writes. - Use full SHA256 digest for token fingerprint instead of truncated 16-char prefix to eliminate collision risk.
- Add fallback import path for RepositoryNotFoundError/GatedRepoError from huggingface_hub.utils (older hub versions) when .errors is not available - Use sentinel-based dict.get() for single atomic cache read instead of two-step in/[] pattern (future-proof for no-GIL runtimes)
|
/gemini review |
|
@codex review |
There was a problem hiding this comment.
Code Review
This pull request introduces a caching mechanism for vision model detection to reduce overhead from subprocess spawns and HuggingFace API calls. It implements token fingerprinting for secure cache keys and refactors error handling to differentiate between transient and permanent failures. A review comment suggests simplifying the exception handling logic by moving HuggingFace error imports to a higher level to avoid nested try-except blocks.
| if RepositoryNotFoundError is not None and isinstance( | ||
| e, (RepositoryNotFoundError, GatedRepoError) | ||
| ): |
There was a problem hiding this comment.
The logic for checking RepositoryNotFoundError and GatedRepoError can be simplified by importing them at the module level, avoiding the nested try-except blocks inside the exception handler.
from huggingface_hub.errors import RepositoryNotFoundError, GatedRepoError
if isinstance(e, (RepositoryNotFoundError, GatedRepoError)):There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 6fcf2c4f6d
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| exc, | ||
| ) | ||
| resolved_name = model_name | ||
| cache_key = (resolved_name, _token_fingerprint(hf_token)) |
There was a problem hiding this comment.
Include ambient auth state in vision cache key
is_vision_model() builds its cache key from hf_token only, but _is_vision_model_uncached() can still authenticate via default Hugging Face credentials when hf_token is None (load_model_config(..., use_auth=True) falls back to ambient auth). This means a result cached before login/token update can be reused after auth state changes in the same process (for example after HF_TOKEN is set later), returning a stale vision/non-vision answer for gated models until restart. Consider incorporating effective default auth state into the key, or avoid caching when hf_token is omitted.
Useful? React with 👍 / 👎.
|
Fixes pushed to original PR unslothai#4853. Closing staging copy. |
Twelve P1 findings from round 26 reviewer aggregate, plus the CI revert of round 25 P1 #5 to a less invasive location. 1. requirements/studio.txt + requirements/single-env/constraints.txt: revert the round 25 huggingface-hub bump (broke Studio Update CI, Mac Studio Update CI, Mac Studio UI CI, Studio UI CI all with ResolutionImpossible against transformers==4.57.6 which requires hub<1.0). Standard install path stays on the well-tested 4.57.6 + 0.36.2 + trl 0.23.1 trio. 2. requirements/no-torch-runtime.txt + pyproject.toml [huggingfacenotorch]: bump huggingface_hub floor from >=0.34.0 to >=1.3.0,<2.0 -- this is where the actual transformers 5.x + hub 0.36.2 broken combo can land because the file installs --no-deps. transformers 5.x calls hub.is_offline_mode which only exists in hub 1.x. 3. utils/datasets/llm_assist.py: revert round 25 P1 #4 (helper/advisor sharing the global llama backend) which introduced three regressions: a chat-evict load race after the busy precheck, a finally-block that could unload a user chat model, and an identifier mismatch the delete guard could not canonicalize. Go back to PRIVATE LlamaCppBackend instances and expose the active helper/advisor repos through a new thread-safe registry (helper_advisor_owns_repo / _register_helper_advisor_repo / _unregister_helper_advisor_repo) so DELETE /api/models/delete-cached can still block the rmtree. 4. routes/models.py delete_cached_model: check the new helper/advisor registry up front and 409 if a helper/advisor still owns the target repo. Closes round 26 P1 #13 and #14 (helper/advisor identifiers were prefixed and would never equal the raw repo id). 5. routes/models.py get_lora_base_model: validate lora_path with _validate_logged_identifier before it is reflected in 404 detail and error logs (round 26 P1 #12). 6. routes/inference.py /unload: round 21 P1 #3 added a "or not is_loaded" fallback that let an unload of owner/B cancel a pending llama load of owner/A. Replace it with a narrow llama_is_starting_without_identifier branch that only fires when llama-server is mid-startup with neither identifier set (round 26 P1 #5). 7. routes/inference.py /unload: poll loading_model_identifier for up to 5 s after asyncio.to_thread(unload_model) so a legitimate pending-load cancel does not 503 because the load thread has not yet observed _cancel_event in its finally (round 26 P2 #15). 8. models/training.py TrainingStartRequest: extend identifier hardening to hf_dataset, subset, train_split, eval_split. Round 22 only guarded model_name (round 26 P1 #10). 9. models/data_recipe.py SeedInspectRequest: add _no_control_chars + _reject_embedded_hf_token field_validators on dataset_name (round 26 P1 #11). Tests: 105 targeted (diffusion + cached_gguf + llama_cpp_cache + inference_model_validation + models_get_model_config) and 1768 broader backend tests pass locally. Pre-existing test_desktop_auth.py, test_studio_api.py, and test_training_worker_flash_attn.py failures reproduce on HEAD without these changes.
Twelve actionable P1/P2 findings from round 28 reviewer aggregate. Skipped #3 (studio.txt huggingface-hub bump) because the empirical CI evidence in round 26 contradicts that suggestion: bumping the pin there breaks installs that apply constraints.txt (transformers==4.57.6 requires hub<1.0). The actual broken combo only happens via the --no-deps no-torch path which is already bumped in no-torch-runtime.txt and pyproject.toml huggingfacenotorch. 1. utils/datasets/llm_assist.py: split _HELPER_ADVISOR_REFCOUNT into CACHE vs GPU counters. helper_advisor_owns_repo (used by delete-cache) reads CACHE; helper_advisor_busy (used by public handoffs) reads GPU. precache_helper_gguf now registers with gpu_owner=False so a background pre-cache download does not 503 every chat / training / export / diffusion load. 2. utils/datasets/llm_assist.py: introduce _HELPER_ADVISOR_START_LOCK and wrap the busy precheck + register pair in _run_with_helper and _run_multi_pass_advisor. Two concurrent helper / advisor invocations could both pass _gpu_workload_busy_for_helper before either registered, then OOM each other. 3. utils/datasets/llm_assist.py: _gpu_workload_busy_for_helper now also returns True when another helper/advisor already holds the private LlamaCppBackend. 4. routes/inference.py: add _raise_if_helper_advisor_busy(workload) that 503s when AI Assist owns the GPU. Wire it into both chat load branches (GGUF + safetensors) BEFORE the existing _release_export_for / _release_diffusion_for calls so we do not first tear down an idle export / diffusion just to fail on the helper check. 5. routes/training.py + routes/export.py + diffusion.load_model: call the helper-busy check FIRST before any release helper fires. Mirrors the chat-load ordering. 6. routes/inference.py _release_llama_for: poll loading_model_identifier for up to 5 s after unload_model() so a cancelled pending GGUF download has time to clear its identifier. Mirrors the same wait round 26 added to the explicit /api/inference/unload route. 7. core/inference/diffusion.py _release_chat_backend_for_diffusion: same 5 s settling wait for cancelled pending GGUF downloads. 8. models/inference.py LoadRequest: validate every llama_extra_args entry through _no_control_chars + _reject_embedded_hf_token. The list was forwarded verbatim to a logged llama-server command line, so a smuggled control char or hf_... token would land in logs and subprocess args. 9. routes/models.py /gguf-download-progress: apply _validate_logged_identifier to repo_id and variant, matching the round 24 hardening on the adjacent generic /download-progress. 10. routes/inference.py diffusion-load RuntimeError classifier: treat "AI Assist ..." messages as retryable 503 instead of 400 (round 28 P2 #15). Mirrors the round 18/19 markers for chat unload failures. Tests: 105 targeted + 1768 broader backend tests pass locally.
Staging mirror of unslothai#4853
Original PR: unslothai#4853
Author: rolandtannous
This is a staging copy for review and editing. Once finalized, changes will be pushed back to the original PR.
Original description
Summary
_vision_detection_cachedict tois_vision_model()following the exact same pattern as the existing_audio_detection_cacheused bydetect_audio_type()is_vision_model()into a thin cache wrapper around_is_vision_model_uncached()— detection logic is completely unchangedTest plan
test_vision_cache.pycovering cache hits, misses, False caching, subprocess path, exception handling, direct detection path, audio exclusion, and token handlingunsloth/Qwen3.5-2B) — verify"checking vision via subprocess"appears once per process instead of 2+