Skip to content

[Studio][Optimization]Add vision detection cache to is_vision_model()#15

Closed
danielhanchen wants to merge 9 commits into
mainfrom
pr-4853-head
Closed

[Studio][Optimization]Add vision detection cache to is_vision_model()#15
danielhanchen wants to merge 9 commits into
mainfrom
pr-4853-head

Conversation

@danielhanchen

Copy link
Copy Markdown
Member

Staging mirror of unslothai#4853

Original PR: unslothai#4853
Author: rolandtannous

This is a staging copy for review and editing. Once finalized, changes will be pushed back to the original PR.


Original description

Summary

  • Adds a module-level _vision_detection_cache dict to is_vision_model() following the exact same pattern as the existing _audio_detection_cache used by detect_audio_type()
  • Refactors is_vision_model() into a thin cache wrapper around _is_vision_model_uncached() — detection logic is completely unchanged
  • Eliminates ~12s of wasted time per training run from redundant subprocess spawns and HuggingFace API calls (applies to all models, not just transformers 5.x)

Test plan

  • Added 11 pytest tests in test_vision_cache.py covering cache hits, misses, False caching, subprocess path, exception handling, direct detection path, audio exclusion, and token handling
  • All 11 new tests pass
  • All existing backend tests unaffected (276 passed, 4 skipped, 2 pre-existing failures unrelated to this change)
  • Run training on a transformers 5.x model (e.g. unsloth/Qwen3.5-2B) — verify "checking vision via subprocess" appears once per process instead of 2+
  • Run training on a non-transformers-5 model — verify direct path still works and caches correctly

rolandtannous and others added 6 commits April 5, 2026 02:09
…bprocess spawns

is_vision_model() is called 4-5 times per training run for the same model
with zero caching. For transformers 5.x models, each call spawns a full
subprocess (~6s each). This adds a module-level _vision_detection_cache dict
following the same pattern as the existing _audio_detection_cache used by
detect_audio_type(). The function is refactored into a thin cache wrapper
around _is_vision_model_uncached(), saving ~12s per training run.
Cache key is now (model_name, hf_token) instead of just model_name.
This prevents stale False results when an unauthenticated probe for a
gated model is followed by an authenticated call.
…ilures

- Normalize model names in cache key using resolve_cached_repo_id_case()
  to avoid duplicate entries for different casings of the same HF repo
  (aligns with case normalization from unslothai#4822)
- Return None instead of False on transient failures (network errors,
  subprocess timeouts, HF API issues) so the cache layer can distinguish
  "definitely not a vision model" from "failed to check"
- Only cache definitive True/False results; transient failures are retried
  on the next call instead of being permanently locked in as False
…ation

- Subprocess non-zero exit, JSON errors, and general exceptions return
  False (deterministic, cached) instead of None (retryable). Only
  subprocess.TimeoutExpired returns None since timeouts are transient.
- Wrap cache key normalization in try/except so resolve_cached_repo_id_case
  or normalize_path failures fall back to raw model_name instead of
  crashing callers.
@danielhanchen

Copy link
Copy Markdown
Member Author

/gemini review

@danielhanchen

Copy link
Copy Markdown
Member Author

@codex review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a1d7fbf646

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +659 to +660
if result is not None:
_vision_detection_cache[cache_key] = result

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Avoid caching failed subprocess detections as non-vision

This commit caches every non-None result, but _is_vision_model_subprocess() still converts many subprocess failures into False (e.g., transient HuggingFace/network errors that cause a non-zero exit). With the new cache, a temporary failure on the first lookup can now persistently mark that (model_name, hf_token) as non-vision for the entire process, so later calls never retry and cannot recover; before caching, subsequent calls could succeed.

Useful? React with 👍 / 👎.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a caching mechanism for vision model detection to prevent redundant, expensive subprocess spawns. It refactors is_vision_model to include model name normalization and updates error handling to return None for transient failures, ensuring only definitive results are cached. Review feedback suggests implementing thread-safety for the new cache using a locking pattern to avoid race conditions in concurrent environments. There is also a recommendation to remove a redundant try-except block where internal error handling is already sufficient.

Comment on lines +653 to +662
if cache_key in _vision_detection_cache:
return _vision_detection_cache[cache_key]

result = _is_vision_model_uncached(model_name, hf_token)
# Only cache definitive results; None means a transient failure occurred
# and we should retry on the next call instead of locking in a wrong answer.
if result is not None:
_vision_detection_cache[cache_key] = result
return result
return False

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The current caching implementation is not thread-safe, similar to the existing _audio_detection_cache. In a concurrent environment, multiple threads could call is_vision_model for the same uncached model, leading to a race condition where the expensive _is_vision_model_uncached function is executed multiple times. This undermines the optimization.

To address this, I recommend using a lock with a double-checked locking pattern. This ensures that for a given model, the detection logic runs only once, even with concurrent requests. You'll need to define a threading.Lock at the module level (e.g., _vision_cache_lock = threading.Lock()) and import threading.

While this serializes cache misses for different models, it's preferable to running the 12-second subprocess multiple times for the same model.

    # Check cache (lock-free for performance)
    if cache_key in _vision_detection_cache:
        return _vision_detection_cache[cache_key]

    with _vision_cache_lock:
        # Double-check inside lock
        if cache_key in _vision_detection_cache:
            return _vision_detection_cache[cache_key]

        result = _is_vision_model_uncached(model_name, hf_token)
        if result is not None:
            _vision_detection_cache[cache_key] = result
            return result
        
        # For transient errors (result is None), don't cache and return False.
        return False

Comment on lines +644 to +650
try:
if is_local_path(model_name):
normalized = normalize_path(model_name)
else:
normalized = resolve_cached_repo_id_case(model_name)
except Exception:
normalized = model_name

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

This try...except Exception block appears to be unnecessary. The functions it wraps, is_local_path and resolve_cached_repo_id_case, already contain their own internal error handling and provide safe fallbacks. As they are not expected to raise exceptions here, this broad exception handler is redundant and can be removed for cleaner code.

    if is_local_path(model_name):
        normalized = normalize_path(model_name)
    else:
        normalized = resolve_cached_repo_id_case(model_name)

@danielhanchen

Copy link
Copy Markdown
Member Author

Consolidated Review (12 reviewers)

Thank you for the PR! The goal of this PR is to cache is_vision_model() results to avoid redundant subprocess spawns and HuggingFace API calls during a Studio session. As a summary, this PR adds a module-level _vision_detection_cache dict, splits is_vision_model() into a cached wrapper and _is_vision_model_uncached(), normalizes cache keys, and changes subprocess timeouts to return None (uncacheable) instead of False.

Reviewers Severity Finding
9/12 High Subprocess failures (non-zero exit, error in output, generic exceptions) in _is_vision_model_subprocess are still returned as False and cached permanently. A transient HF/auth/network error on first lookup locks in a wrong False for the entire process lifetime. All failure paths should return None.
5/12 Medium Raw hf_token string stored as dict key at L651 -- secret persists in process memory for lifetime. The audio cache does not store tokens in keys at all. Use a SHA256 hash/fingerprint instead.
3/12 Medium _is_vision_model_uncached final except Exception at L732 now returns None for ALL exceptions including permanent ones (RepositoryNotFoundError, GatedRepoError). Nonexistent models retry AutoConfig.from_pretrained on every call. Should distinguish permanent vs transient failures.
3/12 Medium Thread safety -- concurrent requests for same uncached model all miss cache and each spawn a 60s subprocess. A threading.Lock with double-checked locking prevents the thundering herd.
3/12 Medium Pre-existing: callers at routes/models.py:621 and trainer.py:193,561 omit hf_token even though it is available in scope.
2/12 Low Normalization exception at L649 is silently swallowed -- should log at debug level.
1/12 Low Normalized key used for cache but original model_name passed to detection -- case-variant cache poisoning possible.

Committing fixes for findings 1-6 directly.

… safety, token security

- All subprocess failure paths now return None (transient) instead of False,
  preventing permanent misclassification of VLMs after temporary HF/auth/network errors
- Use SHA256 fingerprint for hf_token in cache key instead of raw bearer token
- Add threading.Lock with double-checked locking to prevent thundering herd
  of concurrent subprocess spawns for the same uncached model
- Distinguish permanent failures (RepositoryNotFoundError, GatedRepoError,
  ValueError) from transient ones in _is_vision_model_uncached
- Pass resolved/normalized model name to detection (not just cache key)
- Log normalization fallback at debug level instead of silent swallow
- Thread hf_token through callers in routes/models.py and trainer.py
  that previously omitted it
Comment thread studio/backend/utils/models/model_config.py Fixed
@danielhanchen

Copy link
Copy Markdown
Member Author

/gemini review

@danielhanchen

Copy link
Copy Markdown
Member Author

@codex review

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a caching mechanism for vision model detection to reduce redundant subprocess spawns and API calls, utilizing a thread-safe dictionary and token fingerprinting. The review highlights two potential issues: the current implementation treats all subprocess failures as transient (preventing caching of permanent errors like repository access issues), and the use of a global lock during potentially long-running subprocess operations may cause performance bottlenecks. I recommend addressing these by distinguishing between error types in the subprocess and adopting a more granular locking strategy.

stderr or result.stdout.strip(),
)
return False
return None

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Returning None for all subprocess failures prevents the result from being cached in is_vision_model. While this is appropriate for transient errors (like timeouts), permanent errors such as "Repository Not Found" or "Gated Repo Error" (which the subprocess script catches and returns as an error field in the JSON output) should ideally be cached as False to avoid redundant subprocess spawns on subsequent calls for the same invalid or inaccessible model. This creates an inconsistency with the non-subprocess path, which correctly caches these permanent failures. Consider parsing the error message from the subprocess to distinguish between transient and permanent failures.

Comment on lines +675 to +680
with _vision_cache_lock:
# Double-check inside lock to prevent thundering herd.
if cache_key in _vision_detection_cache:
return _vision_detection_cache[cache_key]

result = _is_vision_model_uncached(resolved_name, hf_token)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Holding the global _vision_cache_lock while executing _is_vision_model_uncached is problematic because the latter can perform long-running operations, such as network requests or spawning a subprocess (which has a 60-second timeout). This serializes all vision model checks across the entire application, potentially blocking other threads for several seconds even if they are checking different models. Consider using a more granular locking strategy (e.g., a lock per model name) or releasing the lock during the computation and only re-acquiring it to update the cache, while accepting that a race condition might occasionally lead to redundant checks for the same model.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 06da5aeaa1

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

if cache_key in _vision_detection_cache:
return _vision_detection_cache[cache_key]

result = _is_vision_model_uncached(resolved_name, hf_token)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Avoid serializing all vision checks behind global lock

is_vision_model() acquires _vision_cache_lock and then executes _is_vision_model_uncached(), which may perform remote Hugging Face calls or a subprocess with a 60s timeout. This means concurrent cache misses for different models are forced to run one-at-a-time, so a single slow lookup can stall unrelated requests (e.g., multiple users hitting model config/training simultaneously), creating a new throughput/latency regression compared with the previous lock-free detection path.

Useful? React with 👍 / 👎.

- Move detection computation outside the lock to avoid serializing
  long-running subprocess spawns (60s timeout) and HF API calls across
  all concurrent model checks. Lock is now only held for cache writes.
- Use full SHA256 digest for token fingerprint instead of truncated
  16-char prefix to eliminate collision risk.
Comment thread studio/backend/utils/models/model_config.py Fixed
- Add fallback import path for RepositoryNotFoundError/GatedRepoError
  from huggingface_hub.utils (older hub versions) when .errors is
  not available
- Use sentinel-based dict.get() for single atomic cache read instead
  of two-step in/[] pattern (future-proof for no-GIL runtimes)
@danielhanchen

Copy link
Copy Markdown
Member Author

/gemini review

@danielhanchen

Copy link
Copy Markdown
Member Author

@codex review

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a caching mechanism for vision model detection to reduce overhead from subprocess spawns and HuggingFace API calls. It implements token fingerprinting for secure cache keys and refactors error handling to differentiate between transient and permanent failures. A review comment suggests simplifying the exception handling logic by moving HuggingFace error imports to a higher level to avoid nested try-except blocks.

Comment on lines +771 to +773
if RepositoryNotFoundError is not None and isinstance(
e, (RepositoryNotFoundError, GatedRepoError)
):

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The logic for checking RepositoryNotFoundError and GatedRepoError can be simplified by importing them at the module level, avoiding the nested try-except blocks inside the exception handler.

        from huggingface_hub.errors import RepositoryNotFoundError, GatedRepoError
        if isinstance(e, (RepositoryNotFoundError, GatedRepoError)):

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6fcf2c4f6d

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

exc,
)
resolved_name = model_name
cache_key = (resolved_name, _token_fingerprint(hf_token))

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Include ambient auth state in vision cache key

is_vision_model() builds its cache key from hf_token only, but _is_vision_model_uncached() can still authenticate via default Hugging Face credentials when hf_token is None (load_model_config(..., use_auth=True) falls back to ambient auth). This means a result cached before login/token update can be reused after auth state changes in the same process (for example after HF_TOKEN is set later), returning a stale vision/non-vision answer for gated models until restart. Consider incorporating effective default auth state into the key, or avoid caching when hf_token is omitted.

Useful? React with 👍 / 👎.

@danielhanchen

Copy link
Copy Markdown
Member Author

Fixes pushed to original PR unslothai#4853. Closing staging copy.

danielhanchen added a commit that referenced this pull request May 25, 2026
Twelve P1 findings from round 26 reviewer aggregate, plus the CI
revert of round 25 P1 #5 to a less invasive location.

1. requirements/studio.txt + requirements/single-env/constraints.txt:
   revert the round 25 huggingface-hub bump (broke Studio Update CI,
   Mac Studio Update CI, Mac Studio UI CI, Studio UI CI all with
   ResolutionImpossible against transformers==4.57.6 which requires
   hub<1.0). Standard install path stays on the well-tested 4.57.6 +
   0.36.2 + trl 0.23.1 trio.

2. requirements/no-torch-runtime.txt + pyproject.toml
   [huggingfacenotorch]: bump huggingface_hub floor from >=0.34.0 to
   >=1.3.0,<2.0 -- this is where the actual transformers 5.x +
   hub 0.36.2 broken combo can land because the file installs
   --no-deps. transformers 5.x calls hub.is_offline_mode which only
   exists in hub 1.x.

3. utils/datasets/llm_assist.py: revert round 25 P1 #4 (helper/advisor
   sharing the global llama backend) which introduced three
   regressions: a chat-evict load race after the busy precheck, a
   finally-block that could unload a user chat model, and an
   identifier mismatch the delete guard could not canonicalize. Go
   back to PRIVATE LlamaCppBackend instances and expose the active
   helper/advisor repos through a new thread-safe registry
   (helper_advisor_owns_repo / _register_helper_advisor_repo /
   _unregister_helper_advisor_repo) so DELETE /api/models/delete-cached
   can still block the rmtree.

4. routes/models.py delete_cached_model: check the new helper/advisor
   registry up front and 409 if a helper/advisor still owns the
   target repo. Closes round 26 P1 #13 and #14 (helper/advisor
   identifiers were prefixed and would never equal the raw repo id).

5. routes/models.py get_lora_base_model: validate lora_path with
   _validate_logged_identifier before it is reflected in 404 detail
   and error logs (round 26 P1 #12).

6. routes/inference.py /unload: round 21 P1 #3 added a "or not
   is_loaded" fallback that let an unload of owner/B cancel a pending
   llama load of owner/A. Replace it with a narrow
   llama_is_starting_without_identifier branch that only fires when
   llama-server is mid-startup with neither identifier set (round 26
   P1 #5).

7. routes/inference.py /unload: poll loading_model_identifier for up
   to 5 s after asyncio.to_thread(unload_model) so a legitimate
   pending-load cancel does not 503 because the load thread has not
   yet observed _cancel_event in its finally (round 26 P2 #15).

8. models/training.py TrainingStartRequest: extend identifier
   hardening to hf_dataset, subset, train_split, eval_split. Round 22
   only guarded model_name (round 26 P1 #10).

9. models/data_recipe.py SeedInspectRequest: add _no_control_chars +
   _reject_embedded_hf_token field_validators on dataset_name (round
   26 P1 #11).

Tests: 105 targeted (diffusion + cached_gguf + llama_cpp_cache +
inference_model_validation + models_get_model_config) and 1768
broader backend tests pass locally. Pre-existing
test_desktop_auth.py, test_studio_api.py, and
test_training_worker_flash_attn.py failures reproduce on HEAD
without these changes.
danielhanchen added a commit that referenced this pull request May 25, 2026
Twelve actionable P1/P2 findings from round 28 reviewer aggregate.
Skipped #3 (studio.txt huggingface-hub bump) because the empirical
CI evidence in round 26 contradicts that suggestion: bumping the
pin there breaks installs that apply constraints.txt
(transformers==4.57.6 requires hub<1.0). The actual broken combo
only happens via the --no-deps no-torch path which is already
bumped in no-torch-runtime.txt and pyproject.toml huggingfacenotorch.

1. utils/datasets/llm_assist.py: split _HELPER_ADVISOR_REFCOUNT
   into CACHE vs GPU counters. helper_advisor_owns_repo (used by
   delete-cache) reads CACHE; helper_advisor_busy (used by public
   handoffs) reads GPU. precache_helper_gguf now registers with
   gpu_owner=False so a background pre-cache download does not
   503 every chat / training / export / diffusion load.

2. utils/datasets/llm_assist.py: introduce _HELPER_ADVISOR_START_LOCK
   and wrap the busy precheck + register pair in _run_with_helper
   and _run_multi_pass_advisor. Two concurrent helper / advisor
   invocations could both pass _gpu_workload_busy_for_helper before
   either registered, then OOM each other.

3. utils/datasets/llm_assist.py: _gpu_workload_busy_for_helper now
   also returns True when another helper/advisor already holds the
   private LlamaCppBackend.

4. routes/inference.py: add _raise_if_helper_advisor_busy(workload)
   that 503s when AI Assist owns the GPU. Wire it into both chat
   load branches (GGUF + safetensors) BEFORE the existing
   _release_export_for / _release_diffusion_for calls so we do not
   first tear down an idle export / diffusion just to fail on the
   helper check.

5. routes/training.py + routes/export.py + diffusion.load_model:
   call the helper-busy check FIRST before any release helper
   fires. Mirrors the chat-load ordering.

6. routes/inference.py _release_llama_for: poll
   loading_model_identifier for up to 5 s after unload_model() so a
   cancelled pending GGUF download has time to clear its
   identifier. Mirrors the same wait round 26 added to the explicit
   /api/inference/unload route.

7. core/inference/diffusion.py _release_chat_backend_for_diffusion:
   same 5 s settling wait for cancelled pending GGUF downloads.

8. models/inference.py LoadRequest: validate every llama_extra_args
   entry through _no_control_chars + _reject_embedded_hf_token.
   The list was forwarded verbatim to a logged llama-server command
   line, so a smuggled control char or hf_... token would land in
   logs and subprocess args.

9. routes/models.py /gguf-download-progress: apply
   _validate_logged_identifier to repo_id and variant, matching the
   round 24 hardening on the adjacent generic /download-progress.

10. routes/inference.py diffusion-load RuntimeError classifier:
    treat "AI Assist ..." messages as retryable 503 instead of 400
    (round 28 P2 #15). Mirrors the round 18/19 markers for chat
    unload failures.

Tests: 105 targeted + 1768 broader backend tests pass locally.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants