Skip to content

studio: reuse HF cached repo casing to prevent duplicate downloads#4822

Merged
danielhanchen merged 6 commits into
unslothai:mainfrom
Imagineer99:main
Apr 3, 2026
Merged

studio: reuse HF cached repo casing to prevent duplicate downloads#4822
danielhanchen merged 6 commits into
unslothai:mainfrom
Imagineer99:main

Conversation

@Imagineer99

@Imagineer99 Imagineer99 commented Apr 3, 2026

Copy link
Copy Markdown
Collaborator

Related to #2592

Some Studio and loader paths lowercased remote model repo IDs.
HuggingFace accepts that for API calls, but the on-disk cache folder name follows the exact string you used when downloading.

Mixed casing from Unsloth vs Transformers/other tools then showed up as two models--… dirs for the same repo and triggered duplicate downloads.

Studio backend:
Stop forcing lowercase in ModelConfig / UI selection; resolve to an existing cached repo-id casing when a case-only variant is already on disk; normalize early in get_model_config so vision/embedding/audio checks use the resolved id.

Paths:
Add resolve_cached_repo_id_case + use it from get_cache_path; optional memoization for hits only (no memo on cache miss so a later download is still picked up).

Inference worker:
Use the same resolver when watching models--…/blobs for download progress.

Core loader:
Preserve casing when stripping -bnb-4bit suffixes on the AMD no-prequant path; return the user’s explicit -bnb-4bit id casing from get_model_name mapping.

Tests moved to #4823*

Note: Dataset repos can have the same class of issue; this PR only handles model repo IDs. Follow-up can mirror the same pattern for repo_type=dataset if we want parity.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a case-resolution utility for Hugging Face repository IDs to avoid duplicate model downloads on case-sensitive systems. By checking the local cache for existing variants before defaulting to lowercasing, the system preserves user intent and optimizes storage. Changes span the inference worker, model routes, and loading utilities, supported by new unit tests. Feedback was provided regarding a broad exception handler in the path utility that lacks logging, which could hinder troubleshooting.

Comment thread studio/backend/utils/paths/path_utils.py
Tests for resolve_cached_repo_id_case and get_model_config case resolution
belong in their own PR to keep this change focused on the runtime fix.
Imagineer99 and others added 2 commits April 3, 2026 12:47
- Check exact-case path before memo to ensure a newly-appeared exact
  match always wins over a previously memoized variant
- Validate memoized entries still exist on disk before returning them
  to prevent stale results when cache dirs are deleted/recreated

@danielhanchen danielhanchen left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the PR! The goal of this PR is to prevent duplicate HF model downloads caused by mixed-casing repo IDs between Unsloth Studio and other HuggingFace tools. As a summary, this PR replaces the previous unconditional lowercasing of remote model identifiers with a cache-aware resolver that reuses the exact casing already present in the local HF cache, and also preserves the original casing when stripping -bnb-4bit suffixes in the core loader.

I ran 7 parallel independent reviews + 3 additional independent reviewers (10 total), covering the runtime logic, memoization correctness, cross-platform behavior (macOS/Windows case-insensitive FS, WSL), suffix stripping edge cases, and backwards compatibility with transformers 4.57.6 / 5.4.0 and TRL 0.22.2 / 0.27.1.

Reviewers Severity Finding
7/10 Fixed Stale memoization: variant hits were memoized without revalidation, so deleted/recreated cache dirs returned stale results in the same process
2/10 Not a bug Path.exists() exact-hit check on case-insensitive FS (macOS/Windows) -- this actually produces correct behavior on those platforms
2/10 P2/wontfix Extra cache-dir scan in get_cache_path -- acceptable overhead for infrequent calls
1/10 False positive -- to / replacement corrupts repo IDs with -- -- HF repo IDs never contain --
1/10 False positive (0.01% edge) Lexicographic tie-break could pick empty cache variant -- requires two case-variant dirs, one empty, which is extremely unlikely in practice

The stale memoization issue (7/10 consensus) was the only real bug. I have already pushed a fix that:

  1. Checks the exact-case path before the memo so newly-appeared exact matches always win
  2. Validates memoized entries still exist on disk before returning them
  3. Drops stale memo entries and re-scans on the next call

I also moved the test files to a separate PR (#4823) to keep this one focused on the runtime changes.

Concrete suggestions for each finding below.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0073692dee

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +112 to +113
resolved_name = resolve_cached_repo_id_case(model_name)
model_cache_name = resolved_name.replace("/", "--")

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Skip case resolution for local filesystem paths

get_cache_path now unconditionally calls resolve_cached_repo_id_case, but that resolver treats any string containing / as a Hub repo ID; local paths like foo/bar (which are valid local selections when the directory exists) now go through case-insensitive HF cache matching. In the detect_audio_type -> _detect_audio_from_tokenizer -> get_cache_path path, this can resolve a local model name to a models--... cache directory and read tokenizer metadata from the wrong source, producing incorrect capability detection for local models while also adding an unnecessary full cache scan.

Useful? React with 👍 / 👎.

@danielhanchen

Copy link
Copy Markdown
Member

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request implements a case-resolution mechanism for Hugging Face repository IDs to prevent redundant model downloads caused by local file system case-sensitivity. It introduces the resolve_cached_repo_id_case utility with memoization and updates various model loading paths to preserve original casing while interacting with the local cache. Review feedback suggests refining the repository ID reconstruction logic to handle names containing double dashes, extending resolution to single-word identifiers, using more specific directory checks, and removing redundant fallback logic in path construction.

repo_part = entry.name[len("models--") :]
if not repo_part:
continue
candidates.append(repo_part.replace("--", "/"))

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Using .replace("--", "/") to reconstruct the repository ID is incorrect because it replaces all occurrences of --, including those that were part of the original namespace or repository name (e.g., my--org/model would incorrectly become my/org/model). Instead, the repository ID should be reconstructed by splitting the folder name based on the structure of the requested model_name.

Suggested change
candidates.append(repo_part.replace("--", "/"))
parts = model_name.split("/")
resolved_parts = []
current_pos = 0
for p in parts:
resolved_parts.append(repo_part[current_pos : current_pos + len(p)])
current_pos += len(p) + 2
candidates.append("/".join(resolved_parts))

"""
_CACHE_CASE_RESOLUTION_STATS["calls"] += 1

if not model_name or "/" not in model_name:

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The check for a slash in the model_name is unnecessarily restrictive. Hugging Face repository IDs can be single-word identifiers (e.g., gpt2, bert-base-uncased) which are stored as models--gpt2 in the cache. Removing this restriction allows the resolver to handle casing issues for these models as well.

Suggested change
if not model_name or "/" not in model_name:
if not model_name:

# Always check the exact-case path first so a newly-appeared exact match
# wins over any previously memoized variant.
exact_path = cache_dir / expected_dir
if exact_path.exists():

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

It is safer to use .is_dir() instead of .exists() when checking for Hugging Face cache repository entries, as these are expected to be directories. This also maintains consistency with the check performed on line 182.

Suggested change
if exact_path.exists():
if exact_path.is_dir():

Comment on lines +719 to +722
repo_dir = get_cache_path(model_name)
if repo_dir is None:
repo_dir_name = f"models--{model_name.replace('/', '--')}"
repo_dir = cache_dir / repo_dir_name

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

This fallback logic is redundant. get_cache_path(model_name) already performs case resolution and checks for the existence of the repository directory (including the exact-case fallback). If it returns None, the directory does not exist in the cache, and manually constructing the path here will not change that. You can simplify this by using a single assignment with a fallback to ensure repo_dir is a valid Path object for the subsequent .exists() check.

Suggested change
repo_dir = get_cache_path(model_name)
if repo_dir is None:
repo_dir_name = f"models--{model_name.replace('/', '--')}"
repo_dir = cache_dir / repo_dir_name
repo_dir = get_cache_path(model_name) or (cache_dir / f"models--{model_name.replace('/', '--')}")

@danielhanchen

Copy link
Copy Markdown
Member

/gemini review

- Use .is_dir() instead of .exists() for exact-case cache check
  (cache entries are always directories)
- Remove redundant fallback in _detect_audio_from_tokenizer since
  get_cache_path already handles case resolution and returns None
  when the model is not cached

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request implements a case-resolution mechanism for Hugging Face repository IDs to prevent duplicate downloads by matching requested names against existing local cache entries. It replaces forced lowercasing with a memoized resolution utility and updates model loading paths across the backend. Feedback highlights an opportunity to simplify redundant path checks in model_config.py and suggests extending case resolution to support top-level model identifiers by removing the mandatory slash check.

Comment on lines 720 to 723
if repo_dir is None:
repo_dir_name = f"models--{model_name.replace('/', '--')}"
repo_dir = cache_dir / repo_dir_name
if repo_dir.exists():

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

This manual path construction and existence check is redundant. get_cache_path(model_name) already performs these checks, including the case-resolution logic and the exact-match fallback. If get_cache_path returns None, it means the model directory (including the original casing) does not exist in the cache.

Suggested change
if repo_dir is None:
repo_dir_name = f"models--{model_name.replace('/', '--')}"
repo_dir = cache_dir / repo_dir_name
if repo_dir.exists():
if repo_dir:

Comment on lines +156 to +158
if not model_name or "/" not in model_name:
_CACHE_CASE_RESOLUTION_STATS["fallbacks"] += 1
return model_name

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The requirement for a slash (/) in the model name prevents case resolution for top-level Hugging Face models (e.g., gpt2, bert-base-uncased). While most models are namespaced, supporting top-level models would ensure consistent behavior across all Hugging Face identifiers. Removing this check is safe as is_local_path is typically called before this resolver.

Suggested change
if not model_name or "/" not in model_name:
_CACHE_CASE_RESOLUTION_STATS["fallbacks"] += 1
return model_name
if not model_name:
_CACHE_CASE_RESOLUTION_STATS["fallbacks"] += 1
return model_name

@danielhanchen

Copy link
Copy Markdown
Member

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request implements a case-resolution mechanism for Hugging Face repository IDs to avoid redundant downloads caused by case-sensitivity differences between local storage and the HF Hub. It introduces the resolve_cached_repo_id_case utility, which scans the local cache for existing variants, and updates model loading paths in the backend and Unsloth modules to utilize this resolution. Review feedback suggests optimizing the resolution function by using os.scandir for performance, extending support to official models that lack slashes in their names, and improving memoization efficiency by using normalized keys.

Comment on lines +147 to +217
def resolve_cached_repo_id_case(model_name: str, use_memo: bool = True) -> str:
"""Resolve repo_id to the exact casing already present in local HF cache.

Policy: prefer the requested/canonical repo_id, but if a case-variant already
exists in local HF cache, reuse that exact cached spelling. This avoids
duplicate downloads while preserving user intent whenever possible.
"""
_CACHE_CASE_RESOLUTION_STATS["calls"] += 1

if not model_name or "/" not in model_name:
_CACHE_CASE_RESOLUTION_STATS["fallbacks"] += 1
return model_name

cache_dir = _hf_hub_cache_dir()
if not cache_dir.exists():
_CACHE_CASE_RESOLUTION_STATS["fallbacks"] += 1
return model_name

expected_dir = f"models--{model_name.replace('/', '--')}"

# Always check the exact-case path first so a newly-appeared exact match
# wins over any previously memoized variant.
exact_path = cache_dir / expected_dir
if exact_path.is_dir():
if use_memo:
_CACHE_CASE_RESOLUTION_MEMO[model_name] = model_name
_CACHE_CASE_RESOLUTION_STATS["exact_hits"] += 1
return model_name

# Validate memoized entries still exist on disk before returning them.
# This prevents stale results when cache dirs are deleted/recreated.
if use_memo:
cached = _CACHE_CASE_RESOLUTION_MEMO.get(model_name)
if cached is not None:
cached_path = cache_dir / f"models--{cached.replace('/', '--')}"
if cached_path.is_dir():
_CACHE_CASE_RESOLUTION_STATS["memo_hits"] += 1
return cached
# Stale entry -- drop it and re-scan below.
_CACHE_CASE_RESOLUTION_MEMO.pop(model_name, None)

expected_lower = expected_dir.lower()
try:
candidates: list[str] = []
for entry in cache_dir.iterdir():
if not entry.is_dir():
continue
if entry.name.lower() != expected_lower:
continue
if not entry.name.startswith("models--"):
continue
repo_part = entry.name[len("models--") :]
if not repo_part:
continue
candidates.append(repo_part.replace("--", "/"))

if candidates:
# Deterministic tie-break if multiple case variants coexist.
resolved = sorted(candidates)[0]
if len(candidates) > 1:
_CACHE_CASE_RESOLUTION_STATS["tie_breaks"] += 1
_CACHE_CASE_RESOLUTION_STATS["variant_hits"] += 1
if use_memo:
_CACHE_CASE_RESOLUTION_MEMO[model_name] = resolved
return resolved
except Exception as exc:
_CACHE_CASE_RESOLUTION_STATS["errors"] += 1
logger.debug(f"Could not resolve cached repo_id case for '{model_name}': {exc}")

_CACHE_CASE_RESOLUTION_STATS["fallbacks"] += 1
return model_name

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The resolve_cached_repo_id_case function can be improved in several ways:

  1. Support for official models: The check "/" not in model_name prevents casing resolution for official models (e.g., gpt2, bert-base-uncased), which are also stored in the HF cache with the models-- prefix. Removing this check ensures all remote models benefit from casing resolution.
  2. Memoization Efficiency: Normalizing the memoization key to lowercase ensures that subsequent requests with different casing for the same model ID (e.g., Org/Repo vs org/repo) hit the cache correctly.
  3. Performance: Using os.scandir instead of Path.iterdir is more efficient as it avoids creating Path objects for every entry in the cache directory, which can be significant if the cache contains many models.
def resolve_cached_repo_id_case(model_name: str, use_memo: bool = True) -> str:
    """Resolve repo_id to the exact casing already present in local HF cache.

    Policy: prefer the requested/canonical repo_id, but if a case-variant already
    exists in local HF cache, reuse that exact cached spelling. This avoids
    duplicate downloads while preserving user intent whenever possible.
    """
    _CACHE_CASE_RESOLUTION_STATS["calls"] += 1

    if not model_name:
        _CACHE_CASE_RESOLUTION_STATS["fallbacks"] += 1
        return model_name

    cache_dir = _hf_hub_cache_dir()
    if not cache_dir.exists():
        _CACHE_CASE_RESOLUTION_STATS["fallbacks"] += 1
        return model_name

    expected_dir = f"models--{model_name.replace('/', '--')}"
    expected_lower = expected_dir.lower()
    memo_key = model_name.lower()

    # Always check the exact-case path first so a newly-appeared exact match
    # wins over any previously memoized variant.
    exact_path = cache_dir / expected_dir
    if exact_path.is_dir():
        if use_memo:
            _CACHE_CASE_RESOLUTION_MEMO[memo_key] = model_name
        _CACHE_CASE_RESOLUTION_STATS["exact_hits"] += 1
        return model_name

    # Validate memoized entries still exist on disk before returning them.
    if use_memo:
        cached = _CACHE_CASE_RESOLUTION_MEMO.get(memo_key)
        if cached is not None:
            cached_path = cache_dir / f"models--{cached.replace('/', '--')}"
            if cached_path.is_dir():
                _CACHE_CASE_RESOLUTION_STATS["memo_hits"] += 1
                return cached
            # Stale entry -- drop it and re-scan below.
            _CACHE_CASE_RESOLUTION_MEMO.pop(memo_key, None)

    try:
        candidates: list[str] = []
        # Use os.scandir for better performance on large cache directories
        with os.scandir(cache_dir) as it:
            for entry in it:
                if not entry.is_dir():
                    continue
                if entry.name.lower() == expected_lower:
                    # Extract the repo part (everything after 'models--')
                    repo_part = entry.name[8:]
                    candidates.append(repo_part.replace("--", "/"))

        if candidates:
            # Deterministic tie-break if multiple case variants coexist.
            resolved = sorted(candidates)[0]
            if len(candidates) > 1:
                _CACHE_CASE_RESOLUTION_STATS["tie_breaks"] += 1
            _CACHE_CASE_RESOLUTION_STATS["variant_hits"] += 1
            if use_memo:
                _CACHE_CASE_RESOLUTION_MEMO[memo_key] = resolved
            return resolved
    except Exception as exc:
        _CACHE_CASE_RESOLUTION_STATS["errors"] += 1
        logger.debug(f"Could not resolve cached repo_id case for '{model_name}': {exc}")

    _CACHE_CASE_RESOLUTION_STATS["fallbacks"] += 1
    return model_name
References
  1. Avoid using broad, silent exception handlers like except Exception: pass. Instead, log the exception, even if at a debug level, to aid in future debugging.
  2. A simple dictionary cache without locks is acceptable for idempotent operations where the risk of benign race conditions and negligible memory growth is acceptable.

@danielhanchen danielhanchen merged commit a29b4e2 into unslothai:main Apr 3, 2026
1 check passed
danielhanchen added a commit that referenced this pull request Apr 3, 2026
Tests for resolve_cached_repo_id_case and get_model_config case
resolution, separated from the runtime changes in PR #4822.
danielhanchen added a commit that referenced this pull request Apr 6, 2026
…#4853)

* Add vision detection cache to is_vision_model() to avoid redundant subprocess spawns

is_vision_model() is called 4-5 times per training run for the same model
with zero caching. For transformers 5.x models, each call spawns a full
subprocess (~6s each). This adds a module-level _vision_detection_cache dict
following the same pattern as the existing _audio_detection_cache used by
detect_audio_type(). The function is refactored into a thin cache wrapper
around _is_vision_model_uncached(), saving ~12s per training run.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Include hf_token in vision cache key for gated model correctness

Cache key is now (model_name, hf_token) instead of just model_name.
This prevents stale False results when an unauthenticated probe for a
gated model is followed by an authenticated call.

* Remove test file from main PR - will be submitted separately

* Fix vision cache: normalize model names and skip caching transient failures

- Normalize model names in cache key using resolve_cached_repo_id_case()
  to avoid duplicate entries for different casings of the same HF repo
  (aligns with case normalization from #4822)
- Return None instead of False on transient failures (network errors,
  subprocess timeouts, HF API issues) so the cache layer can distinguish
  "definitely not a vision model" from "failed to check"
- Only cache definitive True/False results; transient failures are retried
  on the next call instead of being permanently locked in as False

* Refine failure handling: cache deterministic failures, guard normalization

- Subprocess non-zero exit, JSON errors, and general exceptions return
  False (deterministic, cached) instead of None (retryable). Only
  subprocess.TimeoutExpired returns None since timeouts are transient.
- Wrap cache key normalization in try/except so resolve_cached_repo_id_case
  or normalize_path failures fall back to raw model_name instead of
  crashing callers.

* Harden vision detection cache: fix transient failure handling, thread safety, token security

- All subprocess failure paths now return None (transient) instead of False,
  preventing permanent misclassification of VLMs after temporary HF/auth/network errors
- Use SHA256 fingerprint for hf_token in cache key instead of raw bearer token
- Add threading.Lock with double-checked locking to prevent thundering herd
  of concurrent subprocess spawns for the same uncached model
- Distinguish permanent failures (RepositoryNotFoundError, GatedRepoError,
  ValueError) from transient ones in _is_vision_model_uncached
- Pass resolved/normalized model name to detection (not just cache key)
- Log normalization fallback at debug level instead of silent swallow
- Thread hf_token through callers in routes/models.py and trainer.py
  that previously omitted it

* Refine lock strategy and token fingerprint

- Move detection computation outside the lock to avoid serializing
  long-running subprocess spawns (60s timeout) and HF API calls across
  all concurrent model checks. Lock is now only held for cache writes.
- Use full SHA256 digest for token fingerprint instead of truncated
  16-char prefix to eliminate collision risk.

* Fix huggingface_hub import fallback and use atomic cache read

- Add fallback import path for RepositoryNotFoundError/GatedRepoError
  from huggingface_hub.utils (older hub versions) when .errors is
  not available
- Use sentinel-based dict.get() for single atomic cache read instead
  of two-step in/[] pattern (future-proof for no-GIL runtimes)

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
shibizhao pushed a commit to shibizhao/unsloth-npu that referenced this pull request Apr 7, 2026
…nslothai#4822)

* fix(studio): reuse HF cached repo casing to prevent duplicate downloads

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Move cache case resolution tests to separate PR

Tests for resolve_cached_repo_id_case and get_model_config case resolution
belong in their own PR to keep this change focused on the runtime fix.

* fix(studio): debug-log HF_HUB_CACHE fallback in path_utils

* Fix stale memoization in resolve_cached_repo_id_case

- Check exact-case path before memo to ensure a newly-appeared exact
  match always wins over a previously memoized variant
- Validate memoized entries still exist on disk before returning them
  to prevent stale results when cache dirs are deleted/recreated

* Minor cleanups for cache case resolution

- Use .is_dir() instead of .exists() for exact-case cache check
  (cache entries are always directories)
- Remove redundant fallback in _detect_audio_from_tokenizer since
  get_cache_path already handles case resolution and returns None
  when the model is not cached

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
shibizhao pushed a commit to shibizhao/unsloth-npu that referenced this pull request Apr 7, 2026
…ai#4823)

Tests for resolve_cached_repo_id_case and get_model_config case
resolution, separated from the runtime changes in PR unslothai#4822.
shibizhao pushed a commit to shibizhao/unsloth-npu that referenced this pull request Apr 7, 2026
…unslothai#4853)

* Add vision detection cache to is_vision_model() to avoid redundant subprocess spawns

is_vision_model() is called 4-5 times per training run for the same model
with zero caching. For transformers 5.x models, each call spawns a full
subprocess (~6s each). This adds a module-level _vision_detection_cache dict
following the same pattern as the existing _audio_detection_cache used by
detect_audio_type(). The function is refactored into a thin cache wrapper
around _is_vision_model_uncached(), saving ~12s per training run.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Include hf_token in vision cache key for gated model correctness

Cache key is now (model_name, hf_token) instead of just model_name.
This prevents stale False results when an unauthenticated probe for a
gated model is followed by an authenticated call.

* Remove test file from main PR - will be submitted separately

* Fix vision cache: normalize model names and skip caching transient failures

- Normalize model names in cache key using resolve_cached_repo_id_case()
  to avoid duplicate entries for different casings of the same HF repo
  (aligns with case normalization from unslothai#4822)
- Return None instead of False on transient failures (network errors,
  subprocess timeouts, HF API issues) so the cache layer can distinguish
  "definitely not a vision model" from "failed to check"
- Only cache definitive True/False results; transient failures are retried
  on the next call instead of being permanently locked in as False

* Refine failure handling: cache deterministic failures, guard normalization

- Subprocess non-zero exit, JSON errors, and general exceptions return
  False (deterministic, cached) instead of None (retryable). Only
  subprocess.TimeoutExpired returns None since timeouts are transient.
- Wrap cache key normalization in try/except so resolve_cached_repo_id_case
  or normalize_path failures fall back to raw model_name instead of
  crashing callers.

* Harden vision detection cache: fix transient failure handling, thread safety, token security

- All subprocess failure paths now return None (transient) instead of False,
  preventing permanent misclassification of VLMs after temporary HF/auth/network errors
- Use SHA256 fingerprint for hf_token in cache key instead of raw bearer token
- Add threading.Lock with double-checked locking to prevent thundering herd
  of concurrent subprocess spawns for the same uncached model
- Distinguish permanent failures (RepositoryNotFoundError, GatedRepoError,
  ValueError) from transient ones in _is_vision_model_uncached
- Pass resolved/normalized model name to detection (not just cache key)
- Log normalization fallback at debug level instead of silent swallow
- Thread hf_token through callers in routes/models.py and trainer.py
  that previously omitted it

* Refine lock strategy and token fingerprint

- Move detection computation outside the lock to avoid serializing
  long-running subprocess spawns (60s timeout) and HF API calls across
  all concurrent model checks. Lock is now only held for cache writes.
- Use full SHA256 digest for token fingerprint instead of truncated
  16-char prefix to eliminate collision risk.

* Fix huggingface_hub import fallback and use atomic cache read

- Add fallback import path for RepositoryNotFoundError/GatedRepoError
  from huggingface_hub.utils (older hub versions) when .errors is
  not available
- Use sentinel-based dict.get() for single atomic cache read instead
  of two-step in/[] pattern (future-proof for no-GIL runtimes)

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants