Add tests for is_vision_model() caching behaviour by danielhanchen · Pull Request #6 · unslothai/unsloth-staging-1

danielhanchen · 2026-04-06T05:20:56Z

Staging mirror of unslothai#4855

Original PR: unslothai#4855
Author: @danielhanchen

This is a staging copy for review and editing. Once finalized, changes will be pushed back to the original PR.

Original description

Summary

- Adds 11 pytest tests for the is_vision_model() caching layer introduced in unsloth#4853 ([Studio][Optimization] Add vision detection cache to is_vision_model()).
- Tests cover cache hits and misses, caching of False results, the subprocess path, exception handling, the direct detection path, audio exclusion, and token handling.
- Separated from unsloth#4853 to keep the main PR focused on the implementation change.

Test plan

- Run pytest studio/backend/tests/test_vision_cache.py -v after merging unsloth#4853.
- These tests depend on the _vision_detection_cache and _is_vision_model_uncached exports from unsloth#4853.

- Remove unused _make_config() helper function (dead code)
- Fix test_exception_result_cached to actually exercise the exception path by mocking load_model_config to raise OSError instead of using side_effect=[False], which only tested normal False returns
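The difference between the two mock setups can be seen in a small sketch. `detect` and its `load_config` argument are hypothetical stand-ins for the real detection code and `load_model_config`; only the `unittest.mock` behaviour is taken as given:

```python
from unittest.mock import MagicMock

paths_taken = []


def detect(load_config, name):
    """Hypothetical detector: a failed config load counts as 'not a vision model'."""
    try:
        config = load_config(name)
    except OSError:
        paths_taken.append("exception")
        return False
    paths_taken.append("normal")
    return bool(config)


# side_effect=[False] makes the mock *return* False; nothing is raised.
returns_false = MagicMock(side_effect=[False])
normal_result = detect(returns_false, "org/model")

# side_effect=OSError(...) makes the mock *raise*, which reaches the except branch.
raises_oserror = MagicMock(side_effect=OSError("config not found"))
exception_result = detect(raises_oserror, "org/model")
```

Both calls return False, so a test that only checks the return value cannot tell the two paths apart; only the second setup actually runs the exception handler.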


Use MagicMock(spec=[]) for all config mocks so hasattr() only returns True for explicitly set attributes. Without this, MagicMock defaults make all hasattr checks truthy, allowing tests to pass via unintended detection paths (e.g. img_processor instead of vision_config).
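A short sketch of the `hasattr()` behaviour described above. The attribute names come from the commit message; the config values are placeholders:

```python
from unittest.mock import MagicMock

loose = MagicMock()
# A default MagicMock fabricates any attribute on access, so hasattr() is always True.
loose_has_img_processor = hasattr(loose, "img_processor")

strict = MagicMock(spec=[])
# With an empty spec, unknown attributes raise AttributeError, so hasattr() is False...
strict_has_img_processor = hasattr(strict, "img_processor")

# ...until the test sets the attribute explicitly (spec, unlike spec_set, allows this).
strict.vision_config = {"hidden_size": 32}
strict_has_vision_config = hasattr(strict, "vision_config")
```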

+from utils.models.model_config import (
+    is_vision_model,
+    _is_vision_model_uncached,
+    _vision_detection_cache,
+)


…for PR unslothai#5754

Round 7 reviewer surfaced a handful of swap-window races, fail-open guards, and seed precision mismatches. This commit closes them.

Lifecycle / state (P1)
* core/inference/diffusion.py: status() now emits active_repo_id, active_base_repo, pending_repo_id, pending_base_repo, and pending_gguf_filename alongside the existing UI-facing fields. During a swap (model A loaded, model B loading) the previous coalesced 'repo_id or pending_repo_id' hid the loading target from delete guards. Splitting the fields lets guards block deletion of either repo currently owned by the backend.
* core/inference/diffusion.py: generate_image() now takes _generate_lock BEFORE snapshotting _pipe / _device. Snapshotting outside the lock let a concurrent unload/load clear or replace the backend between the snapshot and the forward, so the freed or swapped pipeline would still run.

Symmetric handoffs (P1)
* routes/export.py: training-active check now runs BEFORE the chat / inference / diffusion unload helpers, so a 409 does not leave the user's chat session torn down for nothing. Also explicitly fails CLOSED with 503 when is_training_active() raises.
* routes/inference.py: _raise_if_training_active now fails closed with 503 when the training backend is importable but its status check raises. The previous best-effort log-and-continue could let chat / diffusion loads collide with unverifiable training.

Delete guards (P1)
* routes/models.py /delete-cached: chat guard now also blocks when llama-server is_active (i.e. mid-download) and when the inference backend's loading_models set contains the target. Round 7 review #7 flagged that the PR's diffusion-side loading guard had no chat-side parallel, so deleting a chat repo while it was downloading could still race the cache.
* routes/models.py /delete-cached: diffusion guard iterates the new active_* + pending_* status fields so a delete during a swap is refused on either repo.
* routes/models.py /delete-finetuned: same active_* + pending_* handling, plus the guard now also refuses deletes of a parent directory that contains the loaded pipeline (round 7 review #6: rm -rf /exports/flux-model/ could unlink model_index.json that the live pipeline is reading via mmap).

Seed precision (P2)
* models/inference.py + routes/inference.py: DiffusionGenerateResponse now carries seed_str alongside the existing numeric seed. Seeds above Number.MAX_SAFE_INTEGER are rounded by JSON.parse in the browser; seed_str ships full decimal precision for display and reproduction.
* frontend/api.ts: DiffusionGenerateResponse types seed_str; images-page.tsx prefers seed_str over seed in the figure caption so the displayed value reproduces the image.
* frontend/api.ts: stringifyWithBigInt no longer regex-replaces sentinel strings over the full JSON output. It pulls the seed BigInt out, JSON-serialises the remaining payload, and splices the seed's decimal digits into the resulting object literal at the known position. Avoids the round 7 #10 case where a user-supplied prompt equal to '__bigint__:123' was rewritten into a JSON integer and rejected as a non-string prompt.

Custom HF repo (P2)
* frontend/images-page.tsx: custom panel now exposes a 'Base diffusers repo' input that maps to DiffusionLoadRequest.base_repo. Required when a private / mirrored GGUF needs a non-default base (e.g. a 9B Klein transformer would otherwise fall back to the 4B base default).
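The seed-precision item can be illustrated with a rough sketch. `build_response` is a hypothetical helper, not the actual DiffusionGenerateResponse model, and `float()` is used here only to mimic how a JavaScript client stores JSON numbers as IEEE 754 doubles:

```python
import json

MAX_SAFE_INTEGER = 2**53 - 1  # same value as JavaScript's Number.MAX_SAFE_INTEGER


def build_response(seed: int) -> dict:
    """Hypothetical response builder: numeric seed plus a lossless decimal string."""
    return {"seed": seed, "seed_str": str(seed)}


seed = 2**63 + 1  # well above MAX_SAFE_INTEGER
payload = json.loads(json.dumps(build_response(seed)))

# A double cannot represent this integer exactly, so the numeric field is rounded.
seed_as_double = int(float(payload["seed"]))
# The string field survives the round trip digit for digit.
seed_from_str = int(payload["seed_str"])
```

Python's own `json` module keeps integers exact, which is why the rounding only shows up once the value is forced through a double, as it is in the browser.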

…guard for PR unslothai#5754

Round 9 reviewer flagged a pile of handoff asymmetries: every GPU-owning lifecycle change (training, export, chat, images) needed its own bespoke unload sequence and they had drifted out of sync. Some skipped llama-server is_active; some missed safetensors loading_models; export and training did not check is_export_active.

Backend handoff (P1)
* routes/inference.py: new _release_chat_for / _release_export_for helpers. Both treat llama-server as held when is_loaded OR is_active, safetensors as held when active_model_name OR loading_models is non-empty, and export as held when current_checkpoint OR is_export_active. Both helpers run their unloads in worker threads so async routes do not block the event loop.
* routes/training.py: replaces its bespoke inline llama / safe / export unload sequence with await _release_chat_for / _release_export_for.
* routes/export.py: same swap for the chat unload chain (export still does NOT call _release_export_for on itself).
* routes/inference.py GGUF + standard chat-load paths: now use _release_export_for to drop a settled export, and the standard path's llama unload now also handles is_active=True (round 9 review #8).

Backend reject-on-active export (P1 #5)
* routes/inference.py: new _raise_if_export_active. Symmetric with _raise_if_training_active: a long-running export is refused with HTTP 409 instead of being silently killed when /images/load or /load arrives. Diffusion / images load and both chat-load paths call it.
* core/inference/diffusion.py _release_other_gpu_owners_for_diffusion: no longer tears down an in-flight export job. Only drops a SETTLED export checkpoint (current_checkpoint populated, is_export_active False). Round 9 review #5 -- the previous behavior could terminate an in-flight export and leave a partial output artifact.

Token leak via logger.exception (P1 #6)
* core/inference/diffusion.py: load-failure logging now uses logger.error(..., exc_msg) with the already-scrubbed string and exc_info=False. logger.exception() with the raw Exception would expose any hf_... token that diffusers / huggingface_hub embedded in the message or traceback locals, defeating the earlier in-flight scrub.

Dependency pinning (P1 #11)
* pyproject.toml: huggingfacenotorch optional extra now pins diffusers>=0.37.0. Previously the floor was only set in studio/backend/requirements/no-torch-runtime.txt, so a normal pip install would resolve diffusers 0.36.0 (no Flux2KleinPipeline) and the default curated FLUX.2 klein Images model would fail at runtime.

Cache-delete exact match (P1 #14)
* routes/models.py /delete-cached: llama.cpp and safetensors guards now match on exact repo-id (case-insensitive) instead of prefix. Diffusion guard already does this; the chat guards were the remaining surface where loading org/model-v2 blocked deleting org/model.
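As a sketch of the logging change under "Token leak via logger.exception": the snippet below scrubs the message first and then logs it without traceback information. The `scrub` helper and the token regex are assumptions for illustration, not the project's actual scrubber:

```python
import io
import logging
import re

# Assumed token shape for illustration; real Hugging Face tokens may differ.
_HF_TOKEN_RE = re.compile(r"hf_[A-Za-z0-9]+")


def scrub(text: str) -> str:
    """Hypothetical redaction helper."""
    return _HF_TOKEN_RE.sub("hf_***", text)


stream = io.StringIO()
logger = logging.getLogger("diffusion_sketch")
logger.addHandler(logging.StreamHandler(stream))
logger.setLevel(logging.ERROR)
logger.propagate = False

try:
    raise OSError("401 for https://hf_abc123DEF@huggingface.co/org/model")
except OSError as exc:
    exc_msg = scrub(str(exc))
    # logger.exception(...) would append the traceback, whose last line repeats
    # the raw, unscrubbed exception message.
    logger.error("Diffusion load failed: %s", exc_msg, exc_info=False)

logged = stream.getvalue()
```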

P1 #1: ``_release_llama_for()`` now verifies ``llama.unload_model`` did not return False AND that ``is_loaded`` / ``is_active`` / ``loading_model_identifier`` are all cleared after the call. The previous version only treated raised exceptions as failure, so a subprocess refusing to terminate or an in-flight GGUF download let the next workload allocate on top.
P1 #2: ``DiffusionBackend._release_other_gpu_owners_for_diffusion`` now raises RuntimeError when ``exp._shutdown_subprocess`` fails on a settled checkpoint. Direct backend callers used to log at debug level and proceed toward diffusion allocation while the export checkpoint still owned VRAM.
P1 #3 + P1 #7: ``/images/load`` no longer drops chat + idle export before the cheap backend validation runs. ``DiffusionBackend.load_model`` already calls the strict ``_release_other_gpu_owners_for_diffusion`` and ``_release_chat_backend_for_diffusion`` helpers AFTER family inference and GGUF filename checks pass, so the GPU is still freed before allocation and a malformed payload no longer silently unloads the user's chat / chat-export pair.
P1 #4: ``_release_chat_backend_for_diffusion`` now also rejects a post-unload state where ``loading_model_identifier`` is still set, matching the route-level ``_release_llama_for`` strictness. A GGUF download mid-flight before the diffusion handoff used to slip through and end up double-owning VRAM after diffusion allocated.
P1 #5: ``_release_diffusion_for`` no longer swallows a post-unload ``status()`` failure as ``after = {}``. Training / chat / export handoffs need proof that the diffusion pipeline released VRAM; the helper now raises HTTP 503 when the verification status call itself raises, so the caller retries.
P1 #6: ``DiffusionBackend._release_other_gpu_owners_for_diffusion`` raises RuntimeError when ``get_export_backend()`` itself raises. Direct backend callers used to silently ``return`` here and proceed to GPU allocation without being able to verify export ownership.
P1 #8: ``/training/start`` releases settled export BEFORE chat, matching the chat-load helpers. If idle export shutdown fails the user's chat model is preserved instead of being dropped for a training run that never starts.
P2 #9: GGUF load-error scrubber also collapses ``local_gguf_path``, the resolved HF cache path passed to ``transformer_cls.from_single_file()``. Without this an exception like ``OSError: cannot load /home/alice/.cache/huggingface/.../flux.gguf`` would leak the operator's filesystem layout through ``last_error`` and ``/images/status``.
All 85 diffusion-relevant backend tests pass locally.
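The post-unload verification described in P1 #1 and P1 #4 can be sketched roughly as follows. `FakeLlama` and `release_llama` are hypothetical; only the three state fields and the "returned False" condition come from the commit message:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeLlama:
    """Hypothetical stand-in for the llama-server backend state."""
    is_loaded: bool = False
    is_active: bool = False
    loading_model_identifier: Optional[str] = None
    unload_result: bool = True

    def unload_model(self) -> bool:
        # Deliberately leaves state untouched to model a backend that
        # reports success while still holding resources.
        return self.unload_result


def release_llama(llama: FakeLlama) -> None:
    """Raise unless the unload succeeded AND every ownership flag is cleared."""
    if llama.unload_model() is False:
        raise RuntimeError("unload_model returned False")
    if llama.is_loaded or llama.is_active or llama.loading_model_identifier:
        raise RuntimeError("backend still active or loading after unload")


release_llama(FakeLlama())  # clean state passes silently

try:
    release_llama(FakeLlama(loading_model_identifier="org/model-GGUF"))
    still_loading_rejected = False
except RuntimeError:
    still_loading_rejected = True
```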

P1 #1: ``_release_safetensors_chat_for`` now re-reads ``active_model_name`` and ``loading_models`` after each unload AND runs a final sweep against the initial owned-name set. The previous helper trusted ``unload_model() -> True`` even though the orchestrator can respond ``unloaded`` while still holding weights or a concurrent ``load`` can repopulate the tracker between calls. Per-name and global post-state mismatches now raise HTTP 503 so the caller retries.
P1 #2: same post-state guarantee inside ``_release_chat_backend_for_diffusion`` for direct backend callers. ``DiffusionBackend.load_model`` now raises RuntimeError when the safetensors tracker still owns a previously-resident name after the unload, matching the route-level helper. The route layer's existing classifier maps the new wording to HTTP 503.
P1 #3: ``DiffusionBackend.load_model`` now preflights the full diffusers repo (or explicit GGUF ``base_repo``) via ``hf_hub_download(filename="model_index.json")`` BEFORE the chat / export unload runs. The GGUF path was already covered by the existing ``hf_hub_download(gguf_filename)`` round-trip; the full-repo path used to skip validation and let a typo / private / gated repo only surface inside ``from_pretrained`` AFTER the user's chat model was already dropped. Local paths are checked structurally (must be a directory containing ``model_index.json``) so we do not network-round-trip for an on-disk miss. Error messages route through ``_display_repo_id`` so an absolute filesystem path does not leak the operator's layout.
P1 #6: ``/api/inference/unload`` (the direct chat unload endpoint) now treats ``unload_model() -> False`` AND a leftover state (``is_loaded`` / ``is_active`` / ``loading_model_identifier`` for GGUF, ``active_model_name`` / ``loading_models`` for safetensors) as 503 instead of unconditionally responding ``status="unloaded"``. The UI used to show the model as gone while the backend still owned VRAM.
P2 #7: extended the /images/load RuntimeError -> HTTPException marker list with ``still active or loading after unload`` and ``still loading after unload``. Round 18 introduced these exact phrasings on the backend side; without the extension a retryable unload failure was returning HTTP 400 to the user instead of 503.
P2 #8: removed the unused ``unsloth_backend = get_inference_backend()`` eager construction in the GGUF chat-load branch. Eager construction made the GGUF-only path needlessly fail or pay startup cost when the safetensors backend was unavailable / lazy; ``_release_safetensors_chat_for`` already handles that case as a no-op.
All 85 diffusion-relevant + 98 related backend tests pass locally.
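The structural check for local paths in P1 #3 might look something like the sketch below. `preflight_local_repo` is a hypothetical name, and the remote `hf_hub_download` branch is intentionally left out; only the "directory containing model_index.json" rule and the leaf-only error message follow the commit text:

```python
import tempfile
from pathlib import Path


def preflight_local_repo(path: str) -> None:
    """Hypothetical check: a local diffusers repo must be a directory with model_index.json."""
    root = Path(path)
    if not root.is_dir() or not (root / "model_index.json").is_file():
        # Report only the leaf name so the absolute layout is not exposed.
        raise ValueError(
            f"'{root.name}' is not a diffusers repo (missing model_index.json)"
        )


with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / "my-flux"
    good.mkdir()
    (good / "model_index.json").write_text("{}")
    preflight_local_repo(str(good))  # passes without touching the network

    bad = Path(tmp) / "empty-model"
    bad.mkdir()
    try:
        preflight_local_repo(str(bad))
        error_message = ""
    except ValueError as exc:
        error_message = str(exc)
```

Running a check like this before any unload means a bad path fails while the user's chat model is still resident.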

P1 #1: ``_preflight_full_diffusers_repo(effective_base, hf_token)`` now runs for every load mode, including the GGUF-with-auto-base path. Round 19 only preflighted the full repo or an explicit ``base_repo``, so an auto-picked companion that turned out to be gated / private / missing still unloaded the user's chat model before ``from_pretrained`` failed. ``effective_base`` is the same value that feeds every downstream allocation, so preflighting it unconditionally catches all three modes.
P1 #2: ``diffusers.GGUFQuantizationConfig`` (which imports the ``gguf`` package at construction time) is now built up front, inside the same try block that surfaces "Re-run Studio setup". Previously the missing-dependency exception fired AFTER ``_release_other_gpu_owners_for_diffusion`` and ``_release_chat_backend_for_diffusion`` had already taken the chat / export models down. The downstream from_single_file call reuses the same ``quant_config`` reference.
P1 #4: ``studio/backend/requirements/studio.txt`` now lists ``diffusers>=0.37.0`` and ``gguf>=0.10.0``. These were only in the extras files, so fresh standard Studio installs failed on /images/load with the round 20 P1 #2 dependency error message.
P1 #5: ``LoadRequest``, ``UnloadRequest``, and ``ValidateModelRequest`` now apply the same control-character + embedded-HF-token validators that ``DiffusionLoadRequest`` already had. /api/inference/load, /api/inference/validate, and /api/inference/unload used to accept newline / tab / control characters in ``model_path`` (log-line smuggling) and URL-form ``https://hf_xxxxx@huggingface.co/...`` (credential leak through structured log sinks).
P2 #6: ``_collapse_local`` in the diffusion load-error scrubber now resolves relative candidates and adds the absolute form to the substring set. A relative ``exports/my-flux`` used to leak ``/mnt/disks/.../exports/my-flux/...`` via downstream library errors because the scrubber only matched the original literal. Replacement is longest-first so a leaf-only context survives.
All 85 diffusion-relevant + 35 related model-validation tests pass locally.
(P1 #3 cross-workload GPU handoff lock is deferred: deserves a focused design pass across /images/load, /chat/load (both branches), /training/start, and /export/load to pick a lock boundary that does not deadlock against the backend load locks or stall the SSE log stream.)
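One way to read the longest-first replacement in P2 #6 is sketched below. `collapse_local` here is a simplified, hypothetical version of the scrubber, and the explanation in the comment is an interpretation rather than something the commit message states:

```python
import os


def collapse_local(message: str, candidate: str) -> str:
    """Hypothetical scrubber: replace a local path (relative and absolute) with its leaf."""
    leaf = os.path.basename(os.path.normpath(candidate))
    forms = {candidate, os.path.abspath(candidate)}
    # Longest first: if the relative form were replaced first, it would match
    # inside the absolute form and leave the absolute prefix in the message.
    for form in sorted(forms, key=len, reverse=True):
        message = message.replace(form, leaf)
    return message


relative = os.path.join("exports", "my-flux")
absolute = os.path.abspath(relative)
raw = f"OSError: cannot load {os.path.join(absolute, 'model.gguf')}"
scrubbed = collapse_local(raw, relative)
```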

P1 #1 + #2: ``LoadRequest._no_embedded_hf_tokens`` and ``ValidateModelRequest._no_embedded_hf_tokens`` now cover ``gguf_variant`` in addition to ``model_path``. A caller could pass a variant like ``Q4_K_M-hf_xxxxxxxx`` that flowed into structured log sinks via the GGUF resolver path; the matching ``DiffusionLoadRequest`` validator already covered every string field, so this restores parity.
P1 #3: ``/api/inference/unload`` now also matches the llama ``loading_model_identifier`` when picking the GGUF branch. A pending GGUF download (``is_active`` still False, ``loading_model_identifier`` populated) used to fall through to the safetensors branch and respond ``status="unloaded"`` while llama-server kept downloading.
P1 #4 + #5: the final safetensors-handoff sweeps (route-level ``_release_safetensors_chat_for`` and backend ``_release_chat_backend_for_diffusion``) now check ``active_model_name`` and ``loading_models`` WITHOUT the initial ``owned_names`` filter. A concurrent ``/load`` that landed AFTER the snapshot was previously ignored, so a chat model that began loading during the unload window let training / export / GGUF chat / diffusion start anyway and race the new chat for VRAM.
P2 #6: added ``_preflight_diffusers_subfolder_config`` and invoked it for GGUF loads with a transformer class (``effective_base``, ``"transformer"``). A custom base companion that had ``model_index.json`` but lacked ``transformer/config.json`` previously passed the round 19 preflight, unloaded chat, then failed inside ``from_single_file``.
P2 #7: ``_scrub_validation_obj`` in main.py also scrubs string dict KEYS. Pydantic ``string_type`` errors surface ``input`` verbatim, and a malformed payload like ``{"repo_id": {"hf_xxxxx": "owner/repo"}}`` would otherwise leak the token through the 422 response body.
All 85 diffusion-relevant + 35 model-validation tests pass locally. Existing fakes for ``hf_hub_download`` updated to accept the new ``subfolder=`` kwarg the round 21 preflight uses.
(P1 #3 cross-workload GPU handoff lock from round 20 is still deferred; round 21's P1 #4 / #5 raised the sweep-level guarantee, which closes the most common race without the deadlock risk of holding a process-wide lock across the entire load.)
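The key-scrubbing behaviour from P2 #7 can be sketched with a small recursive helper. `scrub_obj` and the token regex are assumptions for illustration; the real ``_scrub_validation_obj`` may differ in both shape and redaction text:

```python
import re

# Assumed token shape for illustration; real Hugging Face tokens may differ.
_HF_TOKEN_RE = re.compile(r"hf_[A-Za-z0-9]+")


def scrub_obj(obj):
    """Recursively redact token-like substrings in strings, list items, and dict keys/values."""
    if isinstance(obj, str):
        return _HF_TOKEN_RE.sub("hf_***", obj)
    if isinstance(obj, dict):
        return {
            (scrub_obj(key) if isinstance(key, str) else key): scrub_obj(value)
            for key, value in obj.items()
        }
    if isinstance(obj, (list, tuple)):
        return [scrub_obj(item) for item in obj]
    return obj


# Malformed payload from the commit message: the token sits in a dict KEY.
error_input = {"repo_id": {"hf_xxxxx": "owner/repo"}}
scrubbed = scrub_obj(error_input)
```

A scrubber that only visited dict values would return this payload unchanged, which is the leak the commit closes.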

P1 #1 + #2 + #6: extended the chat / diffusion / training identifier hardening to every export-side request model. ExportCommonOptions (parent of ExportMergedModelRequest / ExportBaseModelRequest / ExportLoRAAdapterRequest) now applies _no_control_chars and _reject_embedded_hf_token to repo_id and base_model_id; ExportGGUFRequest gets the same on its repo_id plus a control-char check on quantization_method; and LoadCheckpointRequest validates checkpoint_path. Previously "/api/export/*" accepted newline-smuggled identifiers and URL-form ``hf_xxxxx`` tokens that flowed into log lines.
P1 #3 + #4: ``_run_with_helper`` and ``_run_multi_pass_advisor`` now use a shared ``_gpu_workload_busy_for_helper`` that gates on diffusion (round 22 already), training, AND export. The round 22 guard only checked diffusion, so the dataset helper / advisor could still load llama-server on top of an active training run or a resident export checkpoint. Each step fails closed (unverifiable status counts as busy) so the user's primary workload is preserved.
P1 #5: PublishDatasetRequest in models/data_recipe.py also applies the identifier hardening to repo_id; the publish path previously accepted control characters and URL-form tokens.
P1 #7-10: added _validate_logged_identifier helper to routes/models.py and applied it to the path / query parameter endpoints that flow into logger.info(...) calls -- ``/config/{model_name}``, ``/check-vision/{model_name}``, ``/check-embedding/{model_name}``, ``/gguf-variants``. Mapped the validator's ValueError to HTTP 422 so the client sees the same shape as a Pydantic validation failure.
P2 #11 + #12: ``Loading diffusion model %s`` and ``Diffusion load failed for %s`` log lines route ``repo_id`` / ``effective_base`` through ``_display_repo_id`` (collapses absolute local paths to the leaf, still scrubs HF tokens) instead of plain ``_redact_hf_tokens``. The error path was already collapsed in the user-facing 400 / RuntimeError, but the structured-log lines kept the full path.
All 97 diffusion + training-validation + related tests pass locally.
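A rough sketch of the identifier hardening referred to above. `validate_logged_identifier` is a simplified, hypothetical take on ``_validate_logged_identifier``; both regexes are assumptions, and a pattern this broad would also reject a legitimate repo whose name happens to contain ``hf_`` followed by alphanumerics:

```python
import re

_CONTROL_CHARS_RE = re.compile(r"[\x00-\x1f\x7f]")  # newline, tab, other C0 controls, DEL
_HF_TOKEN_RE = re.compile(r"hf_[A-Za-z0-9]+")       # assumed token shape


def validate_logged_identifier(value: str, field: str) -> str:
    """Raise ValueError for values that are unsafe to write into a log line."""
    if _CONTROL_CHARS_RE.search(value):
        raise ValueError(f"{field} must not contain control characters")
    if _HF_TOKEN_RE.search(value):
        raise ValueError(f"{field} must not embed a Hugging Face token")
    return value


def is_rejected(value: str) -> bool:
    try:
        validate_logged_identifier(value, "model_name")
    except ValueError:
        return True
    return False


smuggled_rejected = is_rejected("org/model\nINFO forged log line")
token_rejected = is_rejected("https://hf_abc123@huggingface.co/org/model")
plain_rejected = is_rejected("unsloth/Llama-3.2-1B")
```

In a route handler the ValueError would then be mapped to HTTP 422, as the commit describes.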

P1 #1: ``_gpu_workload_busy_for_helper`` in ``utils/datasets/llm_assist.py`` now also gates on the GGUF chat backend (llama-server) AND the safetensors chat backend. Round 23 extended it to training + export but missed Chat, so a helper / advisor GGUF could still race a loaded chat model for VRAM. Both checks fail closed when status is unverifiable.
P1 #2 / #3 / #4 / #5: re-ordered the route-level GPU-handoff unloads so the diffusion release runs BEFORE the chat releases. A wedged diffusion unload used to fire AFTER chat was already gone, so the user lost both on a single failure. Drop chat last so an earlier failure preserves it. Applied to ``/training/start`` (training.py), ``/export/load`` (export.py), ``/chat/load`` GGUF branch and ``/chat/load`` safetensors branch (routes/inference.py).
P1 #7 + P2 #13: ``/delete-finetuned`` body now hardens ``model_path`` and ``gguf_variant`` via the shared ``_validate_logged_identifier`` helper, so control characters and URL-form HF tokens can no longer log-line-smuggle.
P1 #8 + #10: ``/delete-cached`` body hardens ``repo_id`` and ``variant`` the same way.
P1 #9: ``/download-progress`` ``repo_id`` query parameter is also hardened; the value flows into log lines deep inside ``_get_repo_size_cached`` on lookup failure.
P1 #11: ``CheckFormatRequest.dataset_name`` and ``AiAssistMappingRequest.{dataset_name, model_name}`` in ``models/datasets.py`` now apply the same control-char + embedded-HF-token validators, matching every other public request-body model.
All 115 diffusion + training-validation + cached_gguf + export + inference model-validation tests pass locally.
(P1 #6 native-path-lease enforcement for diffusion local paths and P1 #12 React Compiler frontend lint deferred -- both need focused design / frontend touchups separate from this batch.)
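The fail-closed gating in P1 #1 could be sketched along these lines. `gpu_workload_busy` and the status callables are hypothetical; the actual ``_gpu_workload_busy_for_helper`` queries real backends:

```python
from typing import Callable, Iterable


def gpu_workload_busy(checks: Iterable[Callable[[], bool]]) -> bool:
    """Return True if any workload is active OR its status cannot be verified."""
    for check in checks:
        try:
            if check():
                return True
        except Exception:
            # Fail closed: an unverifiable status counts as busy.
            return True
    return False


def broken_status() -> bool:
    raise RuntimeError("status endpoint unavailable")


all_idle = gpu_workload_busy([lambda: False, lambda: False])
one_active = gpu_workload_busy([lambda: False, lambda: True])
unverifiable = gpu_workload_busy([lambda: False, broken_status])
```

The fail-open alternative (treating an exception as "idle") is what would let a helper model load on top of a workload whose status could not be read.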

P1 #1: ``_release_llama_for()`` now verifies ``llama.unload_model`` did not return False AND that ``is_loaded`` / ``is_active`` / ``loading_model_identifier`` are all cleared after the call. The previous version only treated raised exceptions as failure, so a subprocess refusing to terminate or an in-flight GGUF download let the next workload allocate on top. P1 #2: ``DiffusionBackend._release_other_gpu_owners_for_diffusion`` now raises RuntimeError when ``exp._shutdown_subprocess`` fails on a settled checkpoint. Direct backend callers used to log at debug level and proceed toward diffusion allocation while the export checkpoint still owned VRAM. P1 #3 + P1 #7: ``/images/load`` no longer drops chat + idle export before the cheap backend validation runs. ``DiffusionBackend.load_model`` already calls the strict ``_release_other_gpu_owners_for_diffusion`` and ``_release_chat_backend_for_diffusion`` helpers AFTER family inference and GGUF filename checks pass, so the GPU is still freed before allocation and a malformed payload no longer silently unloads the user's chat / chat-export pair. P1 #4: ``_release_chat_backend_for_diffusion`` now also rejects a post-unload state where ``loading_model_identifier`` is still set, matching the route-level ``_release_llama_for`` strictness. A GGUF download mid-flight before the diffusion handoff used to slip through and end up double-owning VRAM after diffusion allocated. P1 #5: ``_release_diffusion_for`` no longer swallows a post-unload ``status()`` failure as ``after = {}``. Training / chat / export handoffs need proof that the diffusion pipeline released VRAM; the helper now raises HTTP 503 when the verification status call itself raises, so the caller retries. P1 #6: ``DiffusionBackend._release_other_gpu_owners_for_diffusion`` raises RuntimeError when ``get_export_backend()`` itself raises. Direct backend callers used to silently ``return`` here and proceed to GPU allocation without being able to verify export ownership. 
P1 #8: ``/training/start`` releases settled export BEFORE chat, matching the chat-load helpers. If idle export shutdown fails the user's chat model is preserved instead of being dropped for a training run that never starts. P2 #9: GGUF load-error scrubber also collapses ``local_gguf_path``, the resolved HF cache path passed to ``transformer_cls.from_single_file()``. Without this an exception like ``OSError: cannot load /home/alice/.cache/huggingface/.../flux.gguf`` would leak the operator's filesystem layout through ``last_error`` and ``/images/status``. All 85 diffusion-relevant backend tests pass locally.

P1 #1: ``_release_safetensors_chat_for`` now re-reads ``active_model_name`` and ``loading_models`` after each unload AND runs a final sweep against the initial owned-name set. The previous helper trusted ``unload_model() -> True`` even though the orchestrator can respond ``unloaded`` while still holding weights or a concurrent ``load`` can repopulate the tracker between calls. Per-name and global post-state mismatches now raise HTTP 503 so the caller retries. P1 #2: same post-state guarantee inside ``_release_chat_backend_for_diffusion`` for direct backend callers. ``DiffusionBackend.load_model`` now raises RuntimeError when the safetensors tracker still owns a previously-resident name after the unload, matching the route-level helper. The route layer's existing classifier maps the new wording to HTTP 503. P1 #3: ``DiffusionBackend.load_model`` now preflights the full diffusers repo (or explicit GGUF ``base_repo``) via ``hf_hub_download(filename="model_index.json")`` BEFORE the chat / export unload runs. The GGUF path was already covered by the existing ``hf_hub_download(gguf_filename)`` round-trip; the full-repo path used to skip validation and let a typo / private / gated repo only surface inside ``from_pretrained`` AFTER the user's chat model was already dropped. Local paths are checked structurally (must be a directory containing ``model_index.json``) so we do not network-round-trip for an on-disk miss. Error messages route through ``_display_repo_id`` so an absolute filesystem path does not leak the operator's layout. P1 #6: ``/api/inference/unload`` (the direct chat unload endpoint) now treats ``unload_model() -> False`` AND a leftover state (``is_loaded`` / ``is_active`` / ``loading_model_identifier`` for GGUF, ``active_model_name`` / ``loading_models`` for safetensors) as 503 instead of unconditionally responding ``status="unloaded"``. The UI used to show the model as gone while the backend still owned VRAM. 
P2 #7: extended the /images/load RuntimeError -> HTTPException marker list with ``still active or loading after unload`` and ``still loading after unload``. Round 18 introduced these exact phrasings on the backend side; without the extension a retryable unload failure was returning HTTP 400 to the user instead of 503. P2 #8: removed the unused ``unsloth_backend = get_inference_backend()`` eager construction in the GGUF chat-load branch. Eager construction made the GGUF-only path needlessly fail or pay startup cost when the safetensors backend was unavailable / lazy; ``_release_safetensors_chat_for`` already handles that case as a no-op. All 85 diffusion-relevant + 98 related backend tests pass locally.

P1 #1: ``_preflight_full_diffusers_repo(effective_base, hf_token)`` now runs for every load mode, including the GGUF-with-auto-base path. Round 19 only preflighted the full repo or an explicit ``base_repo``, so an auto-picked companion that turned out to be gated / private / missing still unloaded the user's chat model before ``from_pretrained`` failed. ``effective_base`` is the same value that feeds every downstream allocation, so preflighting it unconditionally catches all three modes. P1 #2: ``diffusers.GGUFQuantizationConfig`` (which imports the ``gguf`` package at construction time) is now built up front, inside the same try block that surfaces "Re-run Studio setup". Previously the missing-dependency exception fired AFTER ``_release_other_gpu_owners_for_diffusion`` and ``_release_chat_backend_for_diffusion`` had already taken the chat / export models down. The downstream from_single_file call reuses the same ``quant_config`` reference. P1 #4: ``studio/backend/requirements/studio.txt`` now lists ``diffusers>=0.37.0`` and ``gguf>=0.10.0``. These were only in the extras files, so fresh standard Studio installs failed on /images/load with the round 20 P1 #2 dependency error message. P1 #5: ``LoadRequest``, ``UnloadRequest``, and ``ValidateModelRequest`` now apply the same control-character + embedded-HF-token validators that ``DiffusionLoadRequest`` already had. /api/inference/load, /api/inference/validate, and /api/inference/unload used to accept newline / tab / control characters in ``model_path`` (log-line smuggling) and URL-form ``https://hf_xxxxx@huggingface.co/...`` (credential leak through structured log sinks). P2 #6: ``_collapse_local`` in the diffusion load-error scrubber now resolves relative candidates and adds the absolute form to the substring set. A relative ``exports/my-flux`` used to leak ``/mnt/disks/.../exports/my-flux/...`` via downstream library errors because the scrubber only matched the original literal. 
Replacement is longest-first so a leaf-only context survives. All 85 diffusion-relevant + 35 related model-validation tests pass locally. (P1 #3 cross-workload GPU handoff lock is deferred: deserves a focused design pass across /images/load, /chat/load (both branches), /training/start, and /export/load to pick a lock boundary that does not deadlock against the backend load locks or stall the SSE log stream.)

P1 #1 + #2: ``LoadRequest._no_embedded_hf_tokens`` and ``ValidateModelRequest._no_embedded_hf_tokens`` now cover ``gguf_variant`` in addition to ``model_path``. A caller could pass a variant like ``Q4_K_M-hf_xxxxxxxx`` that flowed into structured log sinks via the GGUF resolver path; the matching ``DiffusionLoadRequest`` validator already covered every string field, so this restores parity. P1 #3: ``/api/inference/unload`` now also matches the llama ``loading_model_identifier`` when picking the GGUF branch. A pending GGUF download (``is_active`` still False, ``loading_model_identifier`` populated) used to fall through to the safetensors branch and respond ``status="unloaded"`` while llama-server kept downloading. P1 #4 + #5: the final safetensors-handoff sweeps (route-level ``_release_safetensors_chat_for`` and backend ``_release_chat_backend_for_diffusion``) now check ``active_model_name`` and ``loading_models`` WITHOUT the initial ``owned_names`` filter. A concurrent ``/load`` that landed AFTER the snapshot was previously ignored, so a chat model that began loading during the unload window let training / export / GGUF chat / diffusion start anyway and race the new chat for VRAM. P2 #6: added ``_preflight_diffusers_subfolder_config`` and invoked it for GGUF loads with a transformer class (``effective_base``, ``"transformer"``). A custom base companion that had ``model_index.json`` but lacked ``transformer/config.json`` previously passed the round 19 preflight, unloaded chat, then failed inside ``from_single_file``. P2 #7: ``_scrub_validation_obj`` in main.py also scrubs string dict KEYS. Pydantic ``string_type`` errors surface ``input`` verbatim, and a malformed payload like ``{"repo_id": {"hf_xxxxx": "owner/repo"}}`` would otherwise leak the token through the 422 response body. All 85 diffusion-relevant + 35 model-validation tests pass locally. Existing fakes for ``hf_hub_download`` updated to accept the new ``subfolder=`` kwarg the round 21 preflight uses. 
(P1 #3 cross-workload GPU handoff lock from round 20 is still deferred; round 21's P1 #4 / #5 raised the sweep-level guarantee, which closes the most common race without the deadlock risk of holding a process-wide lock across the entire load.)

P1 #1 + #2 + #6: extended the chat / diffusion / training identifier hardening to every export-side request model. ExportCommonOptions (parent of ExportMergedModelRequest / ExportBaseModelRequest / ExportLoRAAdapterRequest) now applies _no_control_chars and _reject_embedded_hf_token to repo_id and base_model_id; ExportGGUFRequest gets the same on its repo_id plus a control-char check on quantization_method; and LoadCheckpointRequest validates checkpoint_path. Previously "/api/export/*" accepted newline-smuggled identifiers and URL-form ``hf_xxxxx`` tokens that flowed into log lines.

P1 #3 + #4: ``_run_with_helper`` and ``_run_multi_pass_advisor`` now use a shared ``_gpu_workload_busy_for_helper`` that gates on diffusion (round 22 already), training, AND export. The round 22 guard only checked diffusion, so the dataset helper / advisor could still load llama-server on top of an active training run or a resident export checkpoint. Each step fails closed (unverifiable status counts as busy) so the user's primary workload is preserved.

P1 #5: PublishDatasetRequest in models/data_recipe.py also applies the identifier hardening to repo_id; the publish path previously accepted control characters and URL-form tokens.

P1 #7-10: added _validate_logged_identifier helper to routes/models.py and applied it to the path / query parameter endpoints that flow into logger.info(...) calls -- ``/config/{model_name}``, ``/check-vision/{model_name}``, ``/check-embedding/{model_name}``, ``/gguf-variants``. Mapped the validator's ValueError to HTTP 422 so the client sees the same shape as a Pydantic validation failure.

P2 #11 + #12: ``Loading diffusion model %s`` and ``Diffusion load failed for %s`` log lines route ``repo_id`` / ``effective_base`` through ``_display_repo_id`` (collapses absolute local paths to the leaf, still scrubs HF tokens) instead of plain ``_redact_hf_tokens``. The error path was already collapsed in the user-facing 400 / RuntimeError, but the structured-log lines kept the full path.

All 97 diffusion + training-validation + related tests pass locally.
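The identifier hardening described in this commit message can be pictured with a minimal sketch. The function names below echo the helpers named above, but the bodies are assumptions: the real implementations are not shown on this page, and the token pattern (``hf_`` followed by 20 or more alphanumerics) is a guess made for illustration.

```python
import re

# Assumption: an HF token looks like "hf_" plus a long alphanumeric run.
# The pattern used by the actual PR is not shown in the source.
_HF_TOKEN_RE = re.compile(r"hf_[A-Za-z0-9]{20,}")


def no_control_chars(value: str) -> str:
    """Reject ASCII control characters (0x00-0x1F) and DEL (0x7F)."""
    for ch in value:
        if ord(ch) < 0x20 or ord(ch) == 0x7F:
            raise ValueError(f"control character {ord(ch):#04x} is not allowed")
    return value


def reject_embedded_hf_token(value: str) -> str:
    """Reject values that carry something shaped like an HF token."""
    if _HF_TOKEN_RE.search(value):
        raise ValueError("identifier must not embed a Hugging Face token")
    return value


def validate_logged_identifier(value: str) -> str:
    """Run both checks; raises ValueError so a route can map it to HTTP 422."""
    return reject_embedded_hf_token(no_control_chars(value))
```

Raising ``ValueError`` keeps the sketch usable both from a Pydantic validator and from a route handler that translates the error into a 422 response.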

P1 #1: ``_gpu_workload_busy_for_helper`` in ``utils/datasets/llm_assist.py`` now also gates on the GGUF chat backend (llama-server) AND the safetensors chat backend. Round 23 extended it to training + export but missed Chat, so a helper / advisor GGUF could still race a loaded chat model for VRAM. Both checks fail closed when status is unverifiable.

P1 #2 / #3 / #4 / #5: re-ordered the route-level GPU-handoff unloads so the diffusion release runs BEFORE the chat releases. A wedged diffusion unload used to fire AFTER chat was already gone, so the user lost both on a single failure. Drop chat last so an earlier failure preserves it. Applied to ``/training/start`` (training.py), ``/export/load`` (export.py), ``/chat/load`` GGUF branch and ``/chat/load`` safetensors branch (routes/inference.py).

P1 #7 + P2 #13: ``/delete-finetuned`` body now hardens ``model_path`` and ``gguf_variant`` via the shared ``_validate_logged_identifier`` helper, so control characters and URL-form HF tokens can no longer log-line-smuggle.

P1 #8 + #10: ``/delete-cached`` body hardens ``repo_id`` and ``variant`` the same way.

P1 #9: ``/download-progress`` ``repo_id`` query parameter is also hardened; the value flows into log lines deep inside ``_get_repo_size_cached`` on lookup failure.

P1 #11: ``CheckFormatRequest.dataset_name`` and ``AiAssistMappingRequest.{dataset_name, model_name}`` in ``models/datasets.py`` now apply the same control-char + embedded-HF-token validators, matching every other public request-body model.

All 115 diffusion + training-validation + cached_gguf + export + inference model-validation tests pass locally.

(P1 #6 native-path-lease enforcement for diffusion local paths and P1 #12 React Compiler frontend lint deferred -- both need focused design / frontend touchups separate from this batch.)
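The fail-closed behaviour mentioned above ("unverifiable status counts as busy") can be sketched roughly as follows. ``gpu_workload_busy`` and the shape of its ``checks`` argument are hypothetical; the real ``_gpu_workload_busy_for_helper`` is not shown on this page.

```python
from typing import Callable, Mapping


def gpu_workload_busy(checks: Mapping[str, Callable[[], bool]]) -> bool:
    """Return True if any workload is active OR its status cannot be verified.

    `checks` maps a workload name (e.g. "training") to a zero-argument
    callable that reports whether that workload currently owns the GPU.
    """
    for _name, is_active in checks.items():
        try:
            if is_active():
                return True
        except Exception:
            # Fail closed: a status check that raises is treated as busy,
            # so the helper never loads on top of an unverifiable workload.
            return True
    return False
```

The alternative, logging the error and continuing, is the fail-open behaviour that the earlier rounds moved away from.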

Four actionable findings from round 30. Skipped P1 #1 / #2 / #3 (huggingface-hub bump in studio.txt / single-env / colab-new) because the live B200 Studio that successfully generated FLUX.2 klein images runs the exact combo the reviewer flags as broken:

    huggingface_hub 0.36.2 + transformers 4.57.6 + diffusers 0.37.1
    Flux2KleinPipeline: True (imports cleanly)

The is_offline_mode ImportError only fires with transformers 5.x, and the standard install path pins transformers==4.57.6 via constraints. The round 26 fix bumped no-torch-runtime.txt + pyproject huggingfacenotorch where the --no-deps install path can land on transformers 5.x; that remains the correct surface.

1. core/inference/diffusion.py: preflight transformers + accelerate via importlib.util.find_spec BEFORE any destructive GPU-owner unload. Diffusers can expose stub pipeline classes when transformers / accelerate are missing, so the load used to drop chat first and fail later inside from_pretrained. find_spec keeps existing tests that stub these modules passing because no real module is executed (round 30 P1 #11).

2. models/export.py ExportGGUFRequest.quantization_method: extend the embedded HF token validator to this field too. Round 23 added the control-char guard but not the token guard; the value is forwarded into worker command lines and reflected in error / success text (round 30 P1 #5).

3. models/data_recipe.py SeedInspectUploadRequest: add _no_control_chars + _reject_embedded_hf_token field_validators to filename and to each entry of file_names. Mirrors the sibling SeedInspectRequest.dataset_name hardening (round 30 P1 #6).

4. frontend/src/features/images/images-page.tsx: defer the initial refreshStatus() call via queueMicrotask so the synchronous setRefreshingStatus(true) inside it does not trip the react-hooks/set-state-in-effect lint on mount (round 30 P2 #12).

Deferred (need larger surgery / out of scope for this round):

* P1 #4 native_path_lease for diffusion local-path loads
* P1 #7-#10 helper/advisor + public-start window mutual lock symmetry

Tests: 98 targeted (diffusion + cached_gguf + inference_validation) pass locally; frontend npm run typecheck passes.
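Item 1 above relies on ``importlib.util.find_spec`` to check that a dependency is present without importing it. A minimal sketch of such a preflight follows; ``missing_modules`` is a hypothetical helper name, and treating anything already in ``sys.modules`` as present is an assumption of this sketch rather than something the commit message states.

```python
import importlib.util
import sys
from typing import Iterable, List


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the top-level module names that cannot be located."""
    missing = []
    for name in names:
        # Assumption: a module already registered in sys.modules (real or
        # stubbed by a test) counts as present.
        if sys.modules.get(name) is not None:
            continue
        try:
            # For a top-level name this locates the module without running it.
            spec = importlib.util.find_spec(name)
        except (ImportError, ValueError):
            spec = None
        if spec is None:
            missing.append(name)
    return missing
```

A caller would run this before any unload and abort early when the returned list is non-empty, so that a missing dependency never costs the user a loaded model.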

Two universal-consensus round-31 reviewer findings.

Concurrency: /images/load was leaking the public-load pending counter on any pre-finally HTTPException (round 31 P1 #1, 11/12 votes). _raise_if_helper_advisor_busy("diffusion") published the counter, then _resolve_diffusion_repo_for_request ran outside the clearing try/finally. A request like repo_id="/tmp/model" with no native_path_lease returned 400 and left public_load_pending() true until process restart, permanently blocking AI Assist. Fix mirrors the training / export pattern: track diffusion_load_window_published in an outer try, publish the flag right after the helper-busy check succeeds, and clear in an outer finally that only fires when the flag is set. This also closes round 31 P1 #6: a second request's failure can no longer decrement a still-active first request's counter, because the second request has not yet flipped its own publish flag.

Security: _looks_like_local_diffusion_path missed cwd-relative directories (round 31 P1 #2, 8/12 votes). DiffusionBackend.load_model accepts repo_id="exports/my-flux" as a local directory via Path(repo_id).expanduser().is_dir(), but the detector only flagged values starting with /, ~, ./, ../, backslash, or absolute. Tightened the detector to also reject:

* weight-file suffixes (.gguf / .safetensors / .bin / .pt / .pth)
* non-2-segment values (`owner`, `a/b/c`, `owner/`, `/repo`, `//`)
* 2-segment values whose parts are `.` or `..`
* 2-segment values that actually resolve to an existing local path under backend CWD (last-resort exists() probe).

The existence probe is a minor side-channel for an already-authenticated caller, accepted in exchange for closing the silent bypass of the new lease boundary. Valid Hub ids like unsloth/FLUX.2-klein-base-4B-GGUF, microsoft/Phi-3.5-mini-instruct still pass through unchanged.

Skipped (consistent with prior rounds):

* R31 P1 #3 (Tauri / native lease enum missing `load-diffusion-model` op): architectural surface; defer until the Images page actually surfaces a local-path picker.
* R31 P1 #4-#5, #8: studio.txt / constraints.txt / pyproject hub pins. Live B200 install path with huggingface_hub==0.36.2, transformers==4.57.6, diffusers==0.37.1 imports Flux2KleinPipeline cleanly. The is_offline_mode import error only triggers when transformers 5.x is paired with hub 0.x, which the constraints pin prevents.
* R31 P1 #7 (find_spec vs real import): a full transformers import at module load breaks tests that stub huggingface_hub; find_spec is the existing tradeoff.

98 targeted backend tests pass (test_diffusion_routes, test_diffusion_backend, test_inference_model_validation, test_models_get_model_config_case_resolution, test_data_recipe_seed, test_training_raw_support, test_export_log_cursor).
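The tightened detector rules listed in this commit message can be sketched as below. This is an illustration under stated assumptions, not the PR's ``_looks_like_local_diffusion_path``: the function name, the case-insensitive suffix match, and the injectable ``exists`` probe (added here so the sketch can be exercised without touching the filesystem) are all hypothetical.

```python
from pathlib import Path
from typing import Callable

_WEIGHT_SUFFIXES = (".gguf", ".safetensors", ".bin", ".pt", ".pth")


def looks_like_local_path(
    value: str,
    exists: Callable[[str], bool] = lambda p: Path(p).exists(),
) -> bool:
    """Return True when `value` should be treated as a local path, not a Hub id."""
    # Original rule: obvious path prefixes or an absolute path.
    if value.startswith(("/", "~", "./", "../", "\\")) or Path(value).is_absolute():
        return True
    # Weight-file suffixes never belong to a Hub repo id.
    if value.lower().endswith(_WEIGHT_SUFFIXES):
        return True
    # A Hub id here must be exactly two non-empty segments: owner/name.
    parts = value.split("/")
    if len(parts) != 2 or not all(parts):
        return True
    if any(part in (".", "..") for part in parts):
        return True
    # Last resort: a two-segment value that resolves to something on disk.
    return exists(value)
```

With a probe that always reports "not found", ``unsloth/FLUX.2-klein-base-4B-GGUF`` passes through as a Hub id, while ``exports/my-flux`` is flagged as soon as the probe reports that the directory exists.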

Three round-32 reviewer findings, plus documentation cleanup for the local-path Tauri/FE plumbing gap.

Concurrency: direct DiffusionBackend.load_model callers now publish the helper/advisor pending marker symmetrically (round 32 P1 #3). _raise_if_helper_advisor_busy_for_diffusion gains an optional publish_pending flag; load_model passes True so the destructive unload window is gated by a "diffusion-backend" tag published under _HELPER_ADVISOR_START_LOCK. The route layer's "diffusion" tag and the backend's "diffusion-backend" tag refcount independently (sum > 0 still blocks helper starts), so neither side's clear can erase the other's still-active marker. The existing _release_chat_backend_for_diffusion(check_helper_advisor=True) path stays snapshot-only (publish_pending defaults False) so test / direct callers of that helper do not leak a counter.

Validation: export save_directory now rejects ALL ASCII control characters (round 32 P1, save_directory tab finding). The earlier CR / LF only guard missed TAB / VT / FF / DEL, which a caller could smuggle past the export worker's logged subprocess argv.

Documentation: DiffusionLoadRequest.repo_id and base_repo updated to reflect that local-path support is gated on a Tauri / frontend load-diffusion-model directory lease producer that has not shipped yet (round 32 P1 #1 from multiple reviewers). The backend lease boundary is correct; what is missing is the FE / native side that mints the matching grant. Until that lands, local paths through the Images route always 400 with "Native path grant is required", which the docstring now spells out.

Skipped (consistent with prior rounds):

* Hub-pin findings (R32 P1 #4-#6): live B200 install with huggingface_hub==0.36.2 + transformers==4.57.6 + diffusers==0.37.1 verifiably imports Flux2KleinPipeline. Empirical justification documented in R30 / R30 follow-up commit msgs.
* Tauri / native enum surgery (R32 P1 #1, 6 votes): real architectural work but out of scope for this PR's Python surface. Documented now; FE / Rust ticket to follow.

98 targeted backend tests pass (test_diffusion_routes, test_diffusion_backend, test_inference_model_validation, test_models_get_model_config_case_resolution, test_data_recipe_seed, test_training_raw_support, test_export_log_cursor).
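The per-tag refcounting and the "clear only what you published" pattern described in the round-31 and round-32 messages can be sketched together. ``PendingLoads`` and ``guarded_load`` are hypothetical names; the real code keeps this state behind ``_HELPER_ADVISOR_START_LOCK`` and its exact structure is not shown on this page.

```python
import threading
from collections import Counter
from typing import Callable, TypeVar

T = TypeVar("T")


class PendingLoads:
    """Independent refcounts per tag; any positive count blocks helper starts."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._counts: Counter = Counter()

    def publish(self, tag: str) -> None:
        with self._lock:
            self._counts[tag] += 1

    def clear(self, tag: str) -> None:
        with self._lock:
            # Never go below zero, and never touch another tag's count.
            if self._counts[tag] > 0:
                self._counts[tag] -= 1

    def pending(self) -> bool:
        with self._lock:
            return sum(self._counts.values()) > 0


def guarded_load(pending: PendingLoads, tag: str, load: Callable[[], T]) -> T:
    """Publish the marker, run `load`, and clear only if this call published."""
    published = False
    try:
        pending.publish(tag)
        published = True
        return load()
    finally:
        if published:
            pending.clear(tag)
```

Because the clear is keyed on both the tag and the local ``published`` flag, a failing request cannot decrement a marker that another still-active request owns.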

Two round-33 reviewer findings: hub-floor consistency and the multipart upload filename validator gap.

Dependencies: reverted the round-26 huggingface_hub>=1.3.0 floor in no-torch-runtime.txt and pyproject.toml (round 33 P1 #1-#5, 4/12 vote consensus). studio.txt forces huggingface_hub==0.36.2 to match the transformers==4.57.6 pin in extras-no-deps.txt, so the 1.3.0 floor was internally inconsistent. Reviewers reproduced the resolver conflict on a fresh install. Empirical justification (re-verified on the live B200 host before the revert): huggingface_hub 0.36.2 + transformers 4.57.6 + diffusers 0.37.1 imports Flux2KleinPipeline cleanly and runs end-to-end image generation. transformers 4.57.6 carries its own transformers.utils.hub.is_offline_mode and does not actually need huggingface_hub.is_offline_mode at import time. The original bump was guarding against the (never-realised) transformers 5.x path, which extras-no-deps explicitly pins away.

Validation: multipart /seed/upload-unstructured-file now applies the same _no_control_chars and _reject_embedded_hf_token checks to file.filename that SeedInspectUploadRequest.filename already applies in the JSON variant (round 33 P1 #7). The filename is reflected back to the client, persisted in the per-file meta JSON, and echoed by error responses, so the JSON-side hardening must not be asymmetric with the multipart path.

Skipped (consistent with prior rounds):

* find_spec vs full import (R33 P1 #6): preserves test compatibility with the huggingface_hub stub fixture.
* React hooks set-state-in-effect lint (R33 P1 #8): codebase has 146 pre-existing violations of the same rule; studio-frontend-ci does not gate on lint.
* Direct DiffusionBackend.load_model bypass (R33 P1 #9): the route is the only production entry point, and the backend helper now publishes its own diffusion-backend pending tag (round 32 P1 #3). Direct-caller hardening would require duplicating the lease check into load_model itself, which is out of scope for the route-layer security boundary.
* One-segment Hub IDs (R33 P2 #10): strict 2-segment Hub id check is intentional; one-segment names are not valid Hub ids.
* Cwd-relative shadow of Hub IDs (R33 P2 #11): documented side-channel tradeoff accepted in round 31 commit msg.

97 targeted backend tests pass.

danielhanchen and others added 4 commits April 5, 2026 02:34

Add tests for is_vision_model() caching behaviour

a25c064

[pre-commit.ci] auto fixes from pre-commit.com hooks

44f0458

for more information, see https://pre-commit.ci

github-code-quality Bot found potential problems Apr 6, 2026


Comment thread studio/backend/tests/test_vision_cache.py

Comment on lines +35 to +39

from utils.models.model_config import (
    is_vision_model,
    _is_vision_model_uncached,
    _vision_detection_cache,
)
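For readers unfamiliar with the ``MagicMock(spec=[])`` idiom that the PR description mentions for config mocks, here is a small standalone sketch. ``detect_vision`` is a hypothetical stand-in for the detection logic (it is not the project's ``_is_vision_model_uncached``); only the attribute names ``vision_config`` and ``img_processor`` come from the PR description.

```python
from unittest.mock import MagicMock


def detect_vision(config) -> bool:
    """Hypothetical stand-in: a config counts as vision if it has either attribute."""
    return hasattr(config, "vision_config") or hasattr(config, "img_processor")


# A default MagicMock invents any attribute on access, so hasattr() is always
# True and the test can pass through an unintended detection path.
loose = MagicMock()

# spec=[] leaves no auto-created attributes; only explicitly set ones exist.
strict = MagicMock(spec=[])

text_only = MagicMock(spec=[])
vision = MagicMock(spec=[])
vision.vision_config = {"hidden_size": 1024}
```

Here ``detect_vision(loose)`` is truthy regardless of what the test intended, whereas ``detect_vision(text_only)`` is false and ``detect_vision(vision)`` is true only because ``vision_config`` was set explicitly.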

danielhanchen closed this Apr 6, 2026

Add tests for is_vision_model() caching behaviour #6
danielhanchen wants to merge 4 commits into main from pr-4855-head
