Skip to content

fix embedding matrix size mismatch bug#645

Merged
danielhanchen merged 1 commit into
unslothai:mainfrom
CodeMan62:new-token-bug
May 17, 2026
Merged

fix embedding matrix size mismatch bug#645
danielhanchen merged 1 commit into
unslothai:mainfrom
CodeMan62:new-token-bug

Conversation

@CodeMan62

Copy link
Copy Markdown
Contributor

This PR is fixing the embedding matrix size mismatch bug
closes 3502
image shows that the patch works
image

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a0fbebd247

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".


# Confirm sizes are correct
if embedding_matrix.shape[0] != (old_input_length + len(new_tokens)):
if embedding_matrix.shape[0] != new_vocab_length:

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Keep validating the number of actually added tokens

When new_tokens contains duplicates such as ['<foo>', '<foo>'] with method='interpolation', tokenizer.add_tokens only adds one row, so new_vocab_length still matches the resized embedding size and this check now passes. The later interpolation loop still enumerates both entries and writes to embedding_matrix[old_length + j], causing an out-of-bounds write on the second duplicate (or initializing rows for tokens that were not added). Please compare new_vocab_length - old_length to the de-duplicated tokens actually added, or filter new_tokens accordingly before the later loop.

Useful? React with 👍 / 👎.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request refactors the add_new_tokens function in unsloth_zoo/tokenizer_utils.py by introducing a new_vocab_length variable to simplify embedding resizing and validation logic. Feedback indicates that the variables old_input_length, old_output_length, and old_config_size are now unused and should be removed to maintain code cleanliness.

# Add tokens!
old_length = len(tokenizer)
tokenizer.add_tokens(new_tokens)
new_vocab_length = len(tokenizer)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The variables old_input_length, old_output_length, and old_config_size (defined at lines 111-113) are no longer used after this change. They should be removed to maintain code cleanliness and avoid confusion for future maintainers.

danielhanchen added a commit that referenced this pull request May 15, 2026
* fix CI fallout from MLX subpackage refactor (#634)

Three small fixes triggered by tests on PRs #645 / unsloth#5426:

1) tests/test_upstream_pinned_symbols_accelerator.py read the legacy
   flat unsloth_zoo/mlx_trainer.py and mlx_loader.py paths directly.
   #634 moved those to unsloth_zoo/mlx/trainer.py and mlx/loader.py;
   point the tests at the new paths so they exercise the post-refactor
   sources.

2) unsloth_zoo/vision_utils.py:68 did an unconditional top-level
   import torchvision while torchvision is not a declared dependency.
   On a CPU-only install that lacks torchvision (e.g. zoo's Core CI
   matrix), every import of unsloth_zoo.vision_utils raises
   ModuleNotFoundError. Guard the import so the module loads; existing
   _read_video_torchvision call sites already only fire when video
   reading is requested.

3) tests/test_zoo_source_upstream_refs.py tries to resolve transformers
   qwen2_vl / qwen2_5_vl image-processing modules, both of which
   transitively `import torchvision`. Skip via pytest.importorskip
   when torchvision is not installed instead of failing the run.

Each fix is independent and minimal; no behavior change for installs
that have torchvision (the eager import path is identical), and no
behavior change for the MLX trainer / loader (the tests still pin
the exact same source-level invariants on the new file paths).

* tests/conftest: also stub unsloth.device_type on CPU-only runners

The post-#634 conftest stubs ``unsloth_zoo.device_type`` so importing
the package on a CPU-only CI runner doesn't trip ``get_device_type()``.
``test_unsloth_trainer_exec_marker`` (and any future test that does
``import unsloth.*``) goes one level up: it imports ``unsloth.trainer``,
which traverses ``unsloth/__init__.py`` → ``_gpu_init`` → ``unsloth/
device_type.py:get_device_type()``, raising the same NotImplementedError
zoo's stub was meant to suppress.

Parameterise ``_preload_real_device_type(package, prereqs)`` so the
same harness works for both packages, factor the fallback into
``_install_device_type_stub(name)``, and call both. unsloth's
``get_device_type()`` consumes ``unsloth_zoo`` helpers but has no
intra-unsloth prereqs, hence ``prereqs=()`` there.

* tests/conftest: don't stub unsloth.device_type after all

The previous commit (196a78b) stubbed unsloth.device_type so
test_unsloth_trainer_exec_marker could complete its
``import unsloth.trainer`` smoke. That makes ``import unsloth`` succeed
on CPU-only CI, which then runs
``unsloth/_gpu_init.py:_patch_trl_trainer()`` and rebinds
``trl.trainer.sft_trainer.SFTTrainer`` /
``transformers.models.ministral.MinistralAttention`` to Unsloth's
compiled wrappers.

``inspect.getsource(...)`` on those classes follows the wrapper's
``__init__.__code__.co_filename`` and returns the rewritten source --
which doesn't contain the literals the drift detectors anchor on
(``self._signature_columns``, ``self._prepare_dataset(``,
``class SFTTrainer``, named ``hidden_states``/``attention_mask``).
The HF=4.57.6 job went from -1 pre-existing failure
(``test_unsloth_trainer_exec_marker``, also failing on main) to -4 new
failures across ``test_MinistralAttention_forward_signature`` and the
three ``test_unsloth_rl_trainer_*`` checks. That's a regression.

Keep parameterised ``_preload_real_device_type`` and
``_install_device_type_stub`` -- they're harmless infrastructure and
useful if a future test needs the unsloth.device_type stub
specifically. Just don't fire the second preload at session setup.

``test_unsloth_trainer_exec_marker`` will continue to fail on
CPU-only CI as it does on main; that's a separate pre-existing issue
with ``unsloth/device_type.py`` raising on accelerator-less hosts and
should be fixed upstream in unsloth itself, not papered over in zoo's
conftest.

* device_type: mirror UNSLOTH_ALLOW_CPU=1 gate from unsloth #5429

unsloth/_gpu_init.py:125 imports DEVICE_TYPE from unsloth_zoo, not
local unsloth/device_type.py, so the env-var gate has to live here
too for `import unsloth.trainer` to succeed on CPU-only CI.

The companion unsloth PR landed as cb15a7a (#5429). Pairs with the
conftest change in this branch that sets UNSLOTH_ALLOW_CPU=1 before
triggering `import unsloth` from _apply_upstream_import_fixes_for_tests.

@functools.cache on get_device_type plus the module-level
DEVICE_TYPE = get_device_type() assignment mean the os.environ.get
runs exactly once per process. Production hosts with a real accelerator
hit the `cuda`/`xpu`/`hip` branches above and never reach the env-var
short-circuit.

* prod: graceful no-op for two transformers 5.x runtime bugs

Two upstream changes broke zoo at runtime (not just CI drift detection):

1. transformers 5.x removed transformers.cache_utils.HybridCache. zoo's
   utils.py falls back to `HybridCache = typing.Any`, which made
   `gemma.py:260 isinstance(past_key_values, HybridCache)` raise
   TypeError ("isinstance() arg 2 must be a type, a tuple of types,
   or a union") at runtime on the gemma mask path.

   Fix: expose HAS_HYBRID_CACHE alongside the existing HybridCache
   fallback in utils.py, and gate the isinstance call in gemma.py:260
   on the flag. When HybridCache is genuinely missing, the elif chain
   falls through to the dynamic-cache branch -- same observable
   behavior as the pre-5.x path that never matched the type.

2. transformers 5.x rewrote PreTrainedModel.save_pretrained. The
   source-string anchors zoo's LoRA-merge-on-save optimization patches
   in `merge_and_dequantize_lora` (state_dict_split assignment,
   `state_dict[tensor].contiguous()`, `def save_pretrained`, and the
   filename_to_tensors for-loop on the push_to_hub path) are all gone.
   Each downstream `raise RuntimeError("Failed to find ...")` would
   fire, killing `merge_and_dequantize_lora(push_to_hub=True)` on 5.x.

   Fix: scan all required anchors upfront in saving_utils.py. If any
   are missing on installed transformers, emit a one-time warnings.warn
   and fall back to vanilla `model.save_pretrained(...)`. The end-user
   sees the warning, no crash, and is told to call
   `model.merge_and_unload()` (or equivalent) before saving if they
   want merged LoRA weights on disk. The per-anchor RuntimeError
   raises downstream are kept as defense-in-depth for partial drift
   (one anchor renamed, others intact) that the upfront check might
   miss.

The two corresponding drift detectors in tests/ are rewritten in the
companion commit as positive assertions: 4.57.6 keeps the strict
existence check; 5.x asserts the anchor is gone AND zoo's fallback
covers it (`HAS_HYBRID_CACHE` flag / `_required_anchors` list).

* tests: version-gate transformers-5.x drift detectors

Sixteen drift detectors in the Core HF=default / HF=latest matrix
slices were pointing at upstream symbols / source strings that
transformers 5.x intentionally removed:

* 9 `temporary_patches/*_for_causal_lm_forward_named_params` --
  cache_position moved from named forward param into
  **kwargs: Unpack[TransformersKwargs] (DeepseekV3, GptOss, CsmDepth,
  Csm, Qwen3Moe, Qwen3Next, Qwen3VLMoe, MinistralModel, GraniteMoeHybrid).
* 4 compiler/source rewriter probes -- output_attentions branching
  removed, MoE routing_weights cast refactored, Trainer
  is_torch_tpu_available/is_torch_xla_available both removed,
  _update_causal_mask hasattr probe all-False.
* 1 routing_weights.to substring probe in compiler_rewriter_exhaustive.
* 1 GptOssConfig dedent-compare -- rope_theta/initial_context_length/
  rope_scaling collapsed into rope_parameters dict.
* 1 GptOssConfig kwarg signature -- same rope_theta rename.

For all 16, the zoo runtime patch already gracefully no-ops on 5.x
(verified by the Explore audit: try/except + relaxed patch_function +
hasattr guards + str.replace's silent no-op-on-miss). Skip the
detector on 5.x with a clear "<symbol> removed on transformers
{version}" message; keep it strict on 4.57.6 where real drift could
still surface.

Two more drift detectors -- test_hybrid_cache_class_present and the
two saving_utils.py pinned-string tests -- become positive assertions:
on 5.x they confirm zoo's prod fix correctly identifies the missing
anchor (HAS_HYBRID_CACHE is False / `_required_anchors` covers the
missing strings) and the graceful fallback is wired up.

Conftest also sets UNSLOTH_ALLOW_CPU=1 before the existing
`import unsloth` trigger, which unblocks `test_unsloth_trainer_exec_marker`
on CPU-only CI runners (the companion unsloth PR #5429 lets that
import succeed without an accelerator).

* tests: also gate test_MinistralAttention_forward_signature on 5.x

Missed this one in 18c9f62. Same root cause as MinistralModel:
transformers 5.x reflowed MinistralAttention.forward, zoo's
``patch_function(..., match_level='relaxed')`` falls back to a
``(self, *args, **kwargs)`` wrapper, the named-param probe sees the
wrapper signature and drift-fails. Runtime call still works because
the wrapper forwards via kwargs. Skip on 5.x, keep strict on 4.57.6.

* tests: probe Ministral stash + skip relaxed-wrapper case

Zoo's ministral.py:94-96 wraps the actual implementation in a generic
``def forward(self, *args, **kwargs): return _full_forward(...)``
adapter before calling patch_function -- this lets `check_args_kwargs`
accept params transformers 5.x removed (cache_position) by routing
them through **kwargs. The side effect: `inspect.signature(
MinistralAttention.forward)` after patching shows the wrapper, not the
real named params.

This test used to pass only because the test runner couldn't fully
trigger zoo's TEMPORARY_PATCHES loop (`unsloth/models/_utils.py:580`
runs at import time but unsloth import was being silently aborted).
With UNSLOTH_ALLOW_CPU=1 now making `import unsloth` succeed, the
patch fires correctly and the live signature is the wrapper.

Fix: probe the `_original_modeling_ministral_MinistralAttention_forward`
stash that `patch_function(store_original=True)` writes, falling back
to the live attribute. If the live attr is the relaxed wrapper and
there's no stash, the named-param probe isn't meaningful -- the
runtime kwargs contract is enforced elsewhere. Skip with a clear
message. 4.57.6: stash is present and the named-param probe still
runs strict; real drift in the upstream signature still surfaces.

* tests: gate qwen2_5_vl image-processor drift on 5.x

transformers 5.x dropped the slow image processors entirely
(image_processing_qwen2_5_vl.py and image_processing_qwen2_5_vl_fast.py
are both gone in v5.5.0). Qwen2.5-VL now reuses Qwen2VLImageProcessor
directly. zoo's misc.py:1500-1506 patch site for Qwen2_5_VLImageProcessor
is try/except ImportError-wrapped, so it silently no-ops on 5.x and the
runtime shim still fires via the Qwen2VLImageProcessor patch at
misc.py:1485-1498 (the class Qwen2.5-VL inherits at runtime).

The Qwen2.5-VL image-processor path was previously hidden by
``pytest.importorskip("torchvision")`` -- PR #648 added torchvision to
the upstream-regression matrix install, which unmasked the drift.
Runtime is unaffected. Add the same 5.x skip pattern used by the other
16 gated detectors so the dead-code patch site doesn't block CI.

4.57.6 keeps the strict path-existence check.
@danielhanchen danielhanchen merged commit fc983dd into unslothai:main May 17, 2026
8 of 11 checks passed
@danielhanchen

Copy link
Copy Markdown
Member

Thanks for the PR - appreciate it!

Brishen pushed a commit to Brishen/unsloth-zoo that referenced this pull request May 19, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug] add_new_tokens -> Embedding matrix size did not get resized properly

2 participants