🚨 [v5] `generate` delegates default cache initialization to the model by gante · Pull Request #41505 · huggingface/transformers

gante · 2025-10-10T09:09:27Z

What does this PR do?

See PR title.

Now that all traces of legacy caches were removed, we can trust the model to initialize its own cache! This means we no longer need to set cache_implementation="xxx" defaults in new models, assuming the model's forward pass defaults to the right cache class.

Also fixes related bugs, uncovered by not feeding a cache to the model.


gante · 2025-10-10T12:49:03Z

-        if requires_cross_attention_cache and not isinstance(model_kwargs[cache_name], EncoderDecoderCache):
-            model_kwargs[cache_name] = EncoderDecoderCache(
-                model_kwargs[cache_name],  # self-attention cache
+        if (


Update to this branch: we only want to convert the cache to EncoderDecoderCache here if:

- the user has set custom cache args in generate
- the model is encoder-decoder

(implicitly, all encoder-decoder models have past_key_values as their cache name)

In all other cases, we delegate cache init to the model itself.
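The branch condition described above can be sketched in isolation. This is only an illustration under stated assumptions: `PlainCacheStub` and `EncoderDecoderCacheStub` are hypothetical stand-ins rather than the real transformers cache classes, `maybe_wrap_user_cache` is a made-up helper name, and treating "a cache is already present in `model_kwargs`" as the signal that the user set custom cache args is an assumption. It also reads the two listed conditions as jointly required.

```python
class PlainCacheStub:
    """Hypothetical stand-in for a self-attention cache."""


class EncoderDecoderCacheStub:
    """Hypothetical stand-in pairing a self-attention and a cross-attention cache."""

    def __init__(self, self_attention_cache, cross_attention_cache):
        self.self_attention_cache = self_attention_cache
        self.cross_attention_cache = cross_attention_cache


def maybe_wrap_user_cache(model_kwargs, is_encoder_decoder, cache_name="past_key_values"):
    """Wrap a pre-built cache only when it exists and the model is encoder-decoder."""
    cache = model_kwargs.get(cache_name)
    if cache is None:
        # No custom cache args: leave kwargs alone so the model builds its own cache.
        return model_kwargs
    if is_encoder_decoder and not isinstance(cache, EncoderDecoderCacheStub):
        wrapped = EncoderDecoderCacheStub(cache, PlainCacheStub())
        return {**model_kwargs, cache_name: wrapped}
    # Decoder-only model, or the cache is already wrapped: nothing to convert.
    return model_kwargs
```

In this sketch, an empty `model_kwargs` passes through untouched, which corresponds to the "delegate cache init to the model" path.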

gante · 2025-10-10T12:50:40Z

@@ -546,8 +546,10 @@ def prepare_inputs_for_generation(
        model_inputs["cache_position"] = cache_position

        # 2. Generic cache-dependent input preparation


Changes in new L549-594 are related to recurrent_gemma: prior to the deletion of the old L2002-2007, generate was preparing a DynamicCache for recurrent_gemma. This cache was never used in forward, but it was inducing the correct behavior in prepare_inputs_for_generation (cache_position-based input slicing)

With the new logic, use_cache=True implies cache_position-based input slicing, even if the model is not using a standard cache.
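To make "cache_position-based input slicing" concrete, here is a small pure-Python sketch. It uses nested lists instead of tensors, and `slice_inputs_by_cache_position` is a hypothetical helper, not the library's function. The assumption is that `cache_position` holds the indices of the tokens that have not been processed yet.

```python
def slice_inputs_by_cache_position(input_ids, cache_position, use_cache):
    """Keep only the token columns listed in cache_position when caching is on."""
    if not use_cache:
        # Without caching, the full sequence is fed on every step.
        return input_ids
    return [[row[i] for i in cache_position] for row in input_ids]


# Prefill: all four positions are unprocessed, so nothing is dropped.
prompt = [[11, 12, 13, 14]]
prefill = slice_inputs_by_cache_position(prompt, [0, 1, 2, 3], use_cache=True)

# Decoding step: one token was appended, so only position 4 is new.
step = slice_inputs_by_cache_position([[11, 12, 13, 14, 15]], [4], use_cache=True)
```

The point of the sketch is that the slicing depends only on `use_cache` and `cache_position`, not on whether a standard cache object exists, which is the situation described for recurrent_gemma.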

gante · 2025-10-11T10:12:19Z

            past_key_values = (
                EncoderDecoderCache(DynamicCache(config=self.config), DynamicCache(config=self.config))
-                if encoder_hidden_states is not None
+                if encoder_hidden_states is not None or self.config.is_encoder_decoder


On encoder-decoder models that we may want to use as decoder-only, we want EncoderDecoderCache in two possible situations:

- [previously missing] the model is encoder-decoder
- the model is not encoder-decoder, but encoder_hidden_states is passed (which means we will compute the cross-attention, and thus we should cache it)
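The two situations can be written as a single predicate. This is a minimal sketch: `wants_encoder_decoder_cache` is a hypothetical name, and the config is mocked with `SimpleNamespace` rather than a real model config.

```python
from types import SimpleNamespace


def wants_encoder_decoder_cache(config, encoder_hidden_states=None):
    """True if cross-attention will be computed and should therefore be cached."""
    is_encoder_decoder = bool(getattr(config, "is_encoder_decoder", False))
    return encoder_hidden_states is not None or is_encoder_decoder


encoder_decoder_config = SimpleNamespace(is_encoder_decoder=True)
decoder_only_config = SimpleNamespace(is_encoder_decoder=False)

# Case 1: encoder-decoder model, even before encoder states are available.
case_1 = wants_encoder_decoder_cache(encoder_decoder_config)
# Case 2: decoder-only usage, but encoder states are passed in explicitly.
case_2 = wants_encoder_decoder_cache(decoder_only_config, encoder_hidden_states=[[0.1, 0.2]])
# Neither condition holds: a plain self-attention cache would be enough.
case_3 = wants_encoder_decoder_cache(decoder_only_config)
```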

gante · 2025-10-11T10:14:51Z

        choice_labels,
    ):
+        config = copy.deepcopy(config)
+        config.is_decoder = True


RoFormerForCausalLM won't use the cache if config.is_decoder!=True, and this test tests cache usage 🙃

(A warning was being thrown)

gante · 2025-10-11T10:33:27Z

+        # build `cache_position` on the fly
+        seq_length = inputs["input_ids"].shape[1]
+        inputs = self.model._get_initial_cache_position(seq_length, self.model.device, inputs)
+        # prepare other inputs


whisper has a custom generation structure that doesn't follow our code patterns -> one of the issues is that it has a stateful LogitsProcessor -> this state-related function uses prepare_inputs_for_generation out of the usual order -> changes related to this PR exposed that it was missing cache_position as an input here

(if we have bandwidth, we should revisit whisper generate to streamline its code)


Cyrilvallez

LGTM, thanks!!

Cyrilvallez · 2025-10-13T08:22:42Z

-        if model_kwargs.get("past_key_values") is not None:
+        if model_kwargs.get("past_key_values", None) is not None:


Technically not needed, get has None as a default (and I thought ruff was now enforcing this one but apparently not 🤔 too many changes to the ruff rules recently)
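A quick check of the point above, using a plain dict as a stand-in for `model_kwargs` (the contents are made up for illustration):

```python
# Stand-in for model_kwargs without a cache entry.
model_kwargs = {"attention_mask": [1, 1, 1]}

# dict.get returns None for a missing key whether or not the default is spelled out.
implicit_default = model_kwargs.get("past_key_values")
explicit_default = model_kwargs.get("past_key_values", None)
```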

Cyrilvallez · 2025-10-13T08:25:47Z

-        # initialize `past_key_values`
-        if use_cache and past_key_values is None:
-            past_key_values = EncoderDecoderCache(DynamicCache(config=self.config), DynamicCache(config=self.config))
-


EncoderDecoderCache initialized in a Decoder only module... 🥲🥲🥲 good catch!

zucchini-nlp

Thanks, left one question about something that seems to be a typo or just my misunderstanding

zucchini-nlp · 2025-10-13T08:54:52Z

+        # build `cache_position` on the fly
+        seq_length = inputs["input_ids"].shape[1]
+        inputs = self.model._get_initial_cache_position(seq_length, self.model.device, inputs)
+        # prepare other inputs


(if we have bandwidth, we should revisit whisper generate to streamline its code)

💯, revisiting audio modality generation will be super helpful

zucchini-nlp · 2025-10-13T08:59:14Z

        model_inputs["cache_position"] = cache_position

        # 2. Generic cache-dependent input preparation
+        use_cache = kwargs.get("use_cache", False) or getattr(self.config, "use_cache", False)


This will result in True even when the user kwargs set caching to False. IIUC, we just want to give priority to the user-defined value, rather than assume caching is used whenever it is set to True in either place.

@zucchini-nlp that's a good catch, user-defined generate kwargs >> config values!

Will update accordingly
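A minimal sketch of the precedence issue raised in this thread. `resolve_use_cache` is a hypothetical helper, not the code that was merged, and treating `None` as "not set by the user" is an assumption. The config is mocked with `SimpleNamespace`.

```python
from types import SimpleNamespace


def resolve_use_cache_with_or(kwargs, config):
    """The expression from the diff: a config value of True overrides a user False."""
    return kwargs.get("use_cache", False) or getattr(config, "use_cache", False)


def resolve_use_cache(kwargs, config):
    """User-provided kwarg wins; the config value is only a fallback."""
    user_value = kwargs.get("use_cache")
    if user_value is not None:
        return bool(user_value)
    return bool(getattr(config, "use_cache", False))


config = SimpleNamespace(use_cache=True)
user_kwargs = {"use_cache": False}

with_or = resolve_use_cache_with_or(user_kwargs, config)  # user's False is ignored
with_priority = resolve_use_cache(user_kwargs, config)    # user's False is respected
```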

zucchini-nlp · 2025-10-13T09:01:51Z

+        use_cache = kwargs.get("use_cache", False) or getattr(self.config, "use_cache", False)
        if past_key_values is not None:
            model_inputs["past_key_values"] = past_key_values
+        if past_key_values is None or use_cache:


Maybe I am missing something: do we apply cache slicing when past_key_values is None? It doesn't look intuitive at first sight, so let's add a comment explaining why.

stateful models like recurrent_gemma assume that slicing happens, but don't have a Cache object -- will add a comment :)

So they don't have a Cache object, and the use_cache check doesn't catch those cases either?

github-actions · 2025-10-13T11:54:30Z

[For maintainers] Suggested jobs to run (before merge)

run-slow: bart, bert, bert_generation, bigbird_pegasus, blenderbot, blenderbot_small, camembert, data2vec, electra, ernie, fsmt, kosmos2, marian, mbart, mvp, pegasus

delegate cache init

ba8d852

gante changed the title ~~[generate] delegate default cache initialization to the model~~ 🚨 [v5] generate delegates default cache initialization to the model Oct 10, 2025

gante mentioned this pull request Oct 10, 2025

Welcome v5 #40822

Closed

gante added 3 commits October 10, 2025 09:25

path for decoder-only cache in encoder-decoder models

2d65e9a

fsmt initializes the cache in the wrong place

e5455cf

fix recurrent gemma

737207d

defaults

c6b40a0

gante and others added 4 commits October 10, 2025 13:09

cache init

6c09d0a

cache init condition

465240f

whisper :'(

30ad0a9

Merge branch 'main' into delegate_cache_init

77a84a6

gante requested review from Cyrilvallez and zucchini-nlp October 11, 2025 10:30

Cyrilvallez approved these changes Oct 13, 2025

zucchini-nlp reviewed Oct 13, 2025

PR suggestions by Raushan

93486fc

gante merged commit d621be8 into huggingface:main Oct 13, 2025
25 checks passed

gante deleted the delegate_cache_init branch October 13, 2025 12:20

This was referenced Oct 14, 2025

CI fails with dev dependencies: TypeError: 'NoneType' object is not subscriptable huggingface/trl#4272

Closed

CI fails with dev dependencies: torch.AcceleratorError: CUDA error: device-side assert triggered huggingface/trl#4281

Closed

qgallouedec mentioned this pull request Oct 17, 2025

🧺 [4/N] Refactor _generate in GRPO/RLOO: Move forward_kwargs outside generation method huggingface/trl#4154

Merged

This was referenced Oct 21, 2025

Fix CUDA index out of bounds for q_idx in VLM token type masking for Gemma3, PaliGemma, and example modular #41757

Merged

Fix logic error in prepare_inputs_for_generation cache slicing condition #41764

Merged

ngazagna-qc pushed a commit to ngazagna-qc/transformers that referenced this pull request Oct 23, 2025

🚨 [v5] generate delegates default cache initialization to the model (…

80a39d4

…huggingface#41505)

Cyrilvallez mentioned this pull request Dec 3, 2025

Fix some models cache initialization #42586

Merged

SangbumChoi pushed a commit to SangbumChoi/transformers that referenced this pull request Jan 23, 2026

🚨 [v5] generate delegates default cache initialization to the model (…

300d63e

…huggingface#41505)

BBC-Esq mentioned this pull request Mar 30, 2026

Upgrade to HF transformers >= v5 docling-project/docling#3090

Closed

SeaL773 mentioned this pull request May 5, 2026

Fix UnboundLocalError for is_updated in encoder-decoder cross-attention #45773

Open
