Fix OOM regression for FSDP2 + cpu_ram_efficient_loading on large models by AmineDiro · Pull Request #45649 · huggingface/transformers

AmineDiro · 2026-04-25T21:12:04Z

What does this PR do?

PR #45050 replaces torch.empty_like with torch.zeros_like in _move_missing_keys_from_meta_to_device. While this fixes a real issue (NaN garbage in uninitialized memory), it forces a physical-memory commit of the entire model on every non-rank-0 FSDP rank.
With 8 ranks per node loading a 30B model, peak CPU memory jumps from ~60 GB to ~480 GB :/
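The two figures above can be reproduced with a back-of-the-envelope calculation. This is only a sketch: the 2 bytes per parameter (bf16) is an assumption that is not stated in the PR, and the estimate ignores buffers and allocator overhead.

```python
# Rough estimate of peak CPU memory per node during loading.
# Assumption (not from the PR): weights are held in bf16, i.e. 2 bytes per parameter.
PARAMS = 30e9
BYTES_PER_PARAM = 2
RANKS_PER_NODE = 8

model_gb = PARAMS * BYTES_PER_PARAM / 1e9

# empty_like: non-rank-0 placeholders are never written, so their pages are
# not committed and roughly one full copy (rank 0's) is resident per node.
lazy_peak_gb = model_gb

# zeros_like: every rank writes zeros to every page, so each of the
# 8 ranks commits a full copy of the model.
eager_peak_gb = model_gb * RANKS_PER_NODE

print(f"one copy: ~{model_gb:.0f} GB, eight copies: ~{eager_peak_gb:.0f} GB")
```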

The regression was identified by bisecting transformers commits between 2026-04-10 (working) and 2026-04-22 (failing) using a 2-node FSDP2 control config:

| Commit | Date | Result | MFU |
| --- | --- | --- | --- |
| `c43f15c` | 2026-04-10 | PASS | 21.46% |
| `a001f34439` (pre-#45050) | 2026-04-13 | PASS | 21.13% |
| `ff49f7c4cb` (PR #45050) | 2026-04-13 | FAIL | OOM |
| `e40b0c0` (post-#45050) | 2026-04-13 | FAIL | OOM |
| `8426e7e` | 2026-04-15 | FAIL | OOM |
| `7a0d582` | 2026-04-20 | FAIL | OOM |
| `9dff7ca` | 2026-04-21 | FAIL | OOM |
| `cbe7a02` | 2026-04-22 | FAIL | OOM |

Test config: Qwen/Qwen3-30B-A3B, FSDP2, 2 nodes × 8 H100, DP=16, sdpa, max_steps=5, fsdp_cpu_ram_efficient_loading=true.
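For readers who want to see how the bisection narrows down to a single commit, here is a minimal sketch. The pass/fail values are copied from the table; the helper name `first_bad` is hypothetical, and the search assumes that every commit after the first failure also fails, which holds for the results listed.

```python
# Results copied from the bisection table, oldest first. True means the run passed.
results = [
    ("c43f15c", True),
    ("a001f34439", True),
    ("ff49f7c4cb", False),
    ("e40b0c0", False),
    ("8426e7e", False),
    ("7a0d582", False),
    ("9dff7ca", False),
    ("cbe7a02", False),
]


def first_bad(history):
    """Binary search for the first failing commit.

    Assumes history[0] passes, history[-1] fails, and there is a single
    PASS -> FAIL transition in between.
    """
    lo, hi = 0, len(history) - 1  # invariant: history[lo] passes, history[hi] fails
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if history[mid][1]:
            lo = mid
        else:
            hi = mid
    return history[hi][0]


culprit = first_bad(results)
print(culprit)
```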

The placeholder values for state-dict params on non-rank-0 ranks are immediately overwritten by fsdp2_load_full_state_dict during accelerate's FSDP2 prepare: accelerate moves the entire model to the meta device before sharding, in accelerate.utils.fsdp_utils.fsdp2_prepare_model.
So allocating CPU placeholders for parameters on non-rank-0 ranks is unnecessary work; the parameters can stay on meta. Btw, from what I understand, buffers (RoPE caches, attention masks, etc.) are per-rank and not part of the broadcast, so they still need real allocations.
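The idea of leaving parameters on meta while giving buffers real storage can be sketched in plain PyTorch. This is an illustration only, not the code in this PR: `TinyBlock` and `materialize_buffers_only` are hypothetical names, and the sketch assumes PyTorch 2.0 or later for the `torch.device` context manager.

```python
import torch
from torch import nn


class TinyBlock(nn.Module):
    """Hypothetical stand-in for a model with one layer and one RoPE-style buffer."""

    def __init__(self, dim=4):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
        self.register_buffer("inv_freq", torch.ones(dim // 2), persistent=False)


def materialize_buffers_only(model):
    """Give meta buffers real CPU storage; leave parameters untouched on meta."""
    for full_name, buf in list(model.named_buffers()):
        if buf.device.type != "meta":
            continue
        parent_name, _, leaf = full_name.rpartition(".")
        parent = model.get_submodule(parent_name) if parent_name else model
        parent._buffers[leaf] = torch.zeros_like(buf, device="cpu")


# Build the model on meta, as a non-rank-0 process would see it.
with torch.device("meta"):
    model = TinyBlock()

materialize_buffers_only(model)

print({name: p.device.type for name, p in model.named_parameters()})
print({name: b.device.type for name, b in model.named_buffers()})
```

Parameters cost nothing here because a meta tensor carries only shape and dtype, while the buffers are small enough that zero-filling them is cheap.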

Fixes # (issue)

Code Agent Policy

I confirm that this is not a pure code agent PR.

Before submitting

Did you read the [contributor guideline]

Who can review?

@albertvillanova @ArthurZucker

HuggingFaceDocBuilderDev · 2026-04-25T21:22:22Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Rocketknight1 · 2026-04-27T11:29:07Z

cc @Cyrilvallez I think

albertvillanova

Thanks a lot for the clear diagnosis and the fix: Skip CPU param materialization on non-rank-0 FSDP ranks to avoid OOM

The OOM regression in #45050 is real: zeros_like forces an immediate physical-memory commit (page fault on every zero write), whereas empty_like relies on overcommit/lazy allocation. Note this was already commented by @ArthurZucker: https://github.com/huggingface/transformers/pull/45050/changes#r3029107360

> the reason I don't want this is because its costly!

Let me trace through the full flow after the change to confirm:

On non-rank-0 FSDP ranks:
- Parameters stay on the meta device: zero physical memory committed.
- Buffers (both persistent and non-persistent) get real CPU zeros_like placeholders.

Then _initialize_missing_keys (PR #44473) marks state-dict parameters (now meta tensors) as _is_hf_initialized = True. initialize_weights() then runs: for RotaryEmbedding, inv_freq and original_inv_freq are non-persistent buffers, so they are not in state_dict() and not marked, and _init_weights correctly computes their values and copies them into the real CPU zero tensors.

Accelerate's fsdp2_prepare_model then:
- Saves non-persistent buffers (now correctly initialized by _init_weights) from each rank.
- Moves the model to meta, which is a no-op for parameters that were already on meta.
- Applies fully_shard.
- Calls fsdp2_load_full_state_dict, which broadcasts from rank 0 to all ranks, so parameters receive correct values.
- Restores non-persistent buffers from each rank's saved copy.

The original NaN bug is still fixed: parameters that _init_weights skips (marked as initialized) are subsequently overwritten by the broadcast with rank-0's values. The difference from #45050 is that we never pay the cost of materializing them on non-rank-0 in the first place.

The fix is correct, targeted, and eliminates the OOM without reintroducing the NaN regression (I have confirmed this). 🤗

Cyrilvallez · 2026-05-11T05:13:14Z

@albertvillanova @AmineDiro I just pushed what is to me the correct fix: basically, only move non-persistent buffers. Could you double-check? I'm not 100% familiar with how FSDP2 works

albertvillanova · 2026-05-18T08:17:45Z

Thanks for addressing this as well, @Cyrilvallez.

May I recommend running one test with FSDP2 + cpu_ram_efficient_loading on a model that has at least one persistent buffer? Maybe something like DeepSeek V3, or a toy model created with self.register_buffer("bias", torch.zeros(n)) (no persistent=False).

If the forward pass succeeds and the persistent buffer has the correct values from rank-0, the fix is confirmed. If it crashes with AttributeError or produces wrong values, the named_non_persistent_buffers() line should be changed back to named_buffers(). Note that the memory cost of persistent buffers is negligible, and the OOM was caused entirely by parameters, which are no longer materialized.
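A toy model along the lines suggested above might look like the following sketch. The class name and sizes are hypothetical, and this only shows how the two kinds of buffer differ with respect to `state_dict()`; it does not exercise FSDP2 itself.

```python
import torch
from torch import nn


class ToyWithBuffers(nn.Module):
    """Hypothetical toy model with one persistent and one non-persistent buffer."""

    def __init__(self, n=4):
        super().__init__()
        self.linear = nn.Linear(n, n)
        # Persistent by default: saved in state_dict(), so it is part of the
        # full state dict that rank 0 holds.
        self.register_buffer("bias", torch.zeros(n))
        # Non-persistent: excluded from state_dict(), so each rank must keep its own.
        self.register_buffer("inv_freq", torch.ones(n), persistent=False)

    def forward(self, x):
        return self.linear(x) + self.bias


model = ToyWithBuffers()
state_keys = set(model.state_dict().keys())
all_buffers = {name for name, _ in model.named_buffers()}

persistent = all_buffers & state_keys
non_persistent = all_buffers - state_keys

print("persistent:", persistent)
print("non-persistent:", non_persistent)

# A forward pass touches the persistent buffer, so it must hold real values.
out = model(torch.randn(2, 4))
```

In an actual FSDP2 run, the check would be that `bias` on every rank matches rank 0 after prepare and that the forward pass completes.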

AmineDiro · 2026-05-19T17:54:05Z

@Cyrilvallez I think 5623608 should also work. This was a regression, so I just fixed it back to the previous non-zeroed memory. But it should work fine 👍🏼

Cyrilvallez · 2026-05-25T07:52:06Z

@albertvillanova Did you have a specific test in mind? I don't have an env set up with FSDP right now, so I'd appreciate it if you could quickly try it out 🙏🤗

Skip CPU param materialization on non-rank-0 FSDP ranks to avoid OOM

74480d4

albertvillanova approved these changes Apr 28, 2026


Comment thread src/transformers/modeling_utils.py Outdated

This was referenced May 6, 2026

🛣️ Path to 30B MoE long-context SFT training huggingface/trl#5712

Closed

🛣️ Path to 30B MoE long-context SFT training huggingface/trl#5713

Open

fix it

5623608

Merge branch 'main' into fix-fsdp2-cpu-ram-zeros-like

50f2f88

Merge branch 'main' into fix-fsdp2-cpu-ram-zeros-like

323dec3

AmineDiro wants to merge 4 commits into huggingface:main from AmineDiro:fix-fsdp2-cpu-ram-zeros-like

6 participants