[peft] feat: validate PEFT target modules with deferred init by cuichenx · Pull Request #3587 · NVIDIA-NeMo/Megatron-Bridge

cuichenx · 2026-04-29T23:56:51Z

Summary

Re-introduces the target_modules typo-detection from #1747 (which was reverted in #3583) while fixing the post-init bug that broke finetune-perf tests on recipes that mutate target_modules after constructing the PEFT object.

Why #1747 broke

ModuleMatcher.__post_init__ snapshotted the alias bookkeeping at dataclass construction time:

def __post_init__(self) -> None:
    for target in self.target_modules or []:
        self.register_target_alias(target, target)

But several recipes construct the PEFT with the defaults and then narrow the targets. For example, src/megatron/bridge/recipes/llama/llama3.py:

peft_cfg.target_modules = ["linear_qkv"]   # 405B/70B perf recipes

After this mutation, _alias_matches still contained the four default keys (linear_qkv, linear_proj, linear_fc1, linear_fc2), but the validation walk only iterates the current target_modules (["linear_qkv"]), so three keys were forever empty and validation raised ValueError: No modules matched the requested target_modules entries: linear_fc1, linear_fc2, linear_proj on every PEFT recipe that narrowed defaults.

Fix

Defer all alias bookkeeping to _init_target_match_state, which is called from PEFT.__call__ immediately before the validation walk. This always reflects the current value of self.target_modules, regardless of post-construction mutation.

CanonicalLoRA keeps an eager build in __post_init__ (now a thin wrapper that calls _init_target_match_state) so the early linear_qkv / linear_fc1 asserts and any caller that inspects canonical_mapping before applying the PEFT still work. The same method re-runs at apply time, so canonical_mapping is also rebuilt from the current target_modules.

Files

src/megatron/bridge/peft/module_matcher.py — move alias init out of __post_init__ into _init_target_match_state; add register_target_alias, _record_match, _reset_target_match_state, _validate_target_matches.
src/megatron/bridge/peft/base.py — PEFT.__call__ invokes _init_target_match_state then runs the validation walk before transform.
src/megatron/bridge/peft/canonical_lora.py — replace __post_init__ body with deferred build; keep eager call from __post_init__ for backward compatibility.
tests/unit_tests/peft/test_lora.py, tests/unit_tests/peft/test_canonical_lora.py — regression tests for both the typo case (matches feat: Validate PEFT target modules #1747's intent) and the mutate-after-construction case (the actual bug fixed).

Test plan

New unit tests cover both typo detection and post-construction mutation (pytest tests/unit_tests/peft/test_lora.py tests/unit_tests/peft/test_canonical_lora.py -v).
Re-run the previously failing perf tests (llama3_70b_8gpu_h100_bf16_50steps_perf_finetune_lora, llama3_405b_*_lora, qwen3_moe_*_lora) — they all narrow target_modules after construction and should now pass instead of raising ValueError: No modules matched the requested target_modules entries.
Existing PEFT functional tests (test_finetune_lora, test_finetune_dora) continue to pass.

Follow-ups not in this PR

A separate PR to make the validation tolerant of pipeline parallelism / hybrid models by aggregating per-alias match counts across the PP group with a tiny all_reduce (so wildcards like *.layers.0.*.linear_qkv and Mamba+Transformer target_modules lists also work cleanly).
A separate PR to drop PEFT.params_to_save: set[str] from the YAML config dump (rename to _params_to_save so ConfigContainer.to_dict's underscore filter skips it; this fixes the latent Object of type set is not JSON serializable failure in CI's after-script).

cc @yaoyu-33

Re-introduces the target_modules validation from #1747 while fixing the post-init bug that caused finetune-tests breakage on recipes that mutate ``target_modules`` after construction (e.g. llama3 405B/70B perf, qwen3 moe). Alias bookkeeping now lives behind ``_init_target_match_state`` and is rebuilt at ``PEFT.__call__`` time so it reflects the current value of ``self.target_modules`` rather than a snapshot of the defaults captured at dataclass construction. CanonicalLoRA still builds ``canonical_mapping`` eagerly in __post_init__ (now via ``_init_target_match_state``) so the linear_qkv/linear_fc1 asserts and any external reads of ``canonical_mapping`` keep working, and the same method re-runs at apply time so post-construction mutation of ``target_modules`` is also respected. Adds regression tests for both the typo case and the mutate-after-construction case for plain LoRA and CanonicalLoRA. Signed-off-by: Chen Cui <chcui@nvidia.com>

copy-pr-bot · 2026-04-29T23:56:54Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Co-authored-by: claude[bot] <209825114+claude[bot]@users.noreply.github.com> Signed-off-by: Chen Cui <cxcui@alumni.cmu.edu>

cuichenx · 2026-04-30T00:13:11Z

/ok to test a0f112a

…init

Downgrade the missed-target check from ValueError to a logger.warning so recipes whose ``target_modules`` defaults are wider than the model exposes (e.g. CanonicalLoRA defaults vs. a model without fused linear_qkv) keep working without per-recipe overrides while still surfacing typos in the logs. Update the LoRA / CanonicalLoRA unit tests to assert on the warning record via caplog instead of pytest.raises. Signed-off-by: Chen Cui <chcui@nvidia.com>

cuichenx · 2026-04-30T20:13:25Z

/ok to test fddb1af

…NeMo#3587) Signed-off-by: Chen Cui <chcui@nvidia.com> Signed-off-by: Chen Cui <cxcui@alumni.cmu.edu> Co-authored-by: claude[bot] <209825114+claude[bot]@users.noreply.github.com> Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com>

claude Bot reviewed Apr 29, 2026

View reviewed changes

Comment thread src/megatron/bridge/peft/canonical_lora.py Outdated

Update src/megatron/bridge/peft/canonical_lora.py

a0f112a

Co-authored-by: claude[bot] <209825114+claude[bot]@users.noreply.github.com> Signed-off-by: Chen Cui <cxcui@alumni.cmu.edu>

copy-pr-bot Bot temporarily deployed to test April 30, 2026 00:14 Inactive

yaoyu-33 reviewed Apr 30, 2026

View reviewed changes

Comment thread src/megatron/bridge/peft/module_matcher.py Outdated

yaoyu-33 added the waiting-on-customer Waiting on the original author to respond label Apr 30, 2026

cuichenx added 2 commits April 30, 2026 10:53

Merge branch 'main' into chcui/peft-validate-target-modules-deferred-…

97b078a

…init

copy-pr-bot Bot temporarily deployed to test April 30, 2026 20:14 Inactive

cuichenx requested a review from yaoyu-33 April 30, 2026 23:12

cuichenx enabled auto-merge (squash) May 1, 2026 23:54

yaoyu-33 approved these changes May 3, 2026

View reviewed changes

cuichenx merged commit df232f8 into main May 3, 2026
149 of 151 checks passed

cuichenx deleted the chcui/peft-validate-target-modules-deferred-init branch May 3, 2026 04:00

cuichenx mentioned this pull request May 8, 2026

[NeMo FW 26.06 Release] MBridge v0.5.0 Roadmap #3754

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[peft] feat: validate PEFT target modules with deferred init#3587

[peft] feat: validate PEFT target modules with deferred init#3587
cuichenx merged 4 commits into
mainfrom
chcui/peft-validate-target-modules-deferred-init

cuichenx commented Apr 29, 2026

Uh oh!

copy-pr-bot Bot commented Apr 29, 2026

Uh oh!

Uh oh!

cuichenx commented Apr 30, 2026

Uh oh!

Uh oh!

cuichenx commented Apr 30, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

cuichenx commented Apr 29, 2026

Summary

Why #1747 broke

Fix

Files

Test plan

Follow-ups not in this PR

Uh oh!

copy-pr-bot Bot commented Apr 29, 2026

Uh oh!

Uh oh!

cuichenx commented Apr 30, 2026

Uh oh!

Uh oh!

cuichenx commented Apr 30, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants