Skip to content

[peft] feat: validate PEFT target modules with deferred init#3587

Merged
cuichenx merged 4 commits into
mainfrom
chcui/peft-validate-target-modules-deferred-init
May 3, 2026
Merged

[peft] feat: validate PEFT target modules with deferred init#3587
cuichenx merged 4 commits into
mainfrom
chcui/peft-validate-target-modules-deferred-init

Conversation

@cuichenx

Copy link
Copy Markdown
Contributor

Summary

Re-introduces the target_modules typo-detection from #1747 (which was reverted in #3583) while fixing the post-init bug that broke finetune-perf tests on recipes that mutate target_modules after constructing the PEFT object.

Why #1747 broke

ModuleMatcher.__post_init__ snapshotted the alias bookkeeping at dataclass construction time:

def __post_init__(self) -> None:
    for target in self.target_modules or []:
        self.register_target_alias(target, target)

But several recipes construct the PEFT with the defaults and then narrow the targets. For example, src/megatron/bridge/recipes/llama/llama3.py:

peft_cfg.target_modules = ["linear_qkv"]   # 405B/70B perf recipes

After this mutation, _alias_matches still contained the four default keys (linear_qkv, linear_proj, linear_fc1, linear_fc2), but the validation walk only iterates the current target_modules (["linear_qkv"]), so three keys were forever empty and validation raised ValueError: No modules matched the requested target_modules entries: linear_fc1, linear_fc2, linear_proj on every PEFT recipe that narrowed defaults.

Fix

Defer all alias bookkeeping to _init_target_match_state, which is called from PEFT.__call__ immediately before the validation walk. This always reflects the current value of self.target_modules, regardless of post-construction mutation.

CanonicalLoRA keeps an eager build in __post_init__ (now a thin wrapper that calls _init_target_match_state) so the early linear_qkv / linear_fc1 asserts and any caller that inspects canonical_mapping before applying the PEFT still work. The same method re-runs at apply time, so canonical_mapping is also rebuilt from the current target_modules.

Files

  • src/megatron/bridge/peft/module_matcher.py — move alias init out of __post_init__ into _init_target_match_state; add register_target_alias, _record_match, _reset_target_match_state, _validate_target_matches.
  • src/megatron/bridge/peft/base.pyPEFT.__call__ invokes _init_target_match_state then runs the validation walk before transform.
  • src/megatron/bridge/peft/canonical_lora.py — replace __post_init__ body with deferred build; keep eager call from __post_init__ for backward compatibility.
  • tests/unit_tests/peft/test_lora.py, tests/unit_tests/peft/test_canonical_lora.py — regression tests for both the typo case (matches feat: Validate PEFT target modules #1747's intent) and the mutate-after-construction case (the actual bug fixed).

Test plan

  • New unit tests cover both typo detection and post-construction mutation (pytest tests/unit_tests/peft/test_lora.py tests/unit_tests/peft/test_canonical_lora.py -v).
  • Re-run the previously failing perf tests (llama3_70b_8gpu_h100_bf16_50steps_perf_finetune_lora, llama3_405b_*_lora, qwen3_moe_*_lora) — they all narrow target_modules after construction and should now pass instead of raising ValueError: No modules matched the requested target_modules entries.
  • Existing PEFT functional tests (test_finetune_lora, test_finetune_dora) continue to pass.

Follow-ups not in this PR

  • A separate PR to make the validation tolerant of pipeline parallelism / hybrid models by aggregating per-alias match counts across the PP group with a tiny all_reduce (so wildcards like *.layers.0.*.linear_qkv and Mamba+Transformer target_modules lists also work cleanly).
  • A separate PR to drop PEFT.params_to_save: set[str] from the YAML config dump (rename to _params_to_save so ConfigContainer.to_dict's underscore filter skips it; this fixes the latent Object of type set is not JSON serializable failure in CI's after-script).

cc @yaoyu-33

Re-introduces the target_modules validation from #1747 while fixing the
post-init bug that caused finetune-tests breakage on recipes that mutate
``target_modules`` after construction (e.g. llama3 405B/70B perf, qwen3
moe). Alias bookkeeping now lives behind ``_init_target_match_state`` and
is rebuilt at ``PEFT.__call__`` time so it reflects the current value of
``self.target_modules`` rather than a snapshot of the defaults captured
at dataclass construction.

CanonicalLoRA still builds ``canonical_mapping`` eagerly in __post_init__
(now via ``_init_target_match_state``) so the linear_qkv/linear_fc1
asserts and any external reads of ``canonical_mapping`` keep working,
and the same method re-runs at apply time so post-construction mutation
of ``target_modules`` is also respected.

Adds regression tests for both the typo case and the
mutate-after-construction case for plain LoRA and CanonicalLoRA.

Signed-off-by: Chen Cui <chcui@nvidia.com>
@copy-pr-bot

copy-pr-bot Bot commented Apr 29, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Comment thread src/megatron/bridge/peft/canonical_lora.py Outdated
Co-authored-by: claude[bot] <209825114+claude[bot]@users.noreply.github.com>
Signed-off-by: Chen Cui <cxcui@alumni.cmu.edu>
@cuichenx

Copy link
Copy Markdown
Contributor Author

/ok to test a0f112a

Comment thread src/megatron/bridge/peft/module_matcher.py Outdated
@yaoyu-33 yaoyu-33 added the waiting-on-customer Waiting on the original author to respond label Apr 30, 2026
Downgrade the missed-target check from ValueError to a logger.warning so
recipes whose ``target_modules`` defaults are wider than the model
exposes (e.g. CanonicalLoRA defaults vs. a model without fused
linear_qkv) keep working without per-recipe overrides while still
surfacing typos in the logs.

Update the LoRA / CanonicalLoRA unit tests to assert on the warning
record via caplog instead of pytest.raises.

Signed-off-by: Chen Cui <chcui@nvidia.com>
@cuichenx

Copy link
Copy Markdown
Contributor Author

/ok to test fddb1af

@cuichenx cuichenx requested a review from yaoyu-33 April 30, 2026 23:12
@cuichenx cuichenx enabled auto-merge (squash) May 1, 2026 23:54
@cuichenx cuichenx merged commit df232f8 into main May 3, 2026
149 of 151 checks passed
@cuichenx cuichenx deleted the chcui/peft-validate-target-modules-deferred-init branch May 3, 2026 04:00
vasunvidia pushed a commit to vasunvidia/Megatron-Bridge that referenced this pull request Jun 10, 2026
…NeMo#3587)

Signed-off-by: Chen Cui <chcui@nvidia.com>
Signed-off-by: Chen Cui <cxcui@alumni.cmu.edu>
Co-authored-by: claude[bot] <209825114+claude[bot]@users.noreply.github.com>
Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

waiting-on-customer Waiting on the original author to respond

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants