[peft] feat: validate PEFT target modules with deferred init#3587
Merged
Conversation
Re-introduces the target_modules validation from #1747 while fixing the post-init bug that caused finetune-tests breakage on recipes that mutate ``target_modules`` after construction (e.g. llama3 405B/70B perf, qwen3 moe). Alias bookkeeping now lives behind ``_init_target_match_state`` and is rebuilt at ``PEFT.__call__`` time so it reflects the current value of ``self.target_modules`` rather than a snapshot of the defaults captured at dataclass construction. CanonicalLoRA still builds ``canonical_mapping`` eagerly in __post_init__ (now via ``_init_target_match_state``) so the linear_qkv/linear_fc1 asserts and any external reads of ``canonical_mapping`` keep working, and the same method re-runs at apply time so post-construction mutation of ``target_modules`` is also respected. Adds regression tests for both the typo case and the mutate-after-construction case for plain LoRA and CanonicalLoRA. Signed-off-by: Chen Cui <chcui@nvidia.com>
Co-authored-by: claude[bot] <209825114+claude[bot]@users.noreply.github.com> Signed-off-by: Chen Cui <cxcui@alumni.cmu.edu>
Contributor
Author
|
/ok to test a0f112a |
yaoyu-33
reviewed
Apr 30, 2026
Downgrade the missed-target check from ValueError to a logger.warning so recipes whose ``target_modules`` defaults are wider than the model exposes (e.g. CanonicalLoRA defaults vs. a model without fused linear_qkv) keep working without per-recipe overrides while still surfacing typos in the logs. Update the LoRA / CanonicalLoRA unit tests to assert on the warning record via caplog instead of pytest.raises. Signed-off-by: Chen Cui <chcui@nvidia.com>
Contributor
Author
|
/ok to test fddb1af |
yaoyu-33
approved these changes
May 3, 2026
vasunvidia
pushed a commit
to vasunvidia/Megatron-Bridge
that referenced
this pull request
Jun 10, 2026
…NeMo#3587) Signed-off-by: Chen Cui <chcui@nvidia.com> Signed-off-by: Chen Cui <cxcui@alumni.cmu.edu> Co-authored-by: claude[bot] <209825114+claude[bot]@users.noreply.github.com> Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Re-introduces the
target_modulestypo-detection from #1747 (which was reverted in #3583) while fixing the post-init bug that broke finetune-perf tests on recipes that mutatetarget_modulesafter constructing the PEFT object.Why #1747 broke
ModuleMatcher.__post_init__snapshotted the alias bookkeeping at dataclass construction time:But several recipes construct the PEFT with the defaults and then narrow the targets. For example,
src/megatron/bridge/recipes/llama/llama3.py:After this mutation,
_alias_matchesstill contained the four default keys (linear_qkv,linear_proj,linear_fc1,linear_fc2), but the validation walk only iterates the currenttarget_modules(["linear_qkv"]), so three keys were forever empty and validation raisedValueError: No modules matched the requested target_modules entries: linear_fc1, linear_fc2, linear_projon every PEFT recipe that narrowed defaults.Fix
Defer all alias bookkeeping to
_init_target_match_state, which is called fromPEFT.__call__immediately before the validation walk. This always reflects the current value ofself.target_modules, regardless of post-construction mutation.CanonicalLoRAkeeps an eager build in__post_init__(now a thin wrapper that calls_init_target_match_state) so the earlylinear_qkv/linear_fc1asserts and any caller that inspectscanonical_mappingbefore applying the PEFT still work. The same method re-runs at apply time, so canonical_mapping is also rebuilt from the currenttarget_modules.Files
src/megatron/bridge/peft/module_matcher.py— move alias init out of__post_init__into_init_target_match_state; addregister_target_alias,_record_match,_reset_target_match_state,_validate_target_matches.src/megatron/bridge/peft/base.py—PEFT.__call__invokes_init_target_match_statethen runs the validation walk before transform.src/megatron/bridge/peft/canonical_lora.py— replace__post_init__body with deferred build; keep eager call from__post_init__for backward compatibility.tests/unit_tests/peft/test_lora.py,tests/unit_tests/peft/test_canonical_lora.py— regression tests for both the typo case (matches feat: Validate PEFT target modules #1747's intent) and the mutate-after-construction case (the actual bug fixed).Test plan
pytest tests/unit_tests/peft/test_lora.py tests/unit_tests/peft/test_canonical_lora.py -v).llama3_70b_8gpu_h100_bf16_50steps_perf_finetune_lora,llama3_405b_*_lora,qwen3_moe_*_lora) — they all narrowtarget_modulesafter construction and should now pass instead of raisingValueError: No modules matched the requested target_modules entries.test_finetune_lora,test_finetune_dora) continue to pass.Follow-ups not in this PR
all_reduce(so wildcards like*.layers.0.*.linear_qkvand Mamba+Transformertarget_moduleslists also work cleanly).PEFT.params_to_save: set[str]from the YAML config dump (rename to_params_to_savesoConfigContainer.to_dict's underscore filter skips it; this fixes the latentObject of type set is not JSON serializablefailure in CI's after-script).cc @yaoyu-33