Skip to content

Bridging perf from NeMo2 to Mbridge for certain configs#2199

Merged
gautham-kollu merged 12 commits into
mainfrom
gk/bridge_perfs
Feb 11, 2026
Merged

Bridging perf from NeMo2 to Mbridge for certain configs#2199
gautham-kollu merged 12 commits into
mainfrom
gk/bridge_perfs

Conversation

@gautham-kollu

@gautham-kollu gautham-kollu commented Feb 3, 2026

Copy link
Copy Markdown
Contributor

What does this PR do ?

Code changes

  1. Instead of creating one dataset iterator per model_chunk when VP is enabled, creates one and making the other things shallow point the first one. Previously this was already implemented for finetuning, extending this to pretraining
  2. Adding a new config to not do numeric_checks post_training. (This is only for advanced users)
  3. Skip computing log metrics if logging is not enabled.
  4. Precompute a few things outside of the training loop to avoid redundant compute

Changelog

  • Add specific line by line info of high level changes in this PR.

GitHub Actions CI

See the CI sectionin the Contributing doc for how to trigger the CI. A Nvidia developer will need to approve and trigger the CI for external contributors.

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

If you haven't finished some of the above items you can still open "Draft" PR.

Additional Information

  • Related to # (issue)

Summary by CodeRabbit

Release Notes

  • New Features

    • Added numeric_checks configuration option for training validation.
  • Improvements

    • Enhanced multi-model training support with improved iterator handling.
    • Optimized training loop logging behavior.
    • Refined fault tolerance and profiling operation execution.
    • Improved checkpoint decision logic with early-exit handling.

Signed-off-by: Gautham Kollu <gkollu@nvidia.com>
@copy-pr-bot

copy-pr-bot Bot commented Feb 3, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Signed-off-by: Gautham Kollu <gkollu@nvidia.com>
@coderabbitai

coderabbitai Bot commented Feb 3, 2026

Copy link
Copy Markdown
Contributor
📝 Walkthrough

Walkthrough

The changes introduce P2P communication support for training, add multi-model chunk data iterator handling, refactor logging and loss scaling logic, simplify data loader initialization, and add early-exit guards to fault-tolerance and profiling modules. A new numeric checks configuration option is also added.

Changes

Cohort / File(s) Summary
Data Loading & Iteration
src/megatron/bridge/data/loaders.py, src/megatron/bridge/training/eval.py
Simplified data iterator initialization by replacing conditional multi-iterator setup with direct calls. Multi-chunk model support added for evaluation path via make_data_iterator_list.
Configuration
src/megatron/bridge/training/config.py
Added new numeric_checks: bool = True field to TrainingConfig class for enabling numeric validation during training.
Training & P2P Communication
src/megatron/bridge/training/train.py
Introduces P2PCommunicator instance creation and propagation through training steps. Adds multi-model data iterator conversion, refactors logging/loss-scaling logic with tensorboard condition gating, and adds checkpoint-saving early-exit guard.
Safety Guards
src/megatron/bridge/training/fault_tolerance.py, src/megatron/bridge/training/profiling.py
Added early-return guards in fault-tolerance callbacks when FT config is absent and in profiling handlers when config is None, making FT and profiling operations no-ops in unconfigured states.

Sequence Diagram(s)

sequenceDiagram
    participant Train as Training Loop
    participant P2P as P2PCommunicator
    participant FB as forward_backward_func
    participant MI as Model Iterators
    participant Logger as Logger

    Train->>P2P: Create P2PCommunicator(pp_group, config)
    Train->>Train: Prepare forward_backward_data_iterator
    alt Multi-model setup
        Train->>MI: make_data_iterator_list(iterator)
        MI->>Train: List of iterators (one per chunk)
    end
    Train->>FB: Call with p2p_communicator & data iterators
    FB->>P2P: Use P2P for peer-to-peer communication
    FB->>Train: Return loss & metrics
    alt Logging enabled
        Train->>Logger: Extract learning_rate from optimizer
        Train->>Logger: Log loss_scale, learning_rate, params_norm
        Logger->>Train: Logging complete
    end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 2
❌ Failed checks (1 warning, 1 inconclusive)
Check name Status Explanation Resolution
Test Results For Major Changes ⚠️ Warning PR description lacks required documentation: no concrete overview, no actual changelog entries, no test results, no performance benchmarks, and multiple unresolved review comments. Complete PR description with concrete overview, actual changelog entries, test results, before-and-after performance benchmarks, configuration context, and address all open review comments.
Title check ❓ Inconclusive The title is vague and generic, using broad terms like 'perf' and 'certain configs' without clearly identifying the main technical change or which specific components are being modified. Revise the title to be more specific and descriptive of the main technical change, such as identifying the key modifications (e.g., 'Simplify data iterator construction for virtual pipeline parallelism') or which specific performance optimization is being addressed.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch gk/bridge_perfs

Important

Action Needed: IP Allowlist Update

If your organization protects your Git platform with IP whitelisting, please add the new CodeRabbit IP address to your allowlist:

  • 136.113.208.247/32 (new)
  • 34.170.211.100/32
  • 35.222.179.152/32

Reviews will stop working after February 8, 2026 if the new IP is not added to your allowlist.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@gautham-kollu

Copy link
Copy Markdown
Contributor Author

/ok to test d694715

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
src/megatron/bridge/data/loaders.py (1)

360-382: ⚠️ Potential issue | 🟡 Minor

Update setup_data_iterators docstring to match single-pass behavior.

The function no longer builds per-virtual-stage iterators, but the docstring still says it does. Please align the docstring (and return description) with the new behavior to avoid confusion.

📝 Suggested docstring update
-    Calls `build_train_valid_test_data_iterators` potentially multiple times
-    if virtual pipeline parallelism is used, creating separate iterators for each
-    virtual stage.
+    Builds train/valid/test iterators in a single call and returns them directly.
+    Virtual pipeline parallelism is handled downstream by iterator wrapping.
@@
-        Each element can be a single iterator or a list of iterators if virtual
-        pipeline parallelism is enabled.
+        Each element is a single iterator; downstream code may wrap it for virtual
+        pipeline parallelism when needed.
src/megatron/bridge/training/train.py (1)

713-721: ⚠️ Potential issue | 🟠 Major

Skip make_data_iterator_list when the iterator is already a list.

External dataloaders may already provide a list of iterators for VPP. Wrapping a list will cause make_data_iterator_list to call next() on a list and fail.

🐛 Suggested fix
-        if len(model) > 1:
+        if len(model) > 1 and not isinstance(forward_backward_data_iterator, list):
             # As MLM, expects a list of iterators for virtual pipeline parallelism. One iterator per model chunk.
             forward_backward_data_iterator = make_data_iterator_list(
                 model=model,
                 data_iterator=forward_backward_data_iterator,
             )
🤖 Fix all issues with AI agents
In `@src/megatron/bridge/training/eval.py`:
- Around line 137-150: The guard around make_data_iterator_list should avoid
re-wrapping pre-built lists of iterators; change the branch that currently does
make_data_iterator_list(model=model, data_iterator=eval_data_iterator) to only
call make_data_iterator_list when eval_data_iterator is not already a sequence
of iterators (e.g., check isinstance(eval_data_iterator, (list, tuple)) or
similar) so that prepare_finetuning_batch’s returned eval_data_iterator is left
untouched when it is a list/tuple; reference prepare_finetuning_batch,
eval_data_iterator, make_data_iterator_list, and model when applying this
conditional check.

In `@src/megatron/bridge/training/train.py`:
- Line 221: Remove the commented-out debug toggles to eliminate dead code:
delete the lines containing the commented variable name
should_toggle_forward_pre_hook (and any duplicate/commented occurrences around
the other reported location) unless you intentionally want to keep them; if so,
replace the bare commented lines with a short rationale comment explaining
when/how to enable this debug toggle and why it is preserved (e.g., "kept for
temporary debugging of forward pre-hook behavior; remove before merge"). Ensure
you update any nearby comments in the same function or module (train.py) to
reflect the change.
- Around line 1166-1167: The early return guarded by "if not
state.cfg.checkpoint.save: return False" prevents the exit-signal, duration,
interval, and straggler checks from running; instead of returning immediately
when state.cfg.checkpoint.save is false, skip only the checkpoint save action
and still run the existing exit-condition checks (exit-signal, duration,
interval, straggler) before returning; locate the block that references
state.cfg.checkpoint.save in train.py and refactor so checkpoint saving is
conditional but the exit-checking logic (the functions/conditions that evaluate
exit signals, duration, interval and stragglers) always executes regardless of
checkpoint.save.
- Around line 770-778: The check for update_successful must always be
synchronized across model-parallel ranks: move the call to
logical_and_across_model_parallel_group(update_successful,
mp_group=pg_collection.mp) out of the train_config.numeric_checks conditional so
update_successful is reduced unconditionally; leave the grad_norm and
num_zeros_in_grad reductions (reduce_max_stat_across_model_parallel_group)
inside the if train_config.numeric_checks block so those stats are only gathered
when numeric checks are enabled.
- Around line 470-506: The current outer check uses
config.logger.tensorboard_dir to gate all logging, which unintentionally
disables WandB/MLflow/console logging; remove or loosen that gate so
training_log(...) is invoked regardless of tensorboard_dir (or replace the
condition with a check that any sink is enabled, e.g.,
config.logger.use_tensorboard or config.logger.use_wandb or
config.logger.use_mlflow), leaving the inner logic that computes loss_scale,
params_norm, learning_rate, etc., intact; keep the call to training_log(...) so
it can decide which sinks actually log.

Comment on lines +137 to 150
eval_data_iterator, seq_length = prepare_finetuning_batch(
data_iterator=data_iterator,
num_microbatches=eval_num_microbatches,
default_seq_length=state.cfg.model.seq_length,
seq_key="tokens",
)

if len(model) > 1:
# Convert to list of iterators for virtual pipeline parallelism
# With virtual PP, each model chunk needs independent access to the same microbatch
eval_data_iterator = make_data_iterator_list(
model=model,
data_iterator=eval_microbatch_iterator,
data_iterator=eval_data_iterator,
)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Avoid wrapping pre-built iterator lists in make_data_iterator_list.

When data_iterator is already a list (external dataloaders or pre-wrapped VPP), make_data_iterator_list will treat the list as an iterator and next() will fail. Guard the wrapping to avoid a TypeError.

🐛 Suggested fix
-            if len(model) > 1:
+            if len(model) > 1 and not isinstance(eval_data_iterator, list):
                 # Convert to list of iterators for virtual pipeline parallelism
                 # With virtual PP, each model chunk needs independent access to the same microbatch
                 eval_data_iterator = make_data_iterator_list(
                     model=model,
                     data_iterator=eval_data_iterator,
                 )
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
eval_data_iterator, seq_length = prepare_finetuning_batch(
data_iterator=data_iterator,
num_microbatches=eval_num_microbatches,
default_seq_length=state.cfg.model.seq_length,
seq_key="tokens",
)
if len(model) > 1:
# Convert to list of iterators for virtual pipeline parallelism
# With virtual PP, each model chunk needs independent access to the same microbatch
eval_data_iterator = make_data_iterator_list(
model=model,
data_iterator=eval_microbatch_iterator,
data_iterator=eval_data_iterator,
)
eval_data_iterator, seq_length = prepare_finetuning_batch(
data_iterator=data_iterator,
num_microbatches=eval_num_microbatches,
default_seq_length=state.cfg.model.seq_length,
seq_key="tokens",
)
if len(model) > 1 and not isinstance(eval_data_iterator, list):
# Convert to list of iterators for virtual pipeline parallelism
# With virtual PP, each model chunk needs independent access to the same microbatch
eval_data_iterator = make_data_iterator_list(
model=model,
data_iterator=eval_data_iterator,
)
🤖 Prompt for AI Agents
In `@src/megatron/bridge/training/eval.py` around lines 137 - 150, The guard
around make_data_iterator_list should avoid re-wrapping pre-built lists of
iterators; change the branch that currently does
make_data_iterator_list(model=model, data_iterator=eval_data_iterator) to only
call make_data_iterator_list when eval_data_iterator is not already a sequence
of iterators (e.g., check isinstance(eval_data_iterator, (list, tuple)) or
similar) so that prepare_finetuning_batch’s returned eval_data_iterator is left
untouched when it is a list/tuple; reference prepare_finetuning_batch,
eval_data_iterator, make_data_iterator_list, and model when applying this
conditional check.

Comment thread src/megatron/bridge/training/train.py Outdated
config.optimizer.use_distributed_optimizer,
config.ddp.overlap_param_gather,
)
#should_toggle_forward_pre_hook = False

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Remove commented-out debug toggles.

These commented lines add noise and violate the no-dead-code rule. Either remove them or add a rationale explaining why they must remain commented out.

🧹 Suggested cleanup
-    `#should_toggle_forward_pre_hook` = False
@@
-        `#if` config.train.rampup_batch_size is not None:

As per coding guidelines: If code is commented out, include a comment describing its usage and why it is commented out; otherwise remove it as debug code before merging.

Also applies to: 309-309

🤖 Prompt for AI Agents
In `@src/megatron/bridge/training/train.py` at line 221, Remove the commented-out
debug toggles to eliminate dead code: delete the lines containing the commented
variable name should_toggle_forward_pre_hook (and any duplicate/commented
occurrences around the other reported location) unless you intentionally want to
keep them; if so, replace the bare commented lines with a short rationale
comment explaining when/how to enable this debug toggle and why it is preserved
(e.g., "kept for temporary debugging of forward pre-hook behavior; remove before
merge"). Ensure you update any nearby comments in the same function or module
(train.py) to reflect the change.

Comment thread src/megatron/bridge/training/train.py Outdated
Comment on lines +470 to +506
if config.logger.tensorboard_dir is not None: # Skip logging as tensorboard logging is disabled.
if hasattr(optimizer, "is_stub_optimizer") and not optimizer.is_stub_optimizer:
loss_scale = optimizer.get_loss_scale().item()
else:
learning_rate = param_group["lr"]
report_memory_flag = training_log(
loss_dict,
total_loss_dict,
learning_rate,
decoupled_learning_rate,
loss_scale,
report_memory_flag,
skipped_iter,
grad_norm,
params_norm,
num_zeros_in_grad,
config,
global_state,
history_wct,
model,
log_max_attention_logit,
)

loss_scale = 1.0
params_norm = None

if config.logger.log_params_norm:
params_norm = calc_params_l2_norm(model, model_config, use_megatron_fsdp=config.dist.use_megatron_fsdp)

learning_rate = None
decoupled_learning_rate = None
for param_group in optimizer.param_groups:
if len(param_group) == 0:
continue
if param_group["is_decoupled_lr"]:
decoupled_learning_rate = param_group["lr"]
else:
learning_rate = param_group["lr"]

report_memory_flag = training_log(
loss_dict,
total_loss_dict,
learning_rate,
decoupled_learning_rate,
loss_scale,
report_memory_flag,
skipped_iter,
grad_norm,
params_norm,
num_zeros_in_grad,
config,
global_state,
history_wct,
model,
log_max_attention_logit,
)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Don’t gate all logging on tensorboard_dir.

This disables WandB/MLflow and console logging when tensorboard_dir is unset (default), which is a behavior regression. Gate on any enabled sink (or keep the call and let training_log decide).

🔧 Suggested fix
-        if config.logger.tensorboard_dir is not None: # Skip logging as tensorboard logging is disabled.
+        logging_enabled = (
+            config.logger.tensorboard_dir is not None
+            or global_state.wandb_logger is not None
+            or global_state.mlflow_logger is not None
+            or config.logger.log_progress
+        )
+        if logging_enabled:
             if hasattr(optimizer, "is_stub_optimizer") and not optimizer.is_stub_optimizer:
                 loss_scale = optimizer.get_loss_scale().item()
             else:
                 loss_scale = 1.0
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if config.logger.tensorboard_dir is not None: # Skip logging as tensorboard logging is disabled.
if hasattr(optimizer, "is_stub_optimizer") and not optimizer.is_stub_optimizer:
loss_scale = optimizer.get_loss_scale().item()
else:
learning_rate = param_group["lr"]
report_memory_flag = training_log(
loss_dict,
total_loss_dict,
learning_rate,
decoupled_learning_rate,
loss_scale,
report_memory_flag,
skipped_iter,
grad_norm,
params_norm,
num_zeros_in_grad,
config,
global_state,
history_wct,
model,
log_max_attention_logit,
)
loss_scale = 1.0
params_norm = None
if config.logger.log_params_norm:
params_norm = calc_params_l2_norm(model, model_config, use_megatron_fsdp=config.dist.use_megatron_fsdp)
learning_rate = None
decoupled_learning_rate = None
for param_group in optimizer.param_groups:
if len(param_group) == 0:
continue
if param_group["is_decoupled_lr"]:
decoupled_learning_rate = param_group["lr"]
else:
learning_rate = param_group["lr"]
report_memory_flag = training_log(
loss_dict,
total_loss_dict,
learning_rate,
decoupled_learning_rate,
loss_scale,
report_memory_flag,
skipped_iter,
grad_norm,
params_norm,
num_zeros_in_grad,
config,
global_state,
history_wct,
model,
log_max_attention_logit,
)
logging_enabled = (
config.logger.tensorboard_dir is not None
or global_state.wandb_logger is not None
or global_state.mlflow_logger is not None
or config.logger.log_progress
)
if logging_enabled:
if hasattr(optimizer, "is_stub_optimizer") and not optimizer.is_stub_optimizer:
loss_scale = optimizer.get_loss_scale().item()
else:
loss_scale = 1.0
params_norm = None
if config.logger.log_params_norm:
params_norm = calc_params_l2_norm(model, model_config, use_megatron_fsdp=config.dist.use_megatron_fsdp)
learning_rate = None
decoupled_learning_rate = None
for param_group in optimizer.param_groups:
if len(param_group) == 0:
continue
if param_group["is_decoupled_lr"]:
decoupled_learning_rate = param_group["lr"]
else:
learning_rate = param_group["lr"]
report_memory_flag = training_log(
loss_dict,
total_loss_dict,
learning_rate,
decoupled_learning_rate,
loss_scale,
report_memory_flag,
skipped_iter,
grad_norm,
params_norm,
num_zeros_in_grad,
config,
global_state,
history_wct,
model,
log_max_attention_logit,
)
🤖 Prompt for AI Agents
In `@src/megatron/bridge/training/train.py` around lines 470 - 506, The current
outer check uses config.logger.tensorboard_dir to gate all logging, which
unintentionally disables WandB/MLflow/console logging; remove or loosen that
gate so training_log(...) is invoked regardless of tensorboard_dir (or replace
the condition with a check that any sink is enabled, e.g.,
config.logger.use_tensorboard or config.logger.use_wandb or
config.logger.use_mlflow), leaving the inner logic that computes loss_scale,
params_norm, learning_rate, etc., intact; keep the call to training_log(...) so
it can decide which sinks actually log.

Comment thread src/megatron/bridge/training/train.py Outdated
Comment on lines +770 to +778
if train_config.numeric_checks:
update_successful = logical_and_across_model_parallel_group(update_successful, mp_group=pg_collection.mp)
# grad_norm and num_zeros_in_grad will be None on ranks without trainable params,
# so we must gather across mp ranks
grad_norm = reduce_max_stat_across_model_parallel_group(grad_norm, mp_group=pg_collection.mp)
if optim_config.log_num_zeros_in_grad:
num_zeros_in_grad = reduce_max_stat_across_model_parallel_group(
num_zeros_in_grad, mp_group=pg_collection.mp
)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Keep update_successful synchronized regardless of numeric_checks.

If numeric checks are disabled, ranks can diverge on update_successful, causing inconsistent scheduler stepping. You can still gate grad/zero-count reductions while always synchronizing update_successful.

✅ Suggested fix
-    if train_config.numeric_checks:
-        update_successful = logical_and_across_model_parallel_group(update_successful, mp_group=pg_collection.mp)
+    update_successful = logical_and_across_model_parallel_group(update_successful, mp_group=pg_collection.mp)
+    if train_config.numeric_checks:
         # grad_norm and num_zeros_in_grad will be None on ranks without trainable params,
         # so we must gather across mp ranks
         grad_norm = reduce_max_stat_across_model_parallel_group(grad_norm, mp_group=pg_collection.mp)
         if optim_config.log_num_zeros_in_grad:
             num_zeros_in_grad = reduce_max_stat_across_model_parallel_group(
                 num_zeros_in_grad, mp_group=pg_collection.mp
             )
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if train_config.numeric_checks:
update_successful = logical_and_across_model_parallel_group(update_successful, mp_group=pg_collection.mp)
# grad_norm and num_zeros_in_grad will be None on ranks without trainable params,
# so we must gather across mp ranks
grad_norm = reduce_max_stat_across_model_parallel_group(grad_norm, mp_group=pg_collection.mp)
if optim_config.log_num_zeros_in_grad:
num_zeros_in_grad = reduce_max_stat_across_model_parallel_group(
num_zeros_in_grad, mp_group=pg_collection.mp
)
update_successful = logical_and_across_model_parallel_group(update_successful, mp_group=pg_collection.mp)
if train_config.numeric_checks:
# grad_norm and num_zeros_in_grad will be None on ranks without trainable params,
# so we must gather across mp ranks
grad_norm = reduce_max_stat_across_model_parallel_group(grad_norm, mp_group=pg_collection.mp)
if optim_config.log_num_zeros_in_grad:
num_zeros_in_grad = reduce_max_stat_across_model_parallel_group(
num_zeros_in_grad, mp_group=pg_collection.mp
)
🤖 Prompt for AI Agents
In `@src/megatron/bridge/training/train.py` around lines 770 - 778, The check for
update_successful must always be synchronized across model-parallel ranks: move
the call to logical_and_across_model_parallel_group(update_successful,
mp_group=pg_collection.mp) out of the train_config.numeric_checks conditional so
update_successful is reduced unconditionally; leave the grad_norm and
num_zeros_in_grad reductions (reduce_max_stat_across_model_parallel_group)
inside the if train_config.numeric_checks block so those stats are only gathered
when numeric checks are enabled.

Comment thread src/megatron/bridge/training/train.py Outdated
Comment on lines +1166 to +1167
if not state.cfg.checkpoint.save:
return False

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Exit conditions should still apply without checkpoint saving.

The early return skips exit-signal, duration, interval, and straggler checks when checkpoint.save is unset. That can cause runs to ignore exit conditions entirely.

🛠️ Suggested fix
-    if not state.cfg.checkpoint.save:
-        return False
-
🤖 Prompt for AI Agents
In `@src/megatron/bridge/training/train.py` around lines 1166 - 1167, The early
return guarded by "if not state.cfg.checkpoint.save: return False" prevents the
exit-signal, duration, interval, and straggler checks from running; instead of
returning immediately when state.cfg.checkpoint.save is false, skip only the
checkpoint save action and still run the existing exit-condition checks
(exit-signal, duration, interval, straggler) before returning; locate the block
that references state.cfg.checkpoint.save in train.py and refactor so checkpoint
saving is conditional but the exit-checking logic (the functions/conditions that
evaluate exit signals, duration, interval and stragglers) always executes
regardless of checkpoint.save.

Signed-off-by: Gautham Kollu <gkollu@nvidia.com>
@gautham-kollu

Copy link
Copy Markdown
Contributor Author

/ok to test 349d5b1

Signed-off-by: Gautham Kollu <gkollu@nvidia.com>
@gautham-kollu

Copy link
Copy Markdown
Contributor Author

/ok to test 1719976

Comment thread src/megatron/bridge/training/train.py Outdated
continue
if param_group["is_decoupled_lr"]:
decoupled_learning_rate = param_group["lr"]
if config.logger.tensorboard_dir is not None: # Skip logging as tensorboard logging is disabled.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

with changes proposed in #2153, this will no longer be true. after 2153, users could log to tensorboard, W&B, and mlflow independently vs the current state where tensorboard is required to log to w&b. we can have some check up front whether logging will be enabled. this will be constant throughout training and can be done outside of the loop

@gautham-kollu gautham-kollu Feb 5, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Next rev

Returns:
NVTX context if nsys profiling was started at this step, None otherwise
"""
if config is None:

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is the first thing checked in should_profile_rank - why do we need to repeat this here?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In some sense it is a bit overboard but the reason I added this was because of OmegaConfig.
If the config is converted from a raw python dictionary to OmegeConf-dict, .get() on omegaconfig-dict starts applying a lot of resolution-rules etc and ends up calling a lot of functions.

Walltime are minimal
Screenshot 2026-02-05 at 12 45 07 AM

Screenshot 2026-02-05 at 12 44 24 AM

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

out of curiosity, when is it an omegaconf during training? shouldn't it already be resolved to a dataclass?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fields in the class could be python raw List or an OmegaConfig.ListConfig

Comment thread src/megatron/bridge/training/config.py Outdated
check_weight_hash_across_dp_replicas_interval: Optional[int] = None
"""Interval to check weight hashes are same across DP replicas. If not specified, weight hashes not checked."""

numeric_checks: bool = True

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the config field name is really broad, so it's impossible to tell exactly which checks this disables based on the name alone

@gautham-kollu gautham-kollu Feb 5, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Next rev

Signed-off-by: Gautham Kollu <gkollu@nvidia.com>
@gautham-kollu

Copy link
Copy Markdown
Contributor Author

/ok to test 65ea63c

Signed-off-by: Gautham Kollu <gkollu@nvidia.com>
@gautham-kollu

Copy link
Copy Markdown
Contributor Author

/ok to test 0f4b5a7

@gautham-kollu gautham-kollu merged commit 5622b5c into main Feb 11, 2026
55 checks passed
@gautham-kollu gautham-kollu deleted the gk/bridge_perfs branch February 11, 2026 09:08
@gautham-kollu gautham-kollu added cherry-pick r0.3.0 Cherry-pick label for r0.3.0 release branch labels Feb 11, 2026
sowmen pushed a commit to sowmen/Megatron-Bridge that referenced this pull request Feb 11, 2026
…2199)

Signed-off-by: Gautham Kollu <gkollu@nvidia.com>
Signed-off-by: gautham-kollu <gkollu@nvidia.com>
Co-authored-by: Ananth Subramaniam <ansubramania@nvidia.com>
Signed-off-by: sowmen <sowmendipta@gmail.com>
ko3n1g added a commit that referenced this pull request Feb 12, 2026
@gautham-kollu gautham-kollu mentioned this pull request Feb 12, 2026
5 tasks
gautham-kollu added a commit that referenced this pull request Feb 12, 2026
Signed-off-by: gkollu <gkollu@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cherry-pick r0.3.0 Cherry-pick label for r0.3.0 release branch

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants