Bridging perf from NeMo2 to Mbridge for certain configs by gautham-kollu · Pull Request #2199 · NVIDIA-NeMo/Megatron-Bridge

gautham-kollu · 2026-02-03T18:54:15Z

What does this PR do ?

Code changes

Instead of creating one dataset iterator per model_chunk when VP is enabled, creates one and making the other things shallow point the first one. Previously this was already implemented for finetuning, extending this to pretraining
Adding a new config to not do numeric_checks post_training. (This is only for advanced users)
Skip computing log metrics if logging is not enabled.
Precompute a few things outside of the training loop to avoid redundant compute

Changelog

Add specific line by line info of high level changes in this PR.

GitHub Actions CI

See the CI sectionin the Contributing doc for how to trigger the CI. A Nvidia developer will need to approve and trigger the CI for external contributors.

Before your PR is "Ready for review"

Pre checks:

Make sure you read and followed Contributor guidelines
Did you write any new necessary tests?
Did you add or update any necessary documentation?
Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
- Reviewer: Does the PR have correct import guards for all optional libraries?

If you haven't finished some of the above items you can still open "Draft" PR.

Additional Information

Related to # (issue)

Summary by CodeRabbit

Release Notes

New Features
- Added numeric_checks configuration option for training validation.
Improvements
- Enhanced multi-model training support with improved iterator handling.
- Optimized training loop logging behavior.
- Refined fault tolerance and profiling operation execution.
- Improved checkpoint decision logic with early-exit handling.

Signed-off-by: Gautham Kollu <gkollu@nvidia.com>

copy-pr-bot · 2026-02-03T18:54:19Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Signed-off-by: Gautham Kollu <gkollu@nvidia.com>

coderabbitai · 2026-02-03T18:57:04Z

📝 Walkthrough

Walkthrough

The changes introduce P2P communication support for training, add multi-model chunk data iterator handling, refactor logging and loss scaling logic, simplify data loader initialization, and add early-exit guards to fault-tolerance and profiling modules. A new numeric checks configuration option is also added.

Changes

Cohort / File(s)	Summary
Data Loading & Iteration `src/megatron/bridge/data/loaders.py`, `src/megatron/bridge/training/eval.py`	Simplified data iterator initialization by replacing conditional multi-iterator setup with direct calls. Multi-chunk model support added for evaluation path via `make_data_iterator_list`.
Configuration `src/megatron/bridge/training/config.py`	Added new `numeric_checks: bool = True` field to `TrainingConfig` class for enabling numeric validation during training.
Training & P2P Communication `src/megatron/bridge/training/train.py`	Introduces P2PCommunicator instance creation and propagation through training steps. Adds multi-model data iterator conversion, refactors logging/loss-scaling logic with tensorboard condition gating, and adds checkpoint-saving early-exit guard.
Safety Guards `src/megatron/bridge/training/fault_tolerance.py`, `src/megatron/bridge/training/profiling.py`	Added early-return guards in fault-tolerance callbacks when FT config is absent and in profiling handlers when config is None, making FT and profiling operations no-ops in unconfigured states.

Sequence Diagram(s)

sequenceDiagram
    participant Train as Training Loop
    participant P2P as P2PCommunicator
    participant FB as forward_backward_func
    participant MI as Model Iterators
    participant Logger as Logger

    Train->>P2P: Create P2PCommunicator(pp_group, config)
    Train->>Train: Prepare forward_backward_data_iterator
    alt Multi-model setup
        Train->>MI: make_data_iterator_list(iterator)
        MI->>Train: List of iterators (one per chunk)
    end
    Train->>FB: Call with p2p_communicator & data iterators
    FB->>P2P: Use P2P for peer-to-peer communication
    FB->>Train: Return loss & metrics
    alt Logging enabled
        Train->>Logger: Extract learning_rate from optimizer
        Train->>Logger: Log loss_scale, learning_rate, params_norm
        Logger->>Train: Logging complete
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

Check name	Status	Explanation	Resolution
Test Results For Major Changes	⚠️ Warning	PR description lacks required documentation: no concrete overview, no actual changelog entries, no test results, no performance benchmarks, and multiple unresolved review comments.	Complete PR description with concrete overview, actual changelog entries, test results, before-and-after performance benchmarks, configuration context, and address all open review comments.
Title check	❓ Inconclusive	The title is vague and generic, using broad terms like 'perf' and 'certain configs' without clearly identifying the main technical change or which specific components are being modified.	Revise the title to be more specific and descriptive of the main technical change, such as identifying the key modifications (e.g., 'Simplify data iterator construction for virtual pipeline parallelism') or which specific performance optimization is being addressed.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing touches

📝 Generate docstrings

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment
Commit unit tests in branch gk/bridge_perfs

Important

Action Needed: IP Allowlist Update

If your organization protects your Git platform with IP whitelisting, please add the new CodeRabbit IP address to your allowlist:

✨ 136.113.208.247/32 (new)
34.170.211.100/32
35.222.179.152/32

Reviews will stop working after February 8, 2026 if the new IP is not added to your allowlist.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

gautham-kollu · 2026-02-03T19:00:21Z

/ok to test d694715

coderabbitai

Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)

src/megatron/bridge/data/loaders.py (1)
360-382: ⚠️ Potential issue | 🟡 Minor

Update setup_data_iterators docstring to match single-pass behavior.

The function no longer builds per-virtual-stage iterators, but the docstring still says it does. Please align the docstring (and return description) with the new behavior to avoid confusion.
📝 Suggested docstring update
-    Calls `build_train_valid_test_data_iterators` potentially multiple times
-    if virtual pipeline parallelism is used, creating separate iterators for each
-    virtual stage.
+    Builds train/valid/test iterators in a single call and returns them directly.
+    Virtual pipeline parallelism is handled downstream by iterator wrapping.
@@
-        Each element can be a single iterator or a list of iterators if virtual
-        pipeline parallelism is enabled.
+        Each element is a single iterator; downstream code may wrap it for virtual
+        pipeline parallelism when needed.
src/megatron/bridge/training/train.py (1)
713-721: ⚠️ Potential issue | 🟠 Major

Skip make_data_iterator_list when the iterator is already a list.

External dataloaders may already provide a list of iterators for VPP. Wrapping a list will cause make_data_iterator_list to call next() on a list and fail.
🐛 Suggested fix
-        if len(model) > 1:
+        if len(model) > 1 and not isinstance(forward_backward_data_iterator, list):
             # As MLM, expects a list of iterators for virtual pipeline parallelism. One iterator per model chunk.
             forward_backward_data_iterator = make_data_iterator_list(
                 model=model,
                 data_iterator=forward_backward_data_iterator,
             )

🤖 Fix all issues with AI agents

In `@src/megatron/bridge/training/eval.py`:
- Around line 137-150: The guard around make_data_iterator_list should avoid
re-wrapping pre-built lists of iterators; change the branch that currently does
make_data_iterator_list(model=model, data_iterator=eval_data_iterator) to only
call make_data_iterator_list when eval_data_iterator is not already a sequence
of iterators (e.g., check isinstance(eval_data_iterator, (list, tuple)) or
similar) so that prepare_finetuning_batch’s returned eval_data_iterator is left
untouched when it is a list/tuple; reference prepare_finetuning_batch,
eval_data_iterator, make_data_iterator_list, and model when applying this
conditional check.

In `@src/megatron/bridge/training/train.py`:
- Line 221: Remove the commented-out debug toggles to eliminate dead code:
delete the lines containing the commented variable name
should_toggle_forward_pre_hook (and any duplicate/commented occurrences around
the other reported location) unless you intentionally want to keep them; if so,
replace the bare commented lines with a short rationale comment explaining
when/how to enable this debug toggle and why it is preserved (e.g., "kept for
temporary debugging of forward pre-hook behavior; remove before merge"). Ensure
you update any nearby comments in the same function or module (train.py) to
reflect the change.
- Around line 1166-1167: The early return guarded by "if not
state.cfg.checkpoint.save: return False" prevents the exit-signal, duration,
interval, and straggler checks from running; instead of returning immediately
when state.cfg.checkpoint.save is false, skip only the checkpoint save action
and still run the existing exit-condition checks (exit-signal, duration,
interval, straggler) before returning; locate the block that references
state.cfg.checkpoint.save in train.py and refactor so checkpoint saving is
conditional but the exit-checking logic (the functions/conditions that evaluate
exit signals, duration, interval and stragglers) always executes regardless of
checkpoint.save.
- Around line 770-778: The check for update_successful must always be
synchronized across model-parallel ranks: move the call to
logical_and_across_model_parallel_group(update_successful,
mp_group=pg_collection.mp) out of the train_config.numeric_checks conditional so
update_successful is reduced unconditionally; leave the grad_norm and
num_zeros_in_grad reductions (reduce_max_stat_across_model_parallel_group)
inside the if train_config.numeric_checks block so those stats are only gathered
when numeric checks are enabled.
- Around line 470-506: The current outer check uses
config.logger.tensorboard_dir to gate all logging, which unintentionally
disables WandB/MLflow/console logging; remove or loosen that gate so
training_log(...) is invoked regardless of tensorboard_dir (or replace the
condition with a check that any sink is enabled, e.g.,
config.logger.use_tensorboard or config.logger.use_wandb or
config.logger.use_mlflow), leaving the inner logic that computes loss_scale,
params_norm, learning_rate, etc., intact; keep the call to training_log(...) so
it can decide which sinks actually log.

coderabbitai · 2026-02-03T19:06:50Z

+                eval_data_iterator, seq_length = prepare_finetuning_batch(
                    data_iterator=data_iterator,
                    num_microbatches=eval_num_microbatches,
                    default_seq_length=state.cfg.model.seq_length,
                    seq_key="tokens",
                )

+            if len(model) > 1:
                # Convert to list of iterators for virtual pipeline parallelism
                # With virtual PP, each model chunk needs independent access to the same microbatch
                eval_data_iterator = make_data_iterator_list(
                    model=model,
-                    data_iterator=eval_microbatch_iterator,
+                    data_iterator=eval_data_iterator,
                )


⚠️ Potential issue | 🟠 Major

Avoid wrapping pre-built iterator lists in make_data_iterator_list.

When data_iterator is already a list (external dataloaders or pre-wrapped VPP), make_data_iterator_list will treat the list as an iterator and next() will fail. Guard the wrapping to avoid a TypeError.

🐛 Suggested fix

- if len(model) > 1: + if len(model) > 1 and not isinstance(eval_data_iterator, list): # Convert to list of iterators for virtual pipeline parallelism # With virtual PP, each model chunk needs independent access to the same microbatch eval_data_iterator = make_data_iterator_list( model=model, data_iterator=eval_data_iterator, )

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

eval_data_iterator, seq_length = prepare_finetuning_batch(

data_iterator=data_iterator,

num_microbatches=eval_num_microbatches,

default_seq_length=state.cfg.model.seq_length,

seq_key="tokens",

)

if len(model) > 1:

# Convert to list of iterators for virtual pipeline parallelism

# With virtual PP, each model chunk needs independent access to the same microbatch

eval_data_iterator = make_data_iterator_list(

model=model,

data_iterator=eval_microbatch_iterator,

data_iterator=eval_data_iterator,

)

eval_data_iterator, seq_length = prepare_finetuning_batch(

data_iterator=data_iterator,

num_microbatches=eval_num_microbatches,

default_seq_length=state.cfg.model.seq_length,

seq_key="tokens",

)

if len(model) > 1 and not isinstance(eval_data_iterator, list):

# Convert to list of iterators for virtual pipeline parallelism

# With virtual PP, each model chunk needs independent access to the same microbatch

eval_data_iterator = make_data_iterator_list(

model=model,

data_iterator=eval_data_iterator,

)

🤖 Prompt for AI Agents

In `@src/megatron/bridge/training/eval.py` around lines 137 - 150, The guard around make_data_iterator_list should avoid re-wrapping pre-built lists of iterators; change the branch that currently does make_data_iterator_list(model=model, data_iterator=eval_data_iterator) to only call make_data_iterator_list when eval_data_iterator is not already a sequence of iterators (e.g., check isinstance(eval_data_iterator, (list, tuple)) or similar) so that prepare_finetuning_batch’s returned eval_data_iterator is left untouched when it is a list/tuple; reference prepare_finetuning_batch, eval_data_iterator, make_data_iterator_list, and model when applying this conditional check.

coderabbitai · 2026-02-03T19:06:50Z

        config.optimizer.use_distributed_optimizer,
        config.ddp.overlap_param_gather,
    )
+    #should_toggle_forward_pre_hook = False


⚠️ Potential issue | 🟡 Minor

Remove commented-out debug toggles.

These commented lines add noise and violate the no-dead-code rule. Either remove them or add a rationale explaining why they must remain commented out.

🧹 Suggested cleanup

- `#should_toggle_forward_pre_hook` = False @@ - `#if` config.train.rampup_batch_size is not None:

As per coding guidelines: If code is commented out, include a comment describing its usage and why it is commented out; otherwise remove it as debug code before merging.

Also applies to: 309-309

🤖 Prompt for AI Agents

In `@src/megatron/bridge/training/train.py` at line 221, Remove the commented-out debug toggles to eliminate dead code: delete the lines containing the commented variable name should_toggle_forward_pre_hook (and any duplicate/commented occurrences around the other reported location) unless you intentionally want to keep them; if so, replace the bare commented lines with a short rationale comment explaining when/how to enable this debug toggle and why it is preserved (e.g., "kept for temporary debugging of forward pre-hook behavior; remove before merge"). Ensure you update any nearby comments in the same function or module (train.py) to reflect the change.

coderabbitai · 2026-02-03T19:06:50Z

+        if config.logger.tensorboard_dir is not None: # Skip logging as tensorboard logging is disabled.
+            if hasattr(optimizer, "is_stub_optimizer") and not optimizer.is_stub_optimizer:
+                loss_scale = optimizer.get_loss_scale().item()
            else:
-                learning_rate = param_group["lr"]
-        report_memory_flag = training_log(
-            loss_dict,
-            total_loss_dict,
-            learning_rate,
-            decoupled_learning_rate,
-            loss_scale,
-            report_memory_flag,
-            skipped_iter,
-            grad_norm,
-            params_norm,
-            num_zeros_in_grad,
-            config,
-            global_state,
-            history_wct,
-            model,
-            log_max_attention_logit,
-        )
-
+                loss_scale = 1.0
+            params_norm = None
+
+            if config.logger.log_params_norm:
+                params_norm = calc_params_l2_norm(model, model_config, use_megatron_fsdp=config.dist.use_megatron_fsdp)
+
+            learning_rate = None
+            decoupled_learning_rate = None
+            for param_group in optimizer.param_groups:
+                if len(param_group) == 0:
+                    continue
+                if param_group["is_decoupled_lr"]:
+                    decoupled_learning_rate = param_group["lr"]
+                else:
+                    learning_rate = param_group["lr"]
+
+            report_memory_flag = training_log(
+                loss_dict,
+                total_loss_dict,
+                learning_rate,
+                decoupled_learning_rate,
+                loss_scale,
+                report_memory_flag,
+                skipped_iter,
+                grad_norm,
+                params_norm,
+                num_zeros_in_grad,
+                config,
+                global_state,
+                history_wct,
+                model,
+                log_max_attention_logit,
+            )


⚠️ Potential issue | 🟠 Major

Don’t gate all logging on tensorboard_dir.

This disables WandB/MLflow and console logging when tensorboard_dir is unset (default), which is a behavior regression. Gate on any enabled sink (or keep the call and let training_log decide).

🔧 Suggested fix

- if config.logger.tensorboard_dir is not None: # Skip logging as tensorboard logging is disabled. + logging_enabled = ( + config.logger.tensorboard_dir is not None + or global_state.wandb_logger is not None + or global_state.mlflow_logger is not None + or config.logger.log_progress + ) + if logging_enabled: if hasattr(optimizer, "is_stub_optimizer") and not optimizer.is_stub_optimizer: loss_scale = optimizer.get_loss_scale().item() else: loss_scale = 1.0

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

if config.logger.tensorboard_dir is not None: # Skip logging as tensorboard logging is disabled.

if hasattr(optimizer, "is_stub_optimizer") and not optimizer.is_stub_optimizer:

loss_scale = optimizer.get_loss_scale().item()

else:

learning_rate = param_group["lr"]

report_memory_flag = training_log(

loss_dict,

total_loss_dict,

learning_rate,

decoupled_learning_rate,

loss_scale,

report_memory_flag,

skipped_iter,

grad_norm,

params_norm,

num_zeros_in_grad,

config,

global_state,

history_wct,

model,

log_max_attention_logit,

)

loss_scale = 1.0

params_norm = None

if config.logger.log_params_norm:

params_norm = calc_params_l2_norm(model, model_config, use_megatron_fsdp=config.dist.use_megatron_fsdp)

learning_rate = None

decoupled_learning_rate = None

for param_group in optimizer.param_groups:

if len(param_group) == 0:

continue

if param_group["is_decoupled_lr"]:

decoupled_learning_rate = param_group["lr"]

else:

learning_rate = param_group["lr"]

report_memory_flag = training_log(

loss_dict,

total_loss_dict,

learning_rate,

decoupled_learning_rate,

loss_scale,

report_memory_flag,

skipped_iter,

grad_norm,

params_norm,

num_zeros_in_grad,

config,

global_state,

history_wct,

model,

log_max_attention_logit,

)

logging_enabled = (

config.logger.tensorboard_dir is not None

or global_state.wandb_logger is not None

or global_state.mlflow_logger is not None

or config.logger.log_progress

)

if logging_enabled:

if hasattr(optimizer, "is_stub_optimizer") and not optimizer.is_stub_optimizer:

loss_scale = optimizer.get_loss_scale().item()

else:

loss_scale = 1.0

params_norm = None

if config.logger.log_params_norm:

params_norm = calc_params_l2_norm(model, model_config, use_megatron_fsdp=config.dist.use_megatron_fsdp)

learning_rate = None

decoupled_learning_rate = None

for param_group in optimizer.param_groups:

if len(param_group) == 0:

continue

if param_group["is_decoupled_lr"]:

decoupled_learning_rate = param_group["lr"]

else:

learning_rate = param_group["lr"]

report_memory_flag = training_log(

loss_dict,

total_loss_dict,

learning_rate,

decoupled_learning_rate,

loss_scale,

report_memory_flag,

skipped_iter,

grad_norm,

params_norm,

num_zeros_in_grad,

config,

global_state,

history_wct,

model,

log_max_attention_logit,

)

🤖 Prompt for AI Agents

In `@src/megatron/bridge/training/train.py` around lines 470 - 506, The current outer check uses config.logger.tensorboard_dir to gate all logging, which unintentionally disables WandB/MLflow/console logging; remove or loosen that gate so training_log(...) is invoked regardless of tensorboard_dir (or replace the condition with a check that any sink is enabled, e.g., config.logger.use_tensorboard or config.logger.use_wandb or config.logger.use_mlflow), leaving the inner logic that computes loss_scale, params_norm, learning_rate, etc., intact; keep the call to training_log(...) so it can decide which sinks actually log.

coderabbitai · 2026-02-03T19:06:50Z

+    if train_config.numeric_checks:
+        update_successful = logical_and_across_model_parallel_group(update_successful, mp_group=pg_collection.mp)
+        # grad_norm and num_zeros_in_grad will be None on ranks without trainable params,
+        # so we must gather across mp ranks
+        grad_norm = reduce_max_stat_across_model_parallel_group(grad_norm, mp_group=pg_collection.mp)
+        if optim_config.log_num_zeros_in_grad:
+            num_zeros_in_grad = reduce_max_stat_across_model_parallel_group(
+                num_zeros_in_grad, mp_group=pg_collection.mp
+            )


⚠️ Potential issue | 🟠 Major

Keep update_successful synchronized regardless of numeric_checks.

If numeric checks are disabled, ranks can diverge on update_successful, causing inconsistent scheduler stepping. You can still gate grad/zero-count reductions while always synchronizing update_successful.

✅ Suggested fix

- if train_config.numeric_checks: - update_successful = logical_and_across_model_parallel_group(update_successful, mp_group=pg_collection.mp) + update_successful = logical_and_across_model_parallel_group(update_successful, mp_group=pg_collection.mp) + if train_config.numeric_checks: # grad_norm and num_zeros_in_grad will be None on ranks without trainable params, # so we must gather across mp ranks grad_norm = reduce_max_stat_across_model_parallel_group(grad_norm, mp_group=pg_collection.mp) if optim_config.log_num_zeros_in_grad: num_zeros_in_grad = reduce_max_stat_across_model_parallel_group( num_zeros_in_grad, mp_group=pg_collection.mp )

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

if train_config.numeric_checks:

update_successful = logical_and_across_model_parallel_group(update_successful, mp_group=pg_collection.mp)

# grad_norm and num_zeros_in_grad will be None on ranks without trainable params,

# so we must gather across mp ranks

grad_norm = reduce_max_stat_across_model_parallel_group(grad_norm, mp_group=pg_collection.mp)

if optim_config.log_num_zeros_in_grad:

num_zeros_in_grad = reduce_max_stat_across_model_parallel_group(

num_zeros_in_grad, mp_group=pg_collection.mp

)

update_successful = logical_and_across_model_parallel_group(update_successful, mp_group=pg_collection.mp)

if train_config.numeric_checks:

# grad_norm and num_zeros_in_grad will be None on ranks without trainable params,

# so we must gather across mp ranks

grad_norm = reduce_max_stat_across_model_parallel_group(grad_norm, mp_group=pg_collection.mp)

if optim_config.log_num_zeros_in_grad:

num_zeros_in_grad = reduce_max_stat_across_model_parallel_group(

num_zeros_in_grad, mp_group=pg_collection.mp

)

🤖 Prompt for AI Agents

In `@src/megatron/bridge/training/train.py` around lines 770 - 778, The check for update_successful must always be synchronized across model-parallel ranks: move the call to logical_and_across_model_parallel_group(update_successful, mp_group=pg_collection.mp) out of the train_config.numeric_checks conditional so update_successful is reduced unconditionally; leave the grad_norm and num_zeros_in_grad reductions (reduce_max_stat_across_model_parallel_group) inside the if train_config.numeric_checks block so those stats are only gathered when numeric checks are enabled.

coderabbitai · 2026-02-03T19:06:50Z

+    if not state.cfg.checkpoint.save:
+        return False


⚠️ Potential issue | 🟠 Major

Exit conditions should still apply without checkpoint saving.

The early return skips exit-signal, duration, interval, and straggler checks when checkpoint.save is unset. That can cause runs to ignore exit conditions entirely.

🛠️ Suggested fix

- if not state.cfg.checkpoint.save: - return False -

🤖 Prompt for AI Agents

In `@src/megatron/bridge/training/train.py` around lines 1166 - 1167, The early return guarded by "if not state.cfg.checkpoint.save: return False" prevents the exit-signal, duration, interval, and straggler checks from running; instead of returning immediately when state.cfg.checkpoint.save is false, skip only the checkpoint save action and still run the existing exit-condition checks (exit-signal, duration, interval, straggler) before returning; locate the block that references state.cfg.checkpoint.save in train.py and refactor so checkpoint saving is conditional but the exit-checking logic (the functions/conditions that evaluate exit signals, duration, interval and stragglers) always executes regardless of checkpoint.save.

Signed-off-by: Gautham Kollu <gkollu@nvidia.com>

gautham-kollu · 2026-02-03T22:10:27Z

/ok to test 349d5b1

Signed-off-by: Gautham Kollu <gkollu@nvidia.com>

gautham-kollu · 2026-02-03T22:20:45Z

/ok to test 1719976

ananthsub · 2026-02-03T22:47:06Z

-                continue
-            if param_group["is_decoupled_lr"]:
-                decoupled_learning_rate = param_group["lr"]
+        if config.logger.tensorboard_dir is not None:  # Skip logging as tensorboard logging is disabled.


with changes proposed in #2153, this will no longer be true. after 2153, users could log to tensorboard, W&B, and mlflow independently vs the current state where tensorboard is required to log to w&b. we can have some check up front whether logging will be enabled. this will be constant throughout training and can be done outside of the loop

ananthsub · 2026-02-03T22:57:06Z

    Returns:
        NVTX context if nsys profiling was started at this step, None otherwise
    """
+    if config is None:


this is the first thing checked in should_profile_rank - why do we need to repeat this here?

In some sense it is a bit overboard but the reason I added this was because of OmegaConfig.
If the config is converted from a raw python dictionary to OmegeConf-dict, .get() on omegaconfig-dict starts applying a lot of resolution-rules etc and ends up calling a lot of functions.

Walltime are minimal

out of curiosity, when is it an omegaconf during training? shouldn't it already be resolved to a dataclass?

The fields in the class could be python raw List or an OmegaConfig.ListConfig

ananthsub · 2026-02-03T22:59:04Z

    check_weight_hash_across_dp_replicas_interval: Optional[int] = None
    """Interval to check weight hashes are same across DP replicas. If not specified, weight hashes not checked."""

+    numeric_checks: bool = True


the config field name is really broad, so it's impossible to tell exactly which checks this disables based on the name alone

Signed-off-by: Gautham Kollu <gkollu@nvidia.com>

gautham-kollu · 2026-02-05T09:07:41Z

/ok to test 65ea63c

Signed-off-by: Gautham Kollu <gkollu@nvidia.com>

gautham-kollu · 2026-02-05T09:19:27Z

/ok to test 0f4b5a7

…2199) Signed-off-by: Gautham Kollu <gkollu@nvidia.com> Signed-off-by: gautham-kollu <gkollu@nvidia.com> Co-authored-by: Ananth Subramaniam <ansubramania@nvidia.com> Signed-off-by: sowmen <sowmendipta@gmail.com>

This reverts commit 5622b5c.

…" (#2352)

Signed-off-by: gkollu <gkollu@nvidia.com>

Adding some changes for perf

5659c24

Signed-off-by: Gautham Kollu <gkollu@nvidia.com>

remove code

d694715

Signed-off-by: Gautham Kollu <gkollu@nvidia.com>

gautham-kollu requested a review from ananthsub February 3, 2026 19:00

copy-pr-bot Bot temporarily deployed to nemo-ci February 3, 2026 19:00 Inactive

coderabbitai Bot reviewed Feb 3, 2026

View reviewed changes

gautham-kollu added 2 commits February 3, 2026 13:56

Merge branch 'main' into gk/bridge_perfs

4378e77

fix lint

349d5b1

Signed-off-by: Gautham Kollu <gkollu@nvidia.com>

copy-pr-bot Bot had a problem deploying to nemo-ci February 3, 2026 22:11 Error

copy-pr-bot Bot temporarily deployed to test February 3, 2026 22:11 Inactive

copy-pr-bot Bot had a problem deploying to nemo-ci February 3, 2026 22:15 Error

clean up

1719976

Signed-off-by: Gautham Kollu <gkollu@nvidia.com>

copy-pr-bot Bot temporarily deployed to nemo-ci February 3, 2026 22:21 Inactive

copy-pr-bot Bot temporarily deployed to test February 3, 2026 22:21 Inactive

copy-pr-bot Bot temporarily deployed to nemo-ci February 3, 2026 22:29 Inactive

ananthsub reviewed Feb 3, 2026

View reviewed changes

copy-pr-bot Bot had a problem deploying to nemo-ci February 3, 2026 23:17 Failure

PR comments

65ea63c

Signed-off-by: Gautham Kollu <gkollu@nvidia.com>

copy-pr-bot Bot had a problem deploying to nemo-ci February 5, 2026 09:08 Error

copy-pr-bot Bot had a problem deploying to test February 5, 2026 09:08 Error

docs

0f4b5a7

Signed-off-by: Gautham Kollu <gkollu@nvidia.com>

copy-pr-bot Bot temporarily deployed to test February 5, 2026 09:20 Inactive

copy-pr-bot Bot temporarily deployed to nemo-ci February 5, 2026 09:20 Inactive

copy-pr-bot Bot temporarily deployed to nemo-ci February 5, 2026 10:08 Inactive

copy-pr-bot Bot temporarily deployed to nemo-ci February 11, 2026 03:46 Inactive

ananthsub approved these changes Feb 11, 2026

View reviewed changes

gautham-kollu merged commit 5622b5c into main Feb 11, 2026
55 checks passed

gautham-kollu deleted the gk/bridge_perfs branch February 11, 2026 09:08

gautham-kollu added cherry-pick r0.3.0 Cherry-pick label for r0.3.0 release branch labels Feb 11, 2026

ko3n1g added a commit that referenced this pull request Feb 12, 2026

Revert "Bridging perf from NeMo2 to Mbridge for certain configs (#2199)"

cec042d

This reverts commit 5622b5c.

coderabbitai Bot mentioned this pull request Feb 12, 2026

Revert "Bridging perf from NeMo2 to Mbridge for certain configs (#2199)" #2352

Merged

5 tasks

ko3n1g added a commit that referenced this pull request Feb 12, 2026

Revert "Bridging perf from NeMo2 to Mbridge for certain configs (#2199)…

9026845

…" (#2352)

gautham-kollu mentioned this pull request Feb 12, 2026

Reapply #2199 #2359

Merged

5 tasks

gautham-kollu added a commit that referenced this pull request Feb 12, 2026

Reapply #2199 (#2359)

ffcd56a

Signed-off-by: gkollu <gkollu@nvidia.com>

This was referenced Feb 12, 2026

[example, doc] fix: Fix RLHF script bugs and tutorial tokenization path #2364

Closed

support for training qwen3 vl with dist train #2367

Merged

feat(evaluate): support multiple validation datasets + decouple val/train dataloader config #685

Open

coderabbitai Bot mentioned this pull request Feb 26, 2026

Add vision flops calculation for vlm #2575

Open

4 tasks

Conversation

gautham-kollu commented Feb 3, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What does this PR do ?

Changelog

GitHub Actions CI

Before your PR is "Ready for review"

Additional Information

Summary by CodeRabbit

Release Notes

Uh oh!

copy-pr-bot Bot commented Feb 3, 2026

Uh oh!

coderabbitai Bot commented Feb 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Uh oh!

gautham-kollu commented Feb 3, 2026

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Feb 3, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Feb 3, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Feb 3, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Feb 3, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Feb 3, 2026

Choose a reason for hiding this comment

Uh oh!

gautham-kollu commented Feb 3, 2026

Uh oh!

gautham-kollu commented Feb 3, 2026

Uh oh!

ananthsub Feb 3, 2026

Choose a reason for hiding this comment

Uh oh!

gautham-kollu Feb 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ananthsub Feb 3, 2026

Choose a reason for hiding this comment

Uh oh!

gautham-kollu Feb 5, 2026

Choose a reason for hiding this comment

Uh oh!

ananthsub Feb 5, 2026

Choose a reason for hiding this comment

Uh oh!

gautham-kollu Feb 11, 2026

Choose a reason for hiding this comment

Uh oh!

ananthsub Feb 3, 2026

Choose a reason for hiding this comment

Uh oh!

gautham-kollu Feb 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

gautham-kollu commented Feb 5, 2026

Uh oh!

gautham-kollu commented Feb 5, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

gautham-kollu commented Feb 3, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Feb 3, 2026 •

edited

Loading

gautham-kollu Feb 5, 2026 •

edited

Loading

gautham-kollu Feb 5, 2026 •

edited

Loading