feat: Add first-class Comet ML experiment tracking by cuichenx · Pull Request #2910 · NVIDIA-NeMo/Megatron-Bridge

cuichenx · 2026-03-20T00:34:33Z

Summary

Adds Comet ML as a fourth experiment tracking backend alongside TensorBoard, Weights & Biases, and MLflow, following the same structural pattern
New config fields (comet_project, comet_experiment_name, comet_workspace, comet_api_key, comet_tags) in LoggerConfig with finalize() validation
GlobalState.comet_logger property lazily initializes a comet_ml.Experiment on the last rank, logs full training config as parameters, and supports tags
All ~18 metric logging call sites in training_log() now dispatch to Comet alongside WandB and MLflow; validation metrics in eval.py also logged
Timer metrics via _timers_write_to_comet (no metric name sanitization — Comet supports / natively)
Checkpoint tracking callbacks in comet_utils.py wired into checkpointing.py
_CometExperimentLogger wrapper for NVIDIA DLFw Inspect MetricLogger
CometPlugin added to run_plugins.py for NeMo Run launching
Documentation added to docs/training/logging.md

Fixes: #2652

Based on @shanecmoran's original PR #2653, rebased onto current main with uv.lock conflict resolved.

Test plan

Unit tests for LoggerConfig comet fields (test_config.py)
Unit tests for GlobalState.comet_logger property and _timers_write_to_comet (test_state.py)
Unit tests for comet_utils.py checkpoint callbacks (test_comet_utils.py)
Functional test: run training with comet_project set and verify metrics appear in Comet dashboard

🤖 Generated with Claude Code

Summary by CodeRabbit

New Features
- Added Comet ML as a metrics and metadata backend for training, alongside TensorBoard, W&B, and MLflow
- Support for configuring Comet ML project, experiment name, workspace, and tags through LoggerConfig
Documentation
- Added Comet ML logging guide with setup instructions and configuration details
Tests
- Added unit tests for Comet ML integration

Add Comet ML as a fourth experiment tracking backend alongside TensorBoard, Weights & Biases, and MLflow. The integration follows the exact same structural pattern used by the existing backends. Config fields: comet_project, comet_experiment_name, comet_workspace, comet_api_key, comet_tags in LoggerConfig with finalize() validation. GlobalState.comet_logger property lazily initializes a comet_ml.Experiment on the last rank, logs the full training config as parameters, and supports tags. Training metrics: all ~18 metric logging call sites in training_log() now dispatch to Comet alongside WandB and MLflow. Validation metrics in eval.py are also logged. Timer metrics use a new _timers_write_to_comet patch (no metric name sanitization since Comet supports / in names natively). Checkpoint tracking: comet_utils.py provides on_save_checkpoint_success and on_load_checkpoint_success callbacks wired into checkpointing.py. Tensor inspect: _CometExperimentLogger wrapper registered with NVIDIA DLFw Inspect MetricLogger system. CometPlugin added to run_plugins.py for NeMo Run launching. Documentation added to docs/training/logging.md. Fixes: #2652 Signed-off-by: Shane Moran <shane.moran@shopify.com>

- Include comet_api_key and comet_tags in using_comet detection - Use truthy check for comet_experiment_name (handles None and "") - Strip whitespace from api_key before passing to Experiment - Use backing attribute _comet_logger during teardown to avoid accidental lazy initialization - Add stacklevel=2 to timer warning for better caller diagnostics Signed-off-by: Shane Moran <shane.moran@shopify.com>

wandb and mlflow are both in the main dependency list. Add comet-ml to match, since it's now a first-class logging backend. Signed-off-by: Shane Moran <shane.moran@shopify.com>

Signed-off-by: Shane Moran <shane.moran@shopify.com>

Adds missing markers to TestCometLoggerProperty and TestTimersWriteToComet in test_state.py for consistency with test_comet_utils.py. Signed-off-by: Shane Moran <shane.moran@shopify.com>

# Conflicts: # uv.lock

copy-pr-bot · 2026-03-20T00:34:38Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Regenerate uv.lock to include comet-ml and its transitive dependencies (configobj, dulwich, everett, python-box, requests-toolbelt, semantic-version, simplejson, wurlitzer). Signed-off-by: Chen Cui <chcui@nvidia.com>

cuichenx · 2026-03-20T00:40:55Z

/ok to test 2771522

coderabbitai · 2026-03-20T00:48:36Z

📝 Walkthrough

Walkthrough

Added first-class Comet ML integration alongside existing TensorBoard, W&B, and MLflow backends. Includes configuration fields, lazy-initialized logger in GlobalState, training/validation/timer/checkpoint metric logging, tensor inspection support, NeMo Run plugin, and comprehensive unit tests.

Changes

Cohort / File(s)	Summary
Configuration & Dependencies `pyproject.toml`, `src/megatron/bridge/training/config.py`	Added `comet_ml>=3.50.0` dependency and extended `LoggerConfig` with Comet fields (`comet_project`, `comet_experiment_name`, `comet_workspace`, `comet_api_key`, `comet_tags`); added validation requiring non-empty `comet_experiment_name` when `comet_project` is set, and import checking for `comet_ml` package.
State Management & Initialization `src/megatron/bridge/training/state.py`, `src/megatron/bridge/training/setup.py`	Added lazily-initialized `comet_logger` property to `GlobalState` (last-rank only, stores `comet_ml.Experiment`); extended `timers` initialization with `write_to_comet` method; added `_timers_write_to_comet()` for aggregating and logging timer metrics; updated `reset_for_restart()` to clear Comet logger.
Training Metrics & Logging `src/megatron/bridge/training/train_utils.py`, `src/megatron/bridge/training/eval.py`	Extended `training_log()` to emit all scalar/throughput/memory/performance metrics to Comet alongside WandB/MLflow; added Comet metric logging in `evaluate_and_print_results()` for validation loss and perplexity.
Checkpoint Lifecycle `src/megatron/bridge/training/checkpointing.py`, `src/megatron/bridge/training/utils/comet_utils.py`	Hooked Comet finalization callbacks into async/sync checkpoint save flows; added `comet_utils` module with `on_save_checkpoint_success()` and `on_load_checkpoint_success()` for logging checkpoint metadata (`last_saved_checkpoint`, `last_loaded_checkpoint`) to Comet.
Tensor Inspection Integration `src/megatron/bridge/training/tensor_inspect.py`	Extended `finalize_tensor_inspect_post_model_initialization()` to accept optional `comet_logger` parameter; added internal `_CometExperimentLogger` class to attach Comet as metric backend alongside TensorBoard/W&B.
NeMo Run Plugin `src/megatron/bridge/recipes/run_plugins.py`	Added `CometPlugin` to inject `COMET_API_KEY` into executor environment and generate CLI argument overrides for `logger.comet_project`, `logger.comet_workspace`, and `logger.comet_experiment_name` when launching training scripts.
Training Cleanup `src/megatron/bridge/training/train.py`	Added explicit teardown by calling `comet_logger.end()` in both main training cleanup path and `_finish_train()` function.
Documentation `docs/training/logging.md`	Added "Comet ML" section documenting configuration fields, installation, authentication via `COMET_API_KEY`, lazy initialization behavior, NeMo Run integration, and logged metrics/metadata.
Unit Tests `tests/unit_tests/training/test_config.py`, `tests/unit_tests/training/test_state.py`, `tests/unit_tests/training/utils/test_comet_utils.py`	Added comprehensive test suites covering `LoggerConfig.finalize()` validation, `GlobalState.comet_logger` property initialization (rank gating, error handling), `_timers_write_to_comet()` metric aggregation, and `comet_utils` checkpoint callback functions.

Sequence Diagram(s)

sequenceDiagram
    actor User
    participant Training as Training Loop
    participant GlobalState
    participant TrainingLog
    participant Checkpointing
    participant CometML as Comet ML

    User->>Training: start training
    Training->>GlobalState: initialize (comet_logger property)
    GlobalState->>CometML: create Experiment
    CometML-->>GlobalState: return Experiment instance
    
    loop Training Steps
        Training->>TrainingLog: log metrics
        TrainingLog->>CometML: log_metrics(scalars, throughput, memory, etc.)
        TrainingLog->>CometML: write_to_comet() for timers
    end
    
    loop Validation Steps
        Training->>Training: evaluate_and_print_results()
        Training->>CometML: log_metrics(validation_loss, ppl)
    end
    
    Training->>Checkpointing: save checkpoint
    Checkpointing->>CometML: log_other(last_saved_checkpoint, iteration)
    
    Training->>GlobalState: cleanup (end comet_logger)
    GlobalState->>CometML: end() experiment

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 6

✅ Passed checks (6 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The PR title 'feat: Add first-class Comet ML experiment tracking' accurately summarizes the main change: adding Comet ML as a first-class experiment tracking backend alongside existing backends.
Linked Issues check	✅ Passed	All coding objectives from issue `#2652` are fully implemented: config fields, lazy-initialized comet_logger, training/validation/timer metrics logging, checkpoint callbacks, tensor inspect integration, run plugin, and documentation.
Out of Scope Changes check	✅ Passed	All changes are directly scoped to implementing first-class Comet ML integration as specified in `#2652`. No unrelated modifications or unnecessary refactoring detected.
Docstring Coverage	✅ Passed	Docstring coverage is 86.67% which is sufficient. The required threshold is 80.00%.
Test Results For Major Changes	✅ Passed	PR documentation comprehensively documents ~397 lines of unit test code for the major Comet ML integration feature, covering config validation, logger initialization, timers, and checkpoint callbacks.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch feat/comet-ml-logging

📝 Coding Plan

Generate coding plan for human review comments

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)

src/megatron/bridge/training/utils/train_utils.py (2)
459-467: ⚠️ Potential issue | 🟠 Major

Comet timer metrics are reset before they’re written.

The surrounding reset logic only accounts for the W&B path. In a Comet-only run, timers.write(..., reset=True) clears the interval before write_to_comet() executes; with W&B enabled, write_to_wandb(..., reset=True) does the same. Defer the reset until the last sink or fan out a single timer snapshot.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/megatron/bridge/training/utils/train_utils.py` around lines 459 - 467,
The timers are being reset before some sinks (notably Comet) get written; change
the logic in train_utils.py so the reset occurs only after the last sink is
written or produce a single snapshot and fan it out. Specifically, call
timers.write(...) with reset=False (or capture a snapshot via a non-resetting
API) then call write_to_wandb, write_to_mlflow, write_to_comet in sequence and
finally perform a single timers.write(..., reset=True) (or invoke the timers'
reset once) so that write_to_comet (and other downstream writers) receive the
same data; update the use of reset_in_tb and the calls to timers.write,
timers.write_to_wandb, timers.write_to_mlflow, and timers.write_to_comet
accordingly.
498-646: ⚠️ Potential issue | 🟠 Major

Don’t gate Comet metrics on a TensorBoard writer.

These new comet_logger.log_metrics(...) calls are still under the surrounding if writer and ... / if writer: guards. When TensorBoard is disabled, Comet silently drops these core metrics even though the experiment is active. Move the interval guard outside and only gate the writer.add_scalar(...) calls.

Also applies to: 713-720
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/megatron/bridge/training/utils/train_utils.py` around lines 498 - 646,
The Comet logging calls (comet_logger.log_metrics(...)) are incorrectly nested
under the TensorBoard writer guards so Comet metrics are dropped when
TensorBoard is disabled; modify the metric-emission blocks (e.g., the blocks
using writer.add_scalar(...) and comet_logger.log_metrics(...) around
memory_report, runtime_report, l2_report, loss_dict, etc.) to only gate
writer.add_scalar(...) with the writer condition while always executing
comet_logger.log_metrics(...) when comet_logger is present, and replicate the
same change for the other affected block mentioned (the similar guard around the
block at lines referenced in the review). Ensure you update the conditional
structure (use separate if writer: for writer.add_scalar calls and a separate if
comet_logger: for comet_logger.log_metrics calls) for functions/variables such
as writer.add_scalar, comet_logger.log_metrics, wandb_writer.log, and
mlflow_logger.log_metrics so Comet logging is no longer behind the TensorBoard
writer guard.
src/megatron/bridge/training/state.py (1)
436-448: ⚠️ Potential issue | 🟠 Major

Call cleanup methods on W&B and Comet loggers in reset_for_restart before clearing them.

The reset_for_restart method clears both _wandb_logger and _comet_logger without calling their respective cleanup methods. In the normal teardown paths in train.py (lines 644-645 and 1320-1323), .finish() is called on the W&B logger and .end() is called on the Comet experiment before clearing. During an in-process restart, these experiments may be abandoned without proper closure, potentially losing buffered metrics.
🔧 Suggested fix
 def reset_for_restart(self) -> None:
     """Reset GlobalState components for in-process restart.

     This cleans up all stateful components that need to be reinitialized between restart iterations.
     The async calls queue for checkpointing is handled separately in aborting in order to clean up persistent workers.
     """
     self._timers = None
     self._train_state = None
     self._tensorboard_logger = None
+    if self._wandb_logger is not None:
+        self._wandb_logger.finish()
     self._wandb_logger = None
     self._mlflow_logger = None
+    if self._comet_logger is not None:
+        self._comet_logger.end()
     self._comet_logger = None
     self._energy_monitor = None
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/megatron/bridge/training/state.py` around lines 436 - 448, In
reset_for_restart, call the proper cleanup methods on the loggers before nulling
them: if self._wandb_logger is set, call its finish() method (handle/ignore
exceptions) then set self._wandb_logger = None; similarly, if self._comet_logger
is set, call its end() method (handle/ignore exceptions) then set
self._comet_logger = None; keep the rest of the existing resets (self._timers,
self._train_state, etc.) unchanged and ensure these calls occur prior to
clearing those attributes in the reset_for_restart method.

🧹 Nitpick comments (3)

tests/unit_tests/training/test_config.py (1)

2913-2929: Make the positive Comet tests prove the branch ran.

These cases only assert “no exception”. If finalize() stops treating comet_workspace, comet_api_key, or comet_tags as Comet activators, they still pass. Please assert that the patched importlib.import_module was exercised, and add standalone comet_api_key / comet_tags cases.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@tests/unit_tests/training/test_config.py` around lines 2913 - 2929, The
positive Comet tests currently only check for no exception; update
test_finalize_with_comet_workspace_only and
test_finalize_with_all_comet_settings to assert that the patched
importlib.import_module mock was actually called (e.g., get the mock from the
patch context and assert_called/assert_called_once), and add two new tests that
construct LoggerConfig with only comet_api_key and only comet_tags respectively
and likewise patch importlib.import_module and assert it was exercised; locate
references to LoggerConfig.finalize and the test functions
test_finalize_with_comet_workspace_only/test_finalize_with_all_comet_settings to
modify and add these assertions and new test cases.

src/megatron/bridge/training/tensor_inspect.py (1)

75-78: Update docstring to include Comet ML.

The docstring still references only TensorBoard and W&B, but the function now also attaches Comet ML loggers.

📝 Suggested docstring update

 def _maybe_attach_metric_loggers(
     tensorboard_logger: Any | None, wandb_logger: Any | None, comet_logger: Any | None
 ) -> None:
-    """Attach supported metric loggers (TensorBoard, W&B raw module)."""
+    """Attach supported metric loggers (TensorBoard, W&B, Comet ML)."""

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@src/megatron/bridge/training/tensor_inspect.py` around lines 75 - 78, The
docstring for _maybe_attach_metric_loggers currently mentions only TensorBoard
and W&B; update it to include Comet ML so it accurately describes all supported
metric loggers (TensorBoard, W&B raw module, and Comet ML) and reference the
parameters tensorboard_logger, wandb_logger, and comet_logger to clarify which
logger each argument accepts; keep the docstring concise and consistent with the
existing style.

tests/unit_tests/training/test_state.py (1)

989-1074: Consider adding a positive test case for successful Comet logger initialization.

The test class covers disabled/None/wrong-rank/missing-name scenarios, but lacks a test verifying successful initialization on the last rank with valid config (similar to test_wandb_logger_property_enabled_rank_n_minus_1 at line 372 and test_mlflow_logger_property_enabled_rank_n_minus_1 at line 680).

💡 Example test case

def test_comet_logger_property_enabled_rank_n_minus_1(self):
    """Test comet logger enabled for rank N-1."""
    state = GlobalState()
    mock_config = MagicMock()
    mock_config.logger.comet_project = "test_project"
    mock_config.logger.comet_experiment_name = "test_experiment"
    mock_config.logger.comet_api_key = None
    mock_config.logger.comet_workspace = "test_workspace"
    mock_config.logger.comet_tags = ["tag1", "tag2"]
    mock_config.to_dict.return_value = {"config": "data"}
    state._cfg = mock_config

    mock_experiment = MagicMock()
    mock_comet_ml = MagicMock()
    mock_comet_ml.Experiment.return_value = mock_experiment

    with (
        patch("megatron.bridge.training.state.get_rank_safe", return_value=3),
        patch("megatron.bridge.training.state.get_world_size_safe", return_value=4),
        patch.dict("sys.modules", {"comet_ml": mock_comet_ml}),
    ):
        # Re-import to use patched comet_ml
        import importlib
        import megatron.bridge.training.state as state_module
        importlib.reload(state_module)
        
        state = state_module.GlobalState()
        state._cfg = mock_config
        
        with (
            patch("megatron.bridge.training.state.get_rank_safe", return_value=3),
            patch("megatron.bridge.training.state.get_world_size_safe", return_value=4),
        ):
            logger = state.comet_logger
            
            mock_comet_ml.Experiment.assert_called_once()
            mock_experiment.set_name.assert_called_once_with("test_experiment")
            mock_experiment.add_tags.assert_called_once_with(["tag1", "tag2"])
            mock_experiment.log_parameters.assert_called_once()
            assert logger == mock_experiment

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@tests/unit_tests/training/test_state.py` around lines 989 - 1074, Add a
positive unit test that verifies GlobalState.comet_logger initializes a Comet
Experiment on the last rank: create a MagicMock config with
logger.comet_project, logger.comet_experiment_name, logger.comet_workspace,
logger.comet_tags and to_dict(), patch get_rank_safe to return world_size-1 and
get_world_size_safe accordingly, inject a mocked comet_ml into sys.modules whose
Experiment returns a mock_experiment, reload megatron.bridge.training.state to
pick up the patched module, then instantiate GlobalState and assert
mock_comet_ml.Experiment was called, mock_experiment.set_name called with the
experiment name, mock_experiment.add_tags called with the tags,
mock_experiment.log_parameters called, and that state.comet_logger equals
mock_experiment.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/megatron/bridge/training/checkpointing.py`:
- Around line 837-849: The comet_finalize_fn closure currently reads
train_state.step and state.comet_logger at execution time; snapshot the needed
values eagerly so the async callback uses the intended checkpoint context:
before defining comet_finalize_fn capture local variables like step_snapshot =
train_state.step and comet_logger_snapshot = state.comet_logger (and keep
checkpoint_name/save_dir as they already are), then have comet_finalize_fn call
comet_utils.on_save_checkpoint_success(checkpoint_name, save_dir, step_snapshot,
comet_logger=comet_logger_snapshot); register that closure via
async_save_request.add_finalize_fn as before.

In `@src/megatron/bridge/training/utils/comet_utils.py`:
- Around line 21-48: The on_save_checkpoint_success function currently ignores
the save_dir parameter causing inconsistency with wandb_utils.py and
mlflow_utils.py; keep the signature and update the implementation and docstring
to log save_dir as metadata to Comet ML (e.g., call
comet_logger.log_other("checkpoint_save_dir", str(Path(save_dir).resolve())) and
also include save_dir when logging the resolved checkpoint path/iteration),
ensure you still resolve checkpoint_path via Path and wrap logs in the existing
try/except, and update the docstring to state that save_dir is recorded to Comet
ML metadata; use the existing symbols on_save_checkpoint_success,
comet_logger.log_other, Path, and the except block for error handling.

---

Outside diff comments:
In `@src/megatron/bridge/training/state.py`:
- Around line 436-448: In reset_for_restart, call the proper cleanup methods on
the loggers before nulling them: if self._wandb_logger is set, call its finish()
method (handle/ignore exceptions) then set self._wandb_logger = None; similarly,
if self._comet_logger is set, call its end() method (handle/ignore exceptions)
then set self._comet_logger = None; keep the rest of the existing resets
(self._timers, self._train_state, etc.) unchanged and ensure these calls occur
prior to clearing those attributes in the reset_for_restart method.

In `@src/megatron/bridge/training/utils/train_utils.py`:
- Around line 459-467: The timers are being reset before some sinks (notably
Comet) get written; change the logic in train_utils.py so the reset occurs only
after the last sink is written or produce a single snapshot and fan it out.
Specifically, call timers.write(...) with reset=False (or capture a snapshot via
a non-resetting API) then call write_to_wandb, write_to_mlflow, write_to_comet
in sequence and finally perform a single timers.write(..., reset=True) (or
invoke the timers' reset once) so that write_to_comet (and other downstream
writers) receive the same data; update the use of reset_in_tb and the calls to
timers.write, timers.write_to_wandb, timers.write_to_mlflow, and
timers.write_to_comet accordingly.
- Around line 498-646: The Comet logging calls (comet_logger.log_metrics(...))
are incorrectly nested under the TensorBoard writer guards so Comet metrics are
dropped when TensorBoard is disabled; modify the metric-emission blocks (e.g.,
the blocks using writer.add_scalar(...) and comet_logger.log_metrics(...) around
memory_report, runtime_report, l2_report, loss_dict, etc.) to only gate
writer.add_scalar(...) with the writer condition while always executing
comet_logger.log_metrics(...) when comet_logger is present, and replicate the
same change for the other affected block mentioned (the similar guard around the
block at lines referenced in the review). Ensure you update the conditional
structure (use separate if writer: for writer.add_scalar calls and a separate if
comet_logger: for comet_logger.log_metrics calls) for functions/variables such
as writer.add_scalar, comet_logger.log_metrics, wandb_writer.log, and
mlflow_logger.log_metrics so Comet logging is no longer behind the TensorBoard
writer guard.

---

Nitpick comments:
In `@src/megatron/bridge/training/tensor_inspect.py`:
- Around line 75-78: The docstring for _maybe_attach_metric_loggers currently
mentions only TensorBoard and W&B; update it to include Comet ML so it
accurately describes all supported metric loggers (TensorBoard, W&B raw module,
and Comet ML) and reference the parameters tensorboard_logger, wandb_logger, and
comet_logger to clarify which logger each argument accepts; keep the docstring
concise and consistent with the existing style.

In `@tests/unit_tests/training/test_config.py`:
- Around line 2913-2929: The positive Comet tests currently only check for no
exception; update test_finalize_with_comet_workspace_only and
test_finalize_with_all_comet_settings to assert that the patched
importlib.import_module mock was actually called (e.g., get the mock from the
patch context and assert_called/assert_called_once), and add two new tests that
construct LoggerConfig with only comet_api_key and only comet_tags respectively
and likewise patch importlib.import_module and assert it was exercised; locate
references to LoggerConfig.finalize and the test functions
test_finalize_with_comet_workspace_only/test_finalize_with_all_comet_settings to
modify and add these assertions and new test cases.

In `@tests/unit_tests/training/test_state.py`:
- Around line 989-1074: Add a positive unit test that verifies
GlobalState.comet_logger initializes a Comet Experiment on the last rank: create
a MagicMock config with logger.comet_project, logger.comet_experiment_name,
logger.comet_workspace, logger.comet_tags and to_dict(), patch get_rank_safe to
return world_size-1 and get_world_size_safe accordingly, inject a mocked
comet_ml into sys.modules whose Experiment returns a mock_experiment, reload
megatron.bridge.training.state to pick up the patched module, then instantiate
GlobalState and assert mock_comet_ml.Experiment was called,
mock_experiment.set_name called with the experiment name,
mock_experiment.add_tags called with the tags, mock_experiment.log_parameters
called, and that state.comet_logger equals mock_experiment.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 312df4cc-c897-4181-bdd7-faed3c692fdc

📥 Commits

Reviewing files that changed from the base of the PR and between 94a90b2 and 6dabdbe.

📒 Files selected for processing (15)

docs/training/logging.md
pyproject.toml
src/megatron/bridge/recipes/run_plugins.py
src/megatron/bridge/training/checkpointing.py
src/megatron/bridge/training/config.py
src/megatron/bridge/training/eval.py
src/megatron/bridge/training/setup.py
src/megatron/bridge/training/state.py
src/megatron/bridge/training/tensor_inspect.py
src/megatron/bridge/training/train.py
src/megatron/bridge/training/utils/comet_utils.py
src/megatron/bridge/training/utils/train_utils.py
tests/unit_tests/training/test_config.py
tests/unit_tests/training/test_state.py
tests/unit_tests/training/utils/test_comet_utils.py

coderabbitai · 2026-03-20T00:48:39Z

+        def comet_finalize_fn() -> None:
+            comet_utils.on_save_checkpoint_success(
+                checkpoint_name,
+                save_dir,
+                train_state.step,
+                comet_logger=state.comet_logger,
+            )
+
        if ckpt_cfg.async_save:
            assert async_save_request is not None
            async_save_request.add_finalize_fn(wandb_finalize_fn)
            async_save_request.add_finalize_fn(mlflow_finalize_fn)
+            async_save_request.add_finalize_fn(comet_finalize_fn)


⚠️ Potential issue | 🟠 Major

Snapshot the Comet run and checkpoint step before registering the callback.

In async saves this closure executes later, so train_state.step may already refer to a newer step than checkpoint_name, and state.comet_logger is resolved after any logger lifecycle changes. Capture both values eagerly and pass the snapshots to on_save_checkpoint_success().

Suggested change

+ checkpoint_step = train_state.step + current_comet_logger = state.comet_logger + def comet_finalize_fn() -> None: comet_utils.on_save_checkpoint_success( checkpoint_name, save_dir, - train_state.step, - comet_logger=state.comet_logger, + checkpoint_step, + comet_logger=current_comet_logger, )

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@src/megatron/bridge/training/checkpointing.py` around lines 837 - 849, The comet_finalize_fn closure currently reads train_state.step and state.comet_logger at execution time; snapshot the needed values eagerly so the async callback uses the intended checkpoint context: before defining comet_finalize_fn capture local variables like step_snapshot = train_state.step and comet_logger_snapshot = state.comet_logger (and keep checkpoint_name/save_dir as they already are), then have comet_finalize_fn call comet_utils.on_save_checkpoint_success(checkpoint_name, save_dir, step_snapshot, comet_logger=comet_logger_snapshot); register that closure via async_save_request.add_finalize_fn as before.

coderabbitai · 2026-03-20T00:48:39Z

+def on_save_checkpoint_success(
+    checkpoint_path: str,
+    save_dir: str,
+    iteration: int,
+    comet_logger: Optional[Any],
+) -> None:
+    """Callback executed after a checkpoint is successfully saved.
+
+    If a Comet ML experiment is provided, records the checkpoint path and iteration
+    as experiment metadata.
+
+    Args:
+        checkpoint_path: The path to the specific checkpoint file/directory saved.
+        save_dir: The base directory where checkpoints are being saved.
+        iteration: The training iteration at which the checkpoint was saved.
+        comet_logger: The Comet ML Experiment instance.
+                      If None, this function is a no-op.
+    """
+    if comet_logger is None:
+        return
+
+    try:
+        resolved_ckpt = str(Path(checkpoint_path).resolve())
+        comet_logger.log_other("last_saved_checkpoint", resolved_ckpt)
+        comet_logger.log_other("last_saved_iteration", iteration)
+    except Exception as exc:
+        print_rank_last(f"Failed to log checkpoint information to Comet ML: {exc}")
+


⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash # Check how wandb_utils and mlflow_utils use save_dir in their on_save_checkpoint_success rg -n "on_save_checkpoint_success" --type=py -A 20 src/megatron/bridge/training/utils/wandb_utils.py src/megatron/bridge/training/utils/mlflow_utils.py

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 3939

🏁 Script executed:

#!/bin/bash # Get complete wandb implementation echo "=== WANDB UTILS ===" rg -n "on_save_checkpoint_success" --type=py -A 30 src/megatron/bridge/training/utils/wandb_utils.py echo "" echo "=== MLFLOW UTILS ===" # Get complete mlflow implementation rg -n "on_save_checkpoint_success" --type=py -A 30 src/megatron/bridge/training/utils/mlflow_utils.py

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 3177

save_dir parameter is unused and inconsistent with other logger implementations.

The save_dir parameter is documented but never used in the function body, while both wandb_utils.py and mlflow_utils.py actively use it to construct artifact metadata. Either use save_dir for additional Comet ML logging (to maintain API consistency with other backends) or remove it from the signature.

🧰 Tools

🪛 Ruff (0.15.6)

[warning] 46-46: Do not catch blind exception: Exception

(BLE001)

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@src/megatron/bridge/training/utils/comet_utils.py` around lines 21 - 48, The on_save_checkpoint_success function currently ignores the save_dir parameter causing inconsistency with wandb_utils.py and mlflow_utils.py; keep the signature and update the implementation and docstring to log save_dir as metadata to Comet ML (e.g., call comet_logger.log_other("checkpoint_save_dir", str(Path(save_dir).resolve())) and also include save_dir when logging the resolved checkpoint path/iteration), ensure you still resolve checkpoint_path via Path and wrap logs in the existing try/except, and update the docstring to state that save_dir is recorded to Comet ML metadata; use the existing symbols on_save_checkpoint_success, comet_logger.log_other, Path, and the except block for error handling.

Signed-off-by: Shane Moran <shane.moran@shopify.com> Signed-off-by: Chen Cui <chcui@nvidia.com> Co-authored-by: Shane Moran <shane.moran@shopify.com> Signed-off-by: Li Ding <liding@nvidia.com>

shanecmoran and others added 6 commits March 12, 2026 10:58

add comet-ml to project dependencies

69725d1

wandb and mlflow are both in the main dependency list. Add comet-ml to match, since it's now a first-class logging backend. Signed-off-by: Shane Moran <shane.moran@shopify.com>

chore: update uv.lock for comet-ml dependency

0d9f1c4

Signed-off-by: Shane Moran <shane.moran@shopify.com>

test: Add @pytest.mark.unit markers to Comet ML test classes

d40490d

Adds missing markers to TestCometLoggerProperty and TestTimersWriteToComet in test_state.py for consistency with test_comet_utils.py. Signed-off-by: Shane Moran <shane.moran@shopify.com>

Merge remote-tracking branch 'origin/main' into feat/comet-ml-logging

6dabdbe

# Conflicts: # uv.lock

cuichenx requested a review from a team as a code owner March 20, 2026 00:34

cuichenx marked this pull request as draft March 20, 2026 00:37

chore: update uv.lock for comet-ml dependency

2771522

Regenerate uv.lock to include comet-ml and its transitive dependencies (configobj, dulwich, everett, python-box, requests-toolbelt, semantic-version, simplejson, wurlitzer). Signed-off-by: Chen Cui <chcui@nvidia.com>

cuichenx marked this pull request as ready for review March 20, 2026 00:40

copy-pr-bot Bot temporarily deployed to test March 20, 2026 00:41 Inactive

coderabbitai Bot reviewed Mar 20, 2026

View reviewed changes

copy-pr-bot Bot temporarily deployed to nemo-ci March 20, 2026 00:50 Inactive

cuichenx added area:training Training loop, callbacks, and runtime integration ready-to-merge PR is approved, current, and only waiting for CI to pass before merge and removed area:training Training loop, callbacks, and runtime integration labels Mar 20, 2026

copy-pr-bot Bot temporarily deployed to nemo-ci March 20, 2026 01:27 Inactive

copy-pr-bot Bot temporarily deployed to nemo-ci March 20, 2026 01:35 Inactive

yaoyu-33 merged commit 80a8070 into main Mar 20, 2026
37 of 38 checks passed

yaoyu-33 deleted the feat/comet-ml-logging branch March 20, 2026 17:57

LoganVegnaSHOP mentioned this pull request Mar 25, 2026

track_moe_metrics() does not forward MoE metrics to Comet ML #2989

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: Add first-class Comet ML experiment tracking#2910

feat: Add first-class Comet ML experiment tracking#2910
yaoyu-33 merged 7 commits into
mainfrom
feat/comet-ml-logging

cuichenx commented Mar 20, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

copy-pr-bot Bot commented Mar 20, 2026

Uh oh!

cuichenx commented Mar 20, 2026

Uh oh!

coderabbitai Bot commented Mar 20, 2026

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot Mar 20, 2026

Uh oh!

coderabbitai Bot Mar 20, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

cuichenx commented Mar 20, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Test plan

Summary by CodeRabbit

Uh oh!

copy-pr-bot Bot commented Mar 20, 2026

Uh oh!

cuichenx commented Mar 20, 2026

Uh oh!

coderabbitai Bot commented Mar 20, 2026

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Mar 20, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Mar 20, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

cuichenx commented Mar 20, 2026 •

edited by coderabbitai Bot

Loading