Skip to content

feat: Add first-class Comet ML experiment tracking#2910

Merged
yaoyu-33 merged 7 commits into
mainfrom
feat/comet-ml-logging
Mar 20, 2026
Merged

feat: Add first-class Comet ML experiment tracking#2910
yaoyu-33 merged 7 commits into
mainfrom
feat/comet-ml-logging

Conversation

@cuichenx

@cuichenx cuichenx commented Mar 20, 2026

Copy link
Copy Markdown
Contributor

Summary

  • Adds Comet ML as a fourth experiment tracking backend alongside TensorBoard, Weights & Biases, and MLflow, following the same structural pattern
  • New config fields (comet_project, comet_experiment_name, comet_workspace, comet_api_key, comet_tags) in LoggerConfig with finalize() validation
  • GlobalState.comet_logger property lazily initializes a comet_ml.Experiment on the last rank, logs full training config as parameters, and supports tags
  • All ~18 metric logging call sites in training_log() now dispatch to Comet alongside WandB and MLflow; validation metrics in eval.py also logged
  • Timer metrics via _timers_write_to_comet (no metric name sanitization — Comet supports / natively)
  • Checkpoint tracking callbacks in comet_utils.py wired into checkpointing.py
  • _CometExperimentLogger wrapper for NVIDIA DLFw Inspect MetricLogger
  • CometPlugin added to run_plugins.py for NeMo Run launching
  • Documentation added to docs/training/logging.md

Fixes: #2652

Based on @shanecmoran's original PR #2653, rebased onto current main with uv.lock conflict resolved.

Test plan

  • Unit tests for LoggerConfig comet fields (test_config.py)
  • Unit tests for GlobalState.comet_logger property and _timers_write_to_comet (test_state.py)
  • Unit tests for comet_utils.py checkpoint callbacks (test_comet_utils.py)
  • Functional test: run training with comet_project set and verify metrics appear in Comet dashboard

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Added Comet ML as a metrics and metadata backend for training, alongside TensorBoard, W&B, and MLflow
    • Support for configuring Comet ML project, experiment name, workspace, and tags through LoggerConfig
  • Documentation

    • Added Comet ML logging guide with setup instructions and configuration details
  • Tests

    • Added unit tests for Comet ML integration

shanecmoran and others added 6 commits March 12, 2026 10:58
Add Comet ML as a fourth experiment tracking backend alongside
TensorBoard, Weights & Biases, and MLflow. The integration follows
the exact same structural pattern used by the existing backends.

Config fields: comet_project, comet_experiment_name, comet_workspace,
comet_api_key, comet_tags in LoggerConfig with finalize() validation.

GlobalState.comet_logger property lazily initializes a
comet_ml.Experiment on the last rank, logs the full training config
as parameters, and supports tags.

Training metrics: all ~18 metric logging call sites in training_log()
now dispatch to Comet alongside WandB and MLflow. Validation metrics
in eval.py are also logged. Timer metrics use a new
_timers_write_to_comet patch (no metric name sanitization since Comet
supports / in names natively).

Checkpoint tracking: comet_utils.py provides on_save_checkpoint_success
and on_load_checkpoint_success callbacks wired into checkpointing.py.

Tensor inspect: _CometExperimentLogger wrapper registered with NVIDIA
DLFw Inspect MetricLogger system.

CometPlugin added to run_plugins.py for NeMo Run launching.

Documentation added to docs/training/logging.md.

Fixes: #2652

Signed-off-by: Shane Moran <shane.moran@shopify.com>
- Include comet_api_key and comet_tags in using_comet detection
- Use truthy check for comet_experiment_name (handles None and "")
- Strip whitespace from api_key before passing to Experiment
- Use backing attribute _comet_logger during teardown to avoid
  accidental lazy initialization
- Add stacklevel=2 to timer warning for better caller diagnostics

Signed-off-by: Shane Moran <shane.moran@shopify.com>
wandb and mlflow are both in the main dependency list. Add comet-ml
to match, since it's now a first-class logging backend.

Signed-off-by: Shane Moran <shane.moran@shopify.com>
Signed-off-by: Shane Moran <shane.moran@shopify.com>
Adds missing markers to TestCometLoggerProperty and
TestTimersWriteToComet in test_state.py for consistency with
test_comet_utils.py.

Signed-off-by: Shane Moran <shane.moran@shopify.com>
@cuichenx cuichenx requested a review from a team as a code owner March 20, 2026 00:34
@copy-pr-bot

copy-pr-bot Bot commented Mar 20, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@cuichenx cuichenx marked this pull request as draft March 20, 2026 00:37
Regenerate uv.lock to include comet-ml and its transitive dependencies
(configobj, dulwich, everett, python-box, requests-toolbelt,
semantic-version, simplejson, wurlitzer).

Signed-off-by: Chen Cui <chcui@nvidia.com>
@cuichenx cuichenx marked this pull request as ready for review March 20, 2026 00:40
@cuichenx

Copy link
Copy Markdown
Contributor Author

/ok to test 2771522

@coderabbitai

coderabbitai Bot commented Mar 20, 2026

Copy link
Copy Markdown
Contributor
📝 Walkthrough

Walkthrough

Added first-class Comet ML integration alongside existing TensorBoard, W&B, and MLflow backends. Includes configuration fields, lazy-initialized logger in GlobalState, training/validation/timer/checkpoint metric logging, tensor inspection support, NeMo Run plugin, and comprehensive unit tests.

Changes

Cohort / File(s) Summary
Configuration & Dependencies
pyproject.toml, src/megatron/bridge/training/config.py
Added comet_ml>=3.50.0 dependency and extended LoggerConfig with Comet fields (comet_project, comet_experiment_name, comet_workspace, comet_api_key, comet_tags); added validation requiring non-empty comet_experiment_name when comet_project is set, and import checking for comet_ml package.
State Management & Initialization
src/megatron/bridge/training/state.py, src/megatron/bridge/training/setup.py
Added lazily-initialized comet_logger property to GlobalState (last-rank only, stores comet_ml.Experiment); extended timers initialization with write_to_comet method; added _timers_write_to_comet() for aggregating and logging timer metrics; updated reset_for_restart() to clear Comet logger.
Training Metrics & Logging
src/megatron/bridge/training/train_utils.py, src/megatron/bridge/training/eval.py
Extended training_log() to emit all scalar/throughput/memory/performance metrics to Comet alongside WandB/MLflow; added Comet metric logging in evaluate_and_print_results() for validation loss and perplexity.
Checkpoint Lifecycle
src/megatron/bridge/training/checkpointing.py, src/megatron/bridge/training/utils/comet_utils.py
Hooked Comet finalization callbacks into async/sync checkpoint save flows; added comet_utils module with on_save_checkpoint_success() and on_load_checkpoint_success() for logging checkpoint metadata (last_saved_checkpoint, last_loaded_checkpoint) to Comet.
Tensor Inspection Integration
src/megatron/bridge/training/tensor_inspect.py
Extended finalize_tensor_inspect_post_model_initialization() to accept optional comet_logger parameter; added internal _CometExperimentLogger class to attach Comet as metric backend alongside TensorBoard/W&B.
NeMo Run Plugin
src/megatron/bridge/recipes/run_plugins.py
Added CometPlugin to inject COMET_API_KEY into executor environment and generate CLI argument overrides for logger.comet_project, logger.comet_workspace, and logger.comet_experiment_name when launching training scripts.
Training Cleanup
src/megatron/bridge/training/train.py
Added explicit teardown by calling comet_logger.end() in both main training cleanup path and _finish_train() function.
Documentation
docs/training/logging.md
Added "Comet ML" section documenting configuration fields, installation, authentication via COMET_API_KEY, lazy initialization behavior, NeMo Run integration, and logged metrics/metadata.
Unit Tests
tests/unit_tests/training/test_config.py, tests/unit_tests/training/test_state.py, tests/unit_tests/training/utils/test_comet_utils.py
Added comprehensive test suites covering LoggerConfig.finalize() validation, GlobalState.comet_logger property initialization (rank gating, error handling), _timers_write_to_comet() metric aggregation, and comet_utils checkpoint callback functions.

Sequence Diagram(s)

sequenceDiagram
    actor User
    participant Training as Training Loop
    participant GlobalState
    participant TrainingLog
    participant Checkpointing
    participant CometML as Comet ML

    User->>Training: start training
    Training->>GlobalState: initialize (comet_logger property)
    GlobalState->>CometML: create Experiment
    CometML-->>GlobalState: return Experiment instance
    
    loop Training Steps
        Training->>TrainingLog: log metrics
        TrainingLog->>CometML: log_metrics(scalars, throughput, memory, etc.)
        TrainingLog->>CometML: write_to_comet() for timers
    end
    
    loop Validation Steps
        Training->>Training: evaluate_and_print_results()
        Training->>CometML: log_metrics(validation_loss, ppl)
    end
    
    Training->>Checkpointing: save checkpoint
    Checkpointing->>CometML: log_other(last_saved_checkpoint, iteration)
    
    Training->>GlobalState: cleanup (end comet_logger)
    GlobalState->>CometML: end() experiment
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 6
✅ Passed checks (6 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The PR title 'feat: Add first-class Comet ML experiment tracking' accurately summarizes the main change: adding Comet ML as a first-class experiment tracking backend alongside existing backends.
Linked Issues check ✅ Passed All coding objectives from issue #2652 are fully implemented: config fields, lazy-initialized comet_logger, training/validation/timer metrics logging, checkpoint callbacks, tensor inspect integration, run plugin, and documentation.
Out of Scope Changes check ✅ Passed All changes are directly scoped to implementing first-class Comet ML integration as specified in #2652. No unrelated modifications or unnecessary refactoring detected.
Docstring Coverage ✅ Passed Docstring coverage is 86.67% which is sufficient. The required threshold is 80.00%.
Test Results For Major Changes ✅ Passed PR documentation comprehensively documents ~397 lines of unit test code for the major Comet ML integration feature, covering config validation, logger initialization, timers, and checkpoint callbacks.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/comet-ml-logging
📝 Coding Plan
  • Generate coding plan for human review comments

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
src/megatron/bridge/training/utils/train_utils.py (2)

459-467: ⚠️ Potential issue | 🟠 Major

Comet timer metrics are reset before they’re written.

The surrounding reset logic only accounts for the W&B path. In a Comet-only run, timers.write(..., reset=True) clears the interval before write_to_comet() executes; with W&B enabled, write_to_wandb(..., reset=True) does the same. Defer the reset until the last sink or fan out a single timer snapshot.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/megatron/bridge/training/utils/train_utils.py` around lines 459 - 467,
The timers are being reset before some sinks (notably Comet) get written; change
the logic in train_utils.py so the reset occurs only after the last sink is
written or produce a single snapshot and fan it out. Specifically, call
timers.write(...) with reset=False (or capture a snapshot via a non-resetting
API) then call write_to_wandb, write_to_mlflow, write_to_comet in sequence and
finally perform a single timers.write(..., reset=True) (or invoke the timers'
reset once) so that write_to_comet (and other downstream writers) receive the
same data; update the use of reset_in_tb and the calls to timers.write,
timers.write_to_wandb, timers.write_to_mlflow, and timers.write_to_comet
accordingly.

498-646: ⚠️ Potential issue | 🟠 Major

Don’t gate Comet metrics on a TensorBoard writer.

These new comet_logger.log_metrics(...) calls are still under the surrounding if writer and ... / if writer: guards. When TensorBoard is disabled, Comet silently drops these core metrics even though the experiment is active. Move the interval guard outside and only gate the writer.add_scalar(...) calls.

Also applies to: 713-720

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/megatron/bridge/training/utils/train_utils.py` around lines 498 - 646,
The Comet logging calls (comet_logger.log_metrics(...)) are incorrectly nested
under the TensorBoard writer guards so Comet metrics are dropped when
TensorBoard is disabled; modify the metric-emission blocks (e.g., the blocks
using writer.add_scalar(...) and comet_logger.log_metrics(...) around
memory_report, runtime_report, l2_report, loss_dict, etc.) to only gate
writer.add_scalar(...) with the writer condition while always executing
comet_logger.log_metrics(...) when comet_logger is present, and replicate the
same change for the other affected block mentioned (the similar guard around the
block at lines referenced in the review). Ensure you update the conditional
structure (use separate if writer: for writer.add_scalar calls and a separate if
comet_logger: for comet_logger.log_metrics calls) for functions/variables such
as writer.add_scalar, comet_logger.log_metrics, wandb_writer.log, and
mlflow_logger.log_metrics so Comet logging is no longer behind the TensorBoard
writer guard.
src/megatron/bridge/training/state.py (1)

436-448: ⚠️ Potential issue | 🟠 Major

Call cleanup methods on W&B and Comet loggers in reset_for_restart before clearing them.

The reset_for_restart method clears both _wandb_logger and _comet_logger without calling their respective cleanup methods. In the normal teardown paths in train.py (lines 644-645 and 1320-1323), .finish() is called on the W&B logger and .end() is called on the Comet experiment before clearing. During an in-process restart, these experiments may be abandoned without proper closure, potentially losing buffered metrics.

🔧 Suggested fix
 def reset_for_restart(self) -> None:
     """Reset GlobalState components for in-process restart.

     This cleans up all stateful components that need to be reinitialized between restart iterations.
     The async calls queue for checkpointing is handled separately in aborting in order to clean up persistent workers.
     """
     self._timers = None
     self._train_state = None
     self._tensorboard_logger = None
+    if self._wandb_logger is not None:
+        self._wandb_logger.finish()
     self._wandb_logger = None
     self._mlflow_logger = None
+    if self._comet_logger is not None:
+        self._comet_logger.end()
     self._comet_logger = None
     self._energy_monitor = None
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/megatron/bridge/training/state.py` around lines 436 - 448, In
reset_for_restart, call the proper cleanup methods on the loggers before nulling
them: if self._wandb_logger is set, call its finish() method (handle/ignore
exceptions) then set self._wandb_logger = None; similarly, if self._comet_logger
is set, call its end() method (handle/ignore exceptions) then set
self._comet_logger = None; keep the rest of the existing resets (self._timers,
self._train_state, etc.) unchanged and ensure these calls occur prior to
clearing those attributes in the reset_for_restart method.
🧹 Nitpick comments (3)
tests/unit_tests/training/test_config.py (1)

2913-2929: Make the positive Comet tests prove the branch ran.

These cases only assert “no exception”. If finalize() stops treating comet_workspace, comet_api_key, or comet_tags as Comet activators, they still pass. Please assert that the patched importlib.import_module was exercised, and add standalone comet_api_key / comet_tags cases.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit_tests/training/test_config.py` around lines 2913 - 2929, The
positive Comet tests currently only check for no exception; update
test_finalize_with_comet_workspace_only and
test_finalize_with_all_comet_settings to assert that the patched
importlib.import_module mock was actually called (e.g., get the mock from the
patch context and assert_called/assert_called_once), and add two new tests that
construct LoggerConfig with only comet_api_key and only comet_tags respectively
and likewise patch importlib.import_module and assert it was exercised; locate
references to LoggerConfig.finalize and the test functions
test_finalize_with_comet_workspace_only/test_finalize_with_all_comet_settings to
modify and add these assertions and new test cases.
src/megatron/bridge/training/tensor_inspect.py (1)

75-78: Update docstring to include Comet ML.

The docstring still references only TensorBoard and W&B, but the function now also attaches Comet ML loggers.

📝 Suggested docstring update
 def _maybe_attach_metric_loggers(
     tensorboard_logger: Any | None, wandb_logger: Any | None, comet_logger: Any | None
 ) -> None:
-    """Attach supported metric loggers (TensorBoard, W&B raw module)."""
+    """Attach supported metric loggers (TensorBoard, W&B, Comet ML)."""
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/megatron/bridge/training/tensor_inspect.py` around lines 75 - 78, The
docstring for _maybe_attach_metric_loggers currently mentions only TensorBoard
and W&B; update it to include Comet ML so it accurately describes all supported
metric loggers (TensorBoard, W&B raw module, and Comet ML) and reference the
parameters tensorboard_logger, wandb_logger, and comet_logger to clarify which
logger each argument accepts; keep the docstring concise and consistent with the
existing style.
tests/unit_tests/training/test_state.py (1)

989-1074: Consider adding a positive test case for successful Comet logger initialization.

The test class covers disabled/None/wrong-rank/missing-name scenarios, but lacks a test verifying successful initialization on the last rank with valid config (similar to test_wandb_logger_property_enabled_rank_n_minus_1 at line 372 and test_mlflow_logger_property_enabled_rank_n_minus_1 at line 680).

💡 Example test case
def test_comet_logger_property_enabled_rank_n_minus_1(self):
    """Test comet logger enabled for rank N-1."""
    state = GlobalState()
    mock_config = MagicMock()
    mock_config.logger.comet_project = "test_project"
    mock_config.logger.comet_experiment_name = "test_experiment"
    mock_config.logger.comet_api_key = None
    mock_config.logger.comet_workspace = "test_workspace"
    mock_config.logger.comet_tags = ["tag1", "tag2"]
    mock_config.to_dict.return_value = {"config": "data"}
    state._cfg = mock_config

    mock_experiment = MagicMock()
    mock_comet_ml = MagicMock()
    mock_comet_ml.Experiment.return_value = mock_experiment

    with (
        patch("megatron.bridge.training.state.get_rank_safe", return_value=3),
        patch("megatron.bridge.training.state.get_world_size_safe", return_value=4),
        patch.dict("sys.modules", {"comet_ml": mock_comet_ml}),
    ):
        # Re-import to use patched comet_ml
        import importlib
        import megatron.bridge.training.state as state_module
        importlib.reload(state_module)
        
        state = state_module.GlobalState()
        state._cfg = mock_config
        
        with (
            patch("megatron.bridge.training.state.get_rank_safe", return_value=3),
            patch("megatron.bridge.training.state.get_world_size_safe", return_value=4),
        ):
            logger = state.comet_logger
            
            mock_comet_ml.Experiment.assert_called_once()
            mock_experiment.set_name.assert_called_once_with("test_experiment")
            mock_experiment.add_tags.assert_called_once_with(["tag1", "tag2"])
            mock_experiment.log_parameters.assert_called_once()
            assert logger == mock_experiment
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit_tests/training/test_state.py` around lines 989 - 1074, Add a
positive unit test that verifies GlobalState.comet_logger initializes a Comet
Experiment on the last rank: create a MagicMock config with
logger.comet_project, logger.comet_experiment_name, logger.comet_workspace,
logger.comet_tags and to_dict(), patch get_rank_safe to return world_size-1 and
get_world_size_safe accordingly, inject a mocked comet_ml into sys.modules whose
Experiment returns a mock_experiment, reload megatron.bridge.training.state to
pick up the patched module, then instantiate GlobalState and assert
mock_comet_ml.Experiment was called, mock_experiment.set_name called with the
experiment name, mock_experiment.add_tags called with the tags,
mock_experiment.log_parameters called, and that state.comet_logger equals
mock_experiment.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/megatron/bridge/training/checkpointing.py`:
- Around line 837-849: The comet_finalize_fn closure currently reads
train_state.step and state.comet_logger at execution time; snapshot the needed
values eagerly so the async callback uses the intended checkpoint context:
before defining comet_finalize_fn capture local variables like step_snapshot =
train_state.step and comet_logger_snapshot = state.comet_logger (and keep
checkpoint_name/save_dir as they already are), then have comet_finalize_fn call
comet_utils.on_save_checkpoint_success(checkpoint_name, save_dir, step_snapshot,
comet_logger=comet_logger_snapshot); register that closure via
async_save_request.add_finalize_fn as before.

In `@src/megatron/bridge/training/utils/comet_utils.py`:
- Around line 21-48: The on_save_checkpoint_success function currently ignores
the save_dir parameter causing inconsistency with wandb_utils.py and
mlflow_utils.py; keep the signature and update the implementation and docstring
to log save_dir as metadata to Comet ML (e.g., call
comet_logger.log_other("checkpoint_save_dir", str(Path(save_dir).resolve())) and
also include save_dir when logging the resolved checkpoint path/iteration),
ensure you still resolve checkpoint_path via Path and wrap logs in the existing
try/except, and update the docstring to state that save_dir is recorded to Comet
ML metadata; use the existing symbols on_save_checkpoint_success,
comet_logger.log_other, Path, and the except block for error handling.

---

Outside diff comments:
In `@src/megatron/bridge/training/state.py`:
- Around line 436-448: In reset_for_restart, call the proper cleanup methods on
the loggers before nulling them: if self._wandb_logger is set, call its finish()
method (handle/ignore exceptions) then set self._wandb_logger = None; similarly,
if self._comet_logger is set, call its end() method (handle/ignore exceptions)
then set self._comet_logger = None; keep the rest of the existing resets
(self._timers, self._train_state, etc.) unchanged and ensure these calls occur
prior to clearing those attributes in the reset_for_restart method.

In `@src/megatron/bridge/training/utils/train_utils.py`:
- Around line 459-467: The timers are being reset before some sinks (notably
Comet) get written; change the logic in train_utils.py so the reset occurs only
after the last sink is written or produce a single snapshot and fan it out.
Specifically, call timers.write(...) with reset=False (or capture a snapshot via
a non-resetting API) then call write_to_wandb, write_to_mlflow, write_to_comet
in sequence and finally perform a single timers.write(..., reset=True) (or
invoke the timers' reset once) so that write_to_comet (and other downstream
writers) receive the same data; update the use of reset_in_tb and the calls to
timers.write, timers.write_to_wandb, timers.write_to_mlflow, and
timers.write_to_comet accordingly.
- Around line 498-646: The Comet logging calls (comet_logger.log_metrics(...))
are incorrectly nested under the TensorBoard writer guards so Comet metrics are
dropped when TensorBoard is disabled; modify the metric-emission blocks (e.g.,
the blocks using writer.add_scalar(...) and comet_logger.log_metrics(...) around
memory_report, runtime_report, l2_report, loss_dict, etc.) to only gate
writer.add_scalar(...) with the writer condition while always executing
comet_logger.log_metrics(...) when comet_logger is present, and replicate the
same change for the other affected block mentioned (the similar guard around the
block at lines referenced in the review). Ensure you update the conditional
structure (use separate if writer: for writer.add_scalar calls and a separate if
comet_logger: for comet_logger.log_metrics calls) for functions/variables such
as writer.add_scalar, comet_logger.log_metrics, wandb_writer.log, and
mlflow_logger.log_metrics so Comet logging is no longer behind the TensorBoard
writer guard.

---

Nitpick comments:
In `@src/megatron/bridge/training/tensor_inspect.py`:
- Around line 75-78: The docstring for _maybe_attach_metric_loggers currently
mentions only TensorBoard and W&B; update it to include Comet ML so it
accurately describes all supported metric loggers (TensorBoard, W&B raw module,
and Comet ML) and reference the parameters tensorboard_logger, wandb_logger, and
comet_logger to clarify which logger each argument accepts; keep the docstring
concise and consistent with the existing style.

In `@tests/unit_tests/training/test_config.py`:
- Around line 2913-2929: The positive Comet tests currently only check for no
exception; update test_finalize_with_comet_workspace_only and
test_finalize_with_all_comet_settings to assert that the patched
importlib.import_module mock was actually called (e.g., get the mock from the
patch context and assert_called/assert_called_once), and add two new tests that
construct LoggerConfig with only comet_api_key and only comet_tags respectively
and likewise patch importlib.import_module and assert it was exercised; locate
references to LoggerConfig.finalize and the test functions
test_finalize_with_comet_workspace_only/test_finalize_with_all_comet_settings to
modify and add these assertions and new test cases.

In `@tests/unit_tests/training/test_state.py`:
- Around line 989-1074: Add a positive unit test that verifies
GlobalState.comet_logger initializes a Comet Experiment on the last rank: create
a MagicMock config with logger.comet_project, logger.comet_experiment_name,
logger.comet_workspace, logger.comet_tags and to_dict(), patch get_rank_safe to
return world_size-1 and get_world_size_safe accordingly, inject a mocked
comet_ml into sys.modules whose Experiment returns a mock_experiment, reload
megatron.bridge.training.state to pick up the patched module, then instantiate
GlobalState and assert mock_comet_ml.Experiment was called,
mock_experiment.set_name called with the experiment name,
mock_experiment.add_tags called with the tags, mock_experiment.log_parameters
called, and that state.comet_logger equals mock_experiment.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 312df4cc-c897-4181-bdd7-faed3c692fdc

📥 Commits

Reviewing files that changed from the base of the PR and between 94a90b2 and 6dabdbe.

📒 Files selected for processing (15)
  • docs/training/logging.md
  • pyproject.toml
  • src/megatron/bridge/recipes/run_plugins.py
  • src/megatron/bridge/training/checkpointing.py
  • src/megatron/bridge/training/config.py
  • src/megatron/bridge/training/eval.py
  • src/megatron/bridge/training/setup.py
  • src/megatron/bridge/training/state.py
  • src/megatron/bridge/training/tensor_inspect.py
  • src/megatron/bridge/training/train.py
  • src/megatron/bridge/training/utils/comet_utils.py
  • src/megatron/bridge/training/utils/train_utils.py
  • tests/unit_tests/training/test_config.py
  • tests/unit_tests/training/test_state.py
  • tests/unit_tests/training/utils/test_comet_utils.py

Comment on lines +837 to +849
def comet_finalize_fn() -> None:
comet_utils.on_save_checkpoint_success(
checkpoint_name,
save_dir,
train_state.step,
comet_logger=state.comet_logger,
)

if ckpt_cfg.async_save:
assert async_save_request is not None
async_save_request.add_finalize_fn(wandb_finalize_fn)
async_save_request.add_finalize_fn(mlflow_finalize_fn)
async_save_request.add_finalize_fn(comet_finalize_fn)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Snapshot the Comet run and checkpoint step before registering the callback.

In async saves this closure executes later, so train_state.step may already refer to a newer step than checkpoint_name, and state.comet_logger is resolved after any logger lifecycle changes. Capture both values eagerly and pass the snapshots to on_save_checkpoint_success().

Suggested change
+        checkpoint_step = train_state.step
+        current_comet_logger = state.comet_logger
+
         def comet_finalize_fn() -> None:
             comet_utils.on_save_checkpoint_success(
                 checkpoint_name,
                 save_dir,
-                train_state.step,
-                comet_logger=state.comet_logger,
+                checkpoint_step,
+                comet_logger=current_comet_logger,
             )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/megatron/bridge/training/checkpointing.py` around lines 837 - 849, The
comet_finalize_fn closure currently reads train_state.step and
state.comet_logger at execution time; snapshot the needed values eagerly so the
async callback uses the intended checkpoint context: before defining
comet_finalize_fn capture local variables like step_snapshot = train_state.step
and comet_logger_snapshot = state.comet_logger (and keep
checkpoint_name/save_dir as they already are), then have comet_finalize_fn call
comet_utils.on_save_checkpoint_success(checkpoint_name, save_dir, step_snapshot,
comet_logger=comet_logger_snapshot); register that closure via
async_save_request.add_finalize_fn as before.

Comment on lines +21 to +48
def on_save_checkpoint_success(
checkpoint_path: str,
save_dir: str,
iteration: int,
comet_logger: Optional[Any],
) -> None:
"""Callback executed after a checkpoint is successfully saved.

If a Comet ML experiment is provided, records the checkpoint path and iteration
as experiment metadata.

Args:
checkpoint_path: The path to the specific checkpoint file/directory saved.
save_dir: The base directory where checkpoints are being saved.
iteration: The training iteration at which the checkpoint was saved.
comet_logger: The Comet ML Experiment instance.
If None, this function is a no-op.
"""
if comet_logger is None:
return

try:
resolved_ckpt = str(Path(checkpoint_path).resolve())
comet_logger.log_other("last_saved_checkpoint", resolved_ckpt)
comet_logger.log_other("last_saved_iteration", iteration)
except Exception as exc:
print_rank_last(f"Failed to log checkpoint information to Comet ML: {exc}")

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check how wandb_utils and mlflow_utils use save_dir in their on_save_checkpoint_success
rg -n "on_save_checkpoint_success" --type=py -A 20 src/megatron/bridge/training/utils/wandb_utils.py src/megatron/bridge/training/utils/mlflow_utils.py

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 3939


🏁 Script executed:

#!/bin/bash
# Get complete wandb implementation
echo "=== WANDB UTILS ===" 
rg -n "on_save_checkpoint_success" --type=py -A 30 src/megatron/bridge/training/utils/wandb_utils.py

echo ""
echo "=== MLFLOW UTILS ===" 
# Get complete mlflow implementation
rg -n "on_save_checkpoint_success" --type=py -A 30 src/megatron/bridge/training/utils/mlflow_utils.py

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 3177


save_dir parameter is unused and inconsistent with other logger implementations.

The save_dir parameter is documented but never used in the function body, while both wandb_utils.py and mlflow_utils.py actively use it to construct artifact metadata. Either use save_dir for additional Comet ML logging (to maintain API consistency with other backends) or remove it from the signature.

🧰 Tools
🪛 Ruff (0.15.6)

[warning] 46-46: Do not catch blind exception: Exception

(BLE001)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/megatron/bridge/training/utils/comet_utils.py` around lines 21 - 48, The
on_save_checkpoint_success function currently ignores the save_dir parameter
causing inconsistency with wandb_utils.py and mlflow_utils.py; keep the
signature and update the implementation and docstring to log save_dir as
metadata to Comet ML (e.g., call comet_logger.log_other("checkpoint_save_dir",
str(Path(save_dir).resolve())) and also include save_dir when logging the
resolved checkpoint path/iteration), ensure you still resolve checkpoint_path
via Path and wrap logs in the existing try/except, and update the docstring to
state that save_dir is recorded to Comet ML metadata; use the existing symbols
on_save_checkpoint_success, comet_logger.log_other, Path, and the except block
for error handling.

@cuichenx cuichenx added area:training Training loop, callbacks, and runtime integration ready-to-merge PR is approved, current, and only waiting for CI to pass before merge and removed area:training Training loop, callbacks, and runtime integration labels Mar 20, 2026
@yaoyu-33 yaoyu-33 merged commit 80a8070 into main Mar 20, 2026
37 of 38 checks passed
@yaoyu-33 yaoyu-33 deleted the feat/comet-ml-logging branch March 20, 2026 17:57
liding-nv pushed a commit that referenced this pull request Mar 22, 2026
Signed-off-by: Shane Moran <shane.moran@shopify.com>
Signed-off-by: Chen Cui <chcui@nvidia.com>
Co-authored-by: Shane Moran <shane.moran@shopify.com>
Signed-off-by: Li Ding <liding@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area:training Training loop, callbacks, and runtime integration ready-to-merge PR is approved, current, and only waiting for CI to pass before merge

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat: Add first-class Comet ML experiment tracking

3 participants