feat: Add first-class Comet ML experiment tracking#2910
Conversation
Add Comet ML as a fourth experiment tracking backend alongside TensorBoard, Weights & Biases, and MLflow. The integration follows the exact same structural pattern used by the existing backends. Config fields: comet_project, comet_experiment_name, comet_workspace, comet_api_key, comet_tags in LoggerConfig with finalize() validation. GlobalState.comet_logger property lazily initializes a comet_ml.Experiment on the last rank, logs the full training config as parameters, and supports tags. Training metrics: all ~18 metric logging call sites in training_log() now dispatch to Comet alongside WandB and MLflow. Validation metrics in eval.py are also logged. Timer metrics use a new _timers_write_to_comet patch (no metric name sanitization since Comet supports / in names natively). Checkpoint tracking: comet_utils.py provides on_save_checkpoint_success and on_load_checkpoint_success callbacks wired into checkpointing.py. Tensor inspect: _CometExperimentLogger wrapper registered with NVIDIA DLFw Inspect MetricLogger system. CometPlugin added to run_plugins.py for NeMo Run launching. Documentation added to docs/training/logging.md. Fixes: #2652 Signed-off-by: Shane Moran <shane.moran@shopify.com>
- Include comet_api_key and comet_tags in using_comet detection - Use truthy check for comet_experiment_name (handles None and "") - Strip whitespace from api_key before passing to Experiment - Use backing attribute _comet_logger during teardown to avoid accidental lazy initialization - Add stacklevel=2 to timer warning for better caller diagnostics Signed-off-by: Shane Moran <shane.moran@shopify.com>
wandb and mlflow are both in the main dependency list. Add comet-ml to match, since it's now a first-class logging backend. Signed-off-by: Shane Moran <shane.moran@shopify.com>
Signed-off-by: Shane Moran <shane.moran@shopify.com>
Adds missing markers to TestCometLoggerProperty and TestTimersWriteToComet in test_state.py for consistency with test_comet_utils.py. Signed-off-by: Shane Moran <shane.moran@shopify.com>
# Conflicts: # uv.lock
Regenerate uv.lock to include comet-ml and its transitive dependencies (configobj, dulwich, everett, python-box, requests-toolbelt, semantic-version, simplejson, wurlitzer). Signed-off-by: Chen Cui <chcui@nvidia.com>
|
/ok to test 2771522 |
📝 WalkthroughWalkthroughAdded first-class Comet ML integration alongside existing TensorBoard, W&B, and MLflow backends. Includes configuration fields, lazy-initialized logger in GlobalState, training/validation/timer/checkpoint metric logging, tensor inspection support, NeMo Run plugin, and comprehensive unit tests. Changes
Sequence Diagram(s)sequenceDiagram
actor User
participant Training as Training Loop
participant GlobalState
participant TrainingLog
participant Checkpointing
participant CometML as Comet ML
User->>Training: start training
Training->>GlobalState: initialize (comet_logger property)
GlobalState->>CometML: create Experiment
CometML-->>GlobalState: return Experiment instance
loop Training Steps
Training->>TrainingLog: log metrics
TrainingLog->>CometML: log_metrics(scalars, throughput, memory, etc.)
TrainingLog->>CometML: write_to_comet() for timers
end
loop Validation Steps
Training->>Training: evaluate_and_print_results()
Training->>CometML: log_metrics(validation_loss, ppl)
end
Training->>Checkpointing: save checkpoint
Checkpointing->>CometML: log_other(last_saved_checkpoint, iteration)
Training->>GlobalState: cleanup (end comet_logger)
GlobalState->>CometML: end() experiment
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes 🚥 Pre-merge checks | ✅ 6✅ Passed checks (6 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
📝 Coding Plan
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
src/megatron/bridge/training/utils/train_utils.py (2)
459-467:⚠️ Potential issue | 🟠 MajorComet timer metrics are reset before they’re written.
The surrounding reset logic only accounts for the W&B path. In a Comet-only run,
timers.write(..., reset=True)clears the interval beforewrite_to_comet()executes; with W&B enabled,write_to_wandb(..., reset=True)does the same. Defer the reset until the last sink or fan out a single timer snapshot.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/megatron/bridge/training/utils/train_utils.py` around lines 459 - 467, The timers are being reset before some sinks (notably Comet) get written; change the logic in train_utils.py so the reset occurs only after the last sink is written or produce a single snapshot and fan it out. Specifically, call timers.write(...) with reset=False (or capture a snapshot via a non-resetting API) then call write_to_wandb, write_to_mlflow, write_to_comet in sequence and finally perform a single timers.write(..., reset=True) (or invoke the timers' reset once) so that write_to_comet (and other downstream writers) receive the same data; update the use of reset_in_tb and the calls to timers.write, timers.write_to_wandb, timers.write_to_mlflow, and timers.write_to_comet accordingly.
498-646:⚠️ Potential issue | 🟠 MajorDon’t gate Comet metrics on a TensorBoard writer.
These new
comet_logger.log_metrics(...)calls are still under the surroundingif writer and .../if writer:guards. When TensorBoard is disabled, Comet silently drops these core metrics even though the experiment is active. Move the interval guard outside and only gate thewriter.add_scalar(...)calls.Also applies to: 713-720
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/megatron/bridge/training/utils/train_utils.py` around lines 498 - 646, The Comet logging calls (comet_logger.log_metrics(...)) are incorrectly nested under the TensorBoard writer guards so Comet metrics are dropped when TensorBoard is disabled; modify the metric-emission blocks (e.g., the blocks using writer.add_scalar(...) and comet_logger.log_metrics(...) around memory_report, runtime_report, l2_report, loss_dict, etc.) to only gate writer.add_scalar(...) with the writer condition while always executing comet_logger.log_metrics(...) when comet_logger is present, and replicate the same change for the other affected block mentioned (the similar guard around the block at lines referenced in the review). Ensure you update the conditional structure (use separate if writer: for writer.add_scalar calls and a separate if comet_logger: for comet_logger.log_metrics calls) for functions/variables such as writer.add_scalar, comet_logger.log_metrics, wandb_writer.log, and mlflow_logger.log_metrics so Comet logging is no longer behind the TensorBoard writer guard.src/megatron/bridge/training/state.py (1)
436-448:⚠️ Potential issue | 🟠 MajorCall cleanup methods on W&B and Comet loggers in
reset_for_restartbefore clearing them.The
reset_for_restartmethod clears both_wandb_loggerand_comet_loggerwithout calling their respective cleanup methods. In the normal teardown paths intrain.py(lines 644-645 and 1320-1323),.finish()is called on the W&B logger and.end()is called on the Comet experiment before clearing. During an in-process restart, these experiments may be abandoned without proper closure, potentially losing buffered metrics.🔧 Suggested fix
def reset_for_restart(self) -> None: """Reset GlobalState components for in-process restart. This cleans up all stateful components that need to be reinitialized between restart iterations. The async calls queue for checkpointing is handled separately in aborting in order to clean up persistent workers. """ self._timers = None self._train_state = None self._tensorboard_logger = None + if self._wandb_logger is not None: + self._wandb_logger.finish() self._wandb_logger = None self._mlflow_logger = None + if self._comet_logger is not None: + self._comet_logger.end() self._comet_logger = None self._energy_monitor = None🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/megatron/bridge/training/state.py` around lines 436 - 448, In reset_for_restart, call the proper cleanup methods on the loggers before nulling them: if self._wandb_logger is set, call its finish() method (handle/ignore exceptions) then set self._wandb_logger = None; similarly, if self._comet_logger is set, call its end() method (handle/ignore exceptions) then set self._comet_logger = None; keep the rest of the existing resets (self._timers, self._train_state, etc.) unchanged and ensure these calls occur prior to clearing those attributes in the reset_for_restart method.
🧹 Nitpick comments (3)
tests/unit_tests/training/test_config.py (1)
2913-2929: Make the positive Comet tests prove the branch ran.These cases only assert “no exception”. If
finalize()stops treatingcomet_workspace,comet_api_key, orcomet_tagsas Comet activators, they still pass. Please assert that the patchedimportlib.import_modulewas exercised, and add standalonecomet_api_key/comet_tagscases.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unit_tests/training/test_config.py` around lines 2913 - 2929, The positive Comet tests currently only check for no exception; update test_finalize_with_comet_workspace_only and test_finalize_with_all_comet_settings to assert that the patched importlib.import_module mock was actually called (e.g., get the mock from the patch context and assert_called/assert_called_once), and add two new tests that construct LoggerConfig with only comet_api_key and only comet_tags respectively and likewise patch importlib.import_module and assert it was exercised; locate references to LoggerConfig.finalize and the test functions test_finalize_with_comet_workspace_only/test_finalize_with_all_comet_settings to modify and add these assertions and new test cases.src/megatron/bridge/training/tensor_inspect.py (1)
75-78: Update docstring to include Comet ML.The docstring still references only TensorBoard and W&B, but the function now also attaches Comet ML loggers.
📝 Suggested docstring update
def _maybe_attach_metric_loggers( tensorboard_logger: Any | None, wandb_logger: Any | None, comet_logger: Any | None ) -> None: - """Attach supported metric loggers (TensorBoard, W&B raw module).""" + """Attach supported metric loggers (TensorBoard, W&B, Comet ML)."""🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/megatron/bridge/training/tensor_inspect.py` around lines 75 - 78, The docstring for _maybe_attach_metric_loggers currently mentions only TensorBoard and W&B; update it to include Comet ML so it accurately describes all supported metric loggers (TensorBoard, W&B raw module, and Comet ML) and reference the parameters tensorboard_logger, wandb_logger, and comet_logger to clarify which logger each argument accepts; keep the docstring concise and consistent with the existing style.tests/unit_tests/training/test_state.py (1)
989-1074: Consider adding a positive test case for successful Comet logger initialization.The test class covers disabled/None/wrong-rank/missing-name scenarios, but lacks a test verifying successful initialization on the last rank with valid config (similar to
test_wandb_logger_property_enabled_rank_n_minus_1at line 372 andtest_mlflow_logger_property_enabled_rank_n_minus_1at line 680).💡 Example test case
def test_comet_logger_property_enabled_rank_n_minus_1(self): """Test comet logger enabled for rank N-1.""" state = GlobalState() mock_config = MagicMock() mock_config.logger.comet_project = "test_project" mock_config.logger.comet_experiment_name = "test_experiment" mock_config.logger.comet_api_key = None mock_config.logger.comet_workspace = "test_workspace" mock_config.logger.comet_tags = ["tag1", "tag2"] mock_config.to_dict.return_value = {"config": "data"} state._cfg = mock_config mock_experiment = MagicMock() mock_comet_ml = MagicMock() mock_comet_ml.Experiment.return_value = mock_experiment with ( patch("megatron.bridge.training.state.get_rank_safe", return_value=3), patch("megatron.bridge.training.state.get_world_size_safe", return_value=4), patch.dict("sys.modules", {"comet_ml": mock_comet_ml}), ): # Re-import to use patched comet_ml import importlib import megatron.bridge.training.state as state_module importlib.reload(state_module) state = state_module.GlobalState() state._cfg = mock_config with ( patch("megatron.bridge.training.state.get_rank_safe", return_value=3), patch("megatron.bridge.training.state.get_world_size_safe", return_value=4), ): logger = state.comet_logger mock_comet_ml.Experiment.assert_called_once() mock_experiment.set_name.assert_called_once_with("test_experiment") mock_experiment.add_tags.assert_called_once_with(["tag1", "tag2"]) mock_experiment.log_parameters.assert_called_once() assert logger == mock_experiment🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unit_tests/training/test_state.py` around lines 989 - 1074, Add a positive unit test that verifies GlobalState.comet_logger initializes a Comet Experiment on the last rank: create a MagicMock config with logger.comet_project, logger.comet_experiment_name, logger.comet_workspace, logger.comet_tags and to_dict(), patch get_rank_safe to return world_size-1 and get_world_size_safe accordingly, inject a mocked comet_ml into sys.modules whose Experiment returns a mock_experiment, reload megatron.bridge.training.state to pick up the patched module, then instantiate GlobalState and assert mock_comet_ml.Experiment was called, mock_experiment.set_name called with the experiment name, mock_experiment.add_tags called with the tags, mock_experiment.log_parameters called, and that state.comet_logger equals mock_experiment.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/megatron/bridge/training/checkpointing.py`:
- Around line 837-849: The comet_finalize_fn closure currently reads
train_state.step and state.comet_logger at execution time; snapshot the needed
values eagerly so the async callback uses the intended checkpoint context:
before defining comet_finalize_fn capture local variables like step_snapshot =
train_state.step and comet_logger_snapshot = state.comet_logger (and keep
checkpoint_name/save_dir as they already are), then have comet_finalize_fn call
comet_utils.on_save_checkpoint_success(checkpoint_name, save_dir, step_snapshot,
comet_logger=comet_logger_snapshot); register that closure via
async_save_request.add_finalize_fn as before.
In `@src/megatron/bridge/training/utils/comet_utils.py`:
- Around line 21-48: The on_save_checkpoint_success function currently ignores
the save_dir parameter causing inconsistency with wandb_utils.py and
mlflow_utils.py; keep the signature and update the implementation and docstring
to log save_dir as metadata to Comet ML (e.g., call
comet_logger.log_other("checkpoint_save_dir", str(Path(save_dir).resolve())) and
also include save_dir when logging the resolved checkpoint path/iteration),
ensure you still resolve checkpoint_path via Path and wrap logs in the existing
try/except, and update the docstring to state that save_dir is recorded to Comet
ML metadata; use the existing symbols on_save_checkpoint_success,
comet_logger.log_other, Path, and the except block for error handling.
---
Outside diff comments:
In `@src/megatron/bridge/training/state.py`:
- Around line 436-448: In reset_for_restart, call the proper cleanup methods on
the loggers before nulling them: if self._wandb_logger is set, call its finish()
method (handle/ignore exceptions) then set self._wandb_logger = None; similarly,
if self._comet_logger is set, call its end() method (handle/ignore exceptions)
then set self._comet_logger = None; keep the rest of the existing resets
(self._timers, self._train_state, etc.) unchanged and ensure these calls occur
prior to clearing those attributes in the reset_for_restart method.
In `@src/megatron/bridge/training/utils/train_utils.py`:
- Around line 459-467: The timers are being reset before some sinks (notably
Comet) get written; change the logic in train_utils.py so the reset occurs only
after the last sink is written or produce a single snapshot and fan it out.
Specifically, call timers.write(...) with reset=False (or capture a snapshot via
a non-resetting API) then call write_to_wandb, write_to_mlflow, write_to_comet
in sequence and finally perform a single timers.write(..., reset=True) (or
invoke the timers' reset once) so that write_to_comet (and other downstream
writers) receive the same data; update the use of reset_in_tb and the calls to
timers.write, timers.write_to_wandb, timers.write_to_mlflow, and
timers.write_to_comet accordingly.
- Around line 498-646: The Comet logging calls (comet_logger.log_metrics(...))
are incorrectly nested under the TensorBoard writer guards so Comet metrics are
dropped when TensorBoard is disabled; modify the metric-emission blocks (e.g.,
the blocks using writer.add_scalar(...) and comet_logger.log_metrics(...) around
memory_report, runtime_report, l2_report, loss_dict, etc.) to only gate
writer.add_scalar(...) with the writer condition while always executing
comet_logger.log_metrics(...) when comet_logger is present, and replicate the
same change for the other affected block mentioned (the similar guard around the
block at lines referenced in the review). Ensure you update the conditional
structure (use separate if writer: for writer.add_scalar calls and a separate if
comet_logger: for comet_logger.log_metrics calls) for functions/variables such
as writer.add_scalar, comet_logger.log_metrics, wandb_writer.log, and
mlflow_logger.log_metrics so Comet logging is no longer behind the TensorBoard
writer guard.
---
Nitpick comments:
In `@src/megatron/bridge/training/tensor_inspect.py`:
- Around line 75-78: The docstring for _maybe_attach_metric_loggers currently
mentions only TensorBoard and W&B; update it to include Comet ML so it
accurately describes all supported metric loggers (TensorBoard, W&B raw module,
and Comet ML) and reference the parameters tensorboard_logger, wandb_logger, and
comet_logger to clarify which logger each argument accepts; keep the docstring
concise and consistent with the existing style.
In `@tests/unit_tests/training/test_config.py`:
- Around line 2913-2929: The positive Comet tests currently only check for no
exception; update test_finalize_with_comet_workspace_only and
test_finalize_with_all_comet_settings to assert that the patched
importlib.import_module mock was actually called (e.g., get the mock from the
patch context and assert_called/assert_called_once), and add two new tests that
construct LoggerConfig with only comet_api_key and only comet_tags respectively
and likewise patch importlib.import_module and assert it was exercised; locate
references to LoggerConfig.finalize and the test functions
test_finalize_with_comet_workspace_only/test_finalize_with_all_comet_settings to
modify and add these assertions and new test cases.
In `@tests/unit_tests/training/test_state.py`:
- Around line 989-1074: Add a positive unit test that verifies
GlobalState.comet_logger initializes a Comet Experiment on the last rank: create
a MagicMock config with logger.comet_project, logger.comet_experiment_name,
logger.comet_workspace, logger.comet_tags and to_dict(), patch get_rank_safe to
return world_size-1 and get_world_size_safe accordingly, inject a mocked
comet_ml into sys.modules whose Experiment returns a mock_experiment, reload
megatron.bridge.training.state to pick up the patched module, then instantiate
GlobalState and assert mock_comet_ml.Experiment was called,
mock_experiment.set_name called with the experiment name,
mock_experiment.add_tags called with the tags, mock_experiment.log_parameters
called, and that state.comet_logger equals mock_experiment.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 312df4cc-c897-4181-bdd7-faed3c692fdc
📒 Files selected for processing (15)
docs/training/logging.mdpyproject.tomlsrc/megatron/bridge/recipes/run_plugins.pysrc/megatron/bridge/training/checkpointing.pysrc/megatron/bridge/training/config.pysrc/megatron/bridge/training/eval.pysrc/megatron/bridge/training/setup.pysrc/megatron/bridge/training/state.pysrc/megatron/bridge/training/tensor_inspect.pysrc/megatron/bridge/training/train.pysrc/megatron/bridge/training/utils/comet_utils.pysrc/megatron/bridge/training/utils/train_utils.pytests/unit_tests/training/test_config.pytests/unit_tests/training/test_state.pytests/unit_tests/training/utils/test_comet_utils.py
| def comet_finalize_fn() -> None: | ||
| comet_utils.on_save_checkpoint_success( | ||
| checkpoint_name, | ||
| save_dir, | ||
| train_state.step, | ||
| comet_logger=state.comet_logger, | ||
| ) | ||
|
|
||
| if ckpt_cfg.async_save: | ||
| assert async_save_request is not None | ||
| async_save_request.add_finalize_fn(wandb_finalize_fn) | ||
| async_save_request.add_finalize_fn(mlflow_finalize_fn) | ||
| async_save_request.add_finalize_fn(comet_finalize_fn) |
There was a problem hiding this comment.
Snapshot the Comet run and checkpoint step before registering the callback.
In async saves this closure executes later, so train_state.step may already refer to a newer step than checkpoint_name, and state.comet_logger is resolved after any logger lifecycle changes. Capture both values eagerly and pass the snapshots to on_save_checkpoint_success().
Suggested change
+ checkpoint_step = train_state.step
+ current_comet_logger = state.comet_logger
+
def comet_finalize_fn() -> None:
comet_utils.on_save_checkpoint_success(
checkpoint_name,
save_dir,
- train_state.step,
- comet_logger=state.comet_logger,
+ checkpoint_step,
+ comet_logger=current_comet_logger,
)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/megatron/bridge/training/checkpointing.py` around lines 837 - 849, The
comet_finalize_fn closure currently reads train_state.step and
state.comet_logger at execution time; snapshot the needed values eagerly so the
async callback uses the intended checkpoint context: before defining
comet_finalize_fn capture local variables like step_snapshot = train_state.step
and comet_logger_snapshot = state.comet_logger (and keep
checkpoint_name/save_dir as they already are), then have comet_finalize_fn call
comet_utils.on_save_checkpoint_success(checkpoint_name, save_dir, step_snapshot,
comet_logger=comet_logger_snapshot); register that closure via
async_save_request.add_finalize_fn as before.
| def on_save_checkpoint_success( | ||
| checkpoint_path: str, | ||
| save_dir: str, | ||
| iteration: int, | ||
| comet_logger: Optional[Any], | ||
| ) -> None: | ||
| """Callback executed after a checkpoint is successfully saved. | ||
|
|
||
| If a Comet ML experiment is provided, records the checkpoint path and iteration | ||
| as experiment metadata. | ||
|
|
||
| Args: | ||
| checkpoint_path: The path to the specific checkpoint file/directory saved. | ||
| save_dir: The base directory where checkpoints are being saved. | ||
| iteration: The training iteration at which the checkpoint was saved. | ||
| comet_logger: The Comet ML Experiment instance. | ||
| If None, this function is a no-op. | ||
| """ | ||
| if comet_logger is None: | ||
| return | ||
|
|
||
| try: | ||
| resolved_ckpt = str(Path(checkpoint_path).resolve()) | ||
| comet_logger.log_other("last_saved_checkpoint", resolved_ckpt) | ||
| comet_logger.log_other("last_saved_iteration", iteration) | ||
| except Exception as exc: | ||
| print_rank_last(f"Failed to log checkpoint information to Comet ML: {exc}") | ||
|
|
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Check how wandb_utils and mlflow_utils use save_dir in their on_save_checkpoint_success
rg -n "on_save_checkpoint_success" --type=py -A 20 src/megatron/bridge/training/utils/wandb_utils.py src/megatron/bridge/training/utils/mlflow_utils.pyRepository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 3939
🏁 Script executed:
#!/bin/bash
# Get complete wandb implementation
echo "=== WANDB UTILS ==="
rg -n "on_save_checkpoint_success" --type=py -A 30 src/megatron/bridge/training/utils/wandb_utils.py
echo ""
echo "=== MLFLOW UTILS ==="
# Get complete mlflow implementation
rg -n "on_save_checkpoint_success" --type=py -A 30 src/megatron/bridge/training/utils/mlflow_utils.pyRepository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 3177
save_dir parameter is unused and inconsistent with other logger implementations.
The save_dir parameter is documented but never used in the function body, while both wandb_utils.py and mlflow_utils.py actively use it to construct artifact metadata. Either use save_dir for additional Comet ML logging (to maintain API consistency with other backends) or remove it from the signature.
🧰 Tools
🪛 Ruff (0.15.6)
[warning] 46-46: Do not catch blind exception: Exception
(BLE001)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/megatron/bridge/training/utils/comet_utils.py` around lines 21 - 48, The
on_save_checkpoint_success function currently ignores the save_dir parameter
causing inconsistency with wandb_utils.py and mlflow_utils.py; keep the
signature and update the implementation and docstring to log save_dir as
metadata to Comet ML (e.g., call comet_logger.log_other("checkpoint_save_dir",
str(Path(save_dir).resolve())) and also include save_dir when logging the
resolved checkpoint path/iteration), ensure you still resolve checkpoint_path
via Path and wrap logs in the existing try/except, and update the docstring to
state that save_dir is recorded to Comet ML metadata; use the existing symbols
on_save_checkpoint_success, comet_logger.log_other, Path, and the except block
for error handling.
Signed-off-by: Shane Moran <shane.moran@shopify.com> Signed-off-by: Chen Cui <chcui@nvidia.com> Co-authored-by: Shane Moran <shane.moran@shopify.com> Signed-off-by: Li Ding <liding@nvidia.com>
Summary
comet_project,comet_experiment_name,comet_workspace,comet_api_key,comet_tags) inLoggerConfigwithfinalize()validationGlobalState.comet_loggerproperty lazily initializes acomet_ml.Experimenton the last rank, logs full training config as parameters, and supports tagstraining_log()now dispatch to Comet alongside WandB and MLflow; validation metrics ineval.pyalso logged_timers_write_to_comet(no metric name sanitization — Comet supports/natively)comet_utils.pywired intocheckpointing.py_CometExperimentLoggerwrapper for NVIDIA DLFw Inspect MetricLoggerCometPluginadded torun_plugins.pyfor NeMo Run launchingdocs/training/logging.mdFixes: #2652
Based on @shanecmoran's original PR #2653, rebased onto current main with uv.lock conflict resolved.
Test plan
LoggerConfigcomet fields (test_config.py)GlobalState.comet_loggerproperty and_timers_write_to_comet(test_state.py)comet_utils.pycheckpoint callbacks (test_comet_utils.py)comet_projectset and verify metrics appear in Comet dashboard🤖 Generated with Claude Code
Summary by CodeRabbit
New Features
Documentation
Tests