feat: Add native Comet ML experiment tracking #1411
Merged
hemildesai reviewed Feb 27, 2026
/ok to test b24a082
b24a082 to b69710c
/ok to test 1f47ddc
/ok to test 497adc8
This was referenced Mar 4, 2026
/ok to test 88f0443
Add CometLogger alongside the existing WandB and MLflow integrations.
This logs training metrics (loss, TPS, grad_norm, lr, mem),
validation metrics, and MoE load balance metrics directly to Comet
via the comet_ml SDK.
Enable by adding a `comet:` section to the training YAML config:
    comet:
      project_name: "my-project"
      experiment_name: "my-run"
The API key is read from the COMET_API_KEY environment variable.
Signed-off-by: Logan Vegna <logan.vegna@shopify.com>
Signed-off-by: hemildesai <hemild@nvidia.com>
12 tests covering:
- build_comet factory (config parsing, auto-naming, missing config error)
- log_params delegation to experiment.log_parameters
- log_metrics type conversion (int, float, tensor scalar, tensor vector)
- log_metrics with and without step parameter
- Rank guards (non-rank-0 NO-OPs for all methods)
- Experiment-None guards (NO-OPs when experiment is None)
- end() and context manager lifecycle
- No experiment created on non-rank-0
Signed-off-by: Logan Vegna <logan.vegna@shopify.com>
Signed-off-by: hemildesai <hemild@nvidia.com>
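The guard and conversion behaviors listed in these tests can be sketched as follows. This is an illustrative sketch, not the PR's `CometLogger`: the class name `RankZeroCometLogger`, the helper `to_scalar`, and the choice to reduce vector tensors by their mean are all assumptions. The `experiment` argument is expected to expose `log_parameters`, `log_metrics`, and `end`, as a `comet_ml.Experiment` does.

```python
from typing import Any, Dict, Optional


def to_scalar(value: Any) -> float:
    """Convert an int, float, or tensor-like value to a plain float.

    Assumption: tensor-like values expose flatten() and tolist(), as
    torch.Tensor does. Vectors are reduced by their mean here; the real
    implementation may handle vectors differently.
    """
    if isinstance(value, (int, float)):
        return float(value)
    values = value.flatten().tolist()
    return float(sum(values) / len(values))


class RankZeroCometLogger:
    """Hypothetical logger that only talks to Comet on rank 0."""

    def __init__(self, experiment: Optional[Any], rank: int = 0) -> None:
        self.rank = rank
        # Non-zero ranks never hold an experiment.
        self.experiment = experiment if rank == 0 else None

    def _active(self) -> bool:
        return self.rank == 0 and self.experiment is not None

    def log_params(self, params: Dict[str, Any]) -> None:
        if self._active():
            self.experiment.log_parameters(params)

    def log_metrics(self, metrics: Dict[str, Any], step: Optional[int] = None) -> None:
        if not self._active():
            return  # no-op on other ranks or when no experiment exists
        scalars = {name: to_scalar(value) for name, value in metrics.items()}
        self.experiment.log_metrics(scalars, step=step)

    def end(self) -> None:
        if self._active():
            self.experiment.end()
            self.experiment = None

    def __enter__(self) -> "RankZeroCometLogger":
        return self

    def __exit__(self, exc_type, exc, tb) -> None:
        self.end()
```

Because every public method checks `_active()` first, callers in a distributed job can invoke the logger unconditionally on all ranks.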
project_name is required and raises ValueError if missing. workspace, api_key, and experiment_name remain optional with sensible defaults (env vars / auto-generated from model name).
Signed-off-by: Logan Vegna <logan.vegna@shopify.com>
Signed-off-by: hemildesai <hemild@nvidia.com>
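A minimal sketch of the validation rule described in this commit, under stated assumptions. The function name `resolve_comet_settings`, the `COMET_WORKSPACE` fallback, the exact auto-naming format, and the use of `ValueError` for a missing `comet:` section are illustrative guesses. Only the `ValueError` on a missing `project_name` and the `COMET_API_KEY` variable come from the PR text.

```python
import os
from typing import Any, Dict, Mapping, Optional


def resolve_comet_settings(
    comet_cfg: Optional[Mapping[str, Any]],
    model_name: str,
    environ: Optional[Mapping[str, str]] = None,
) -> Dict[str, Any]:
    """Validate a `comet:` config section and fill in optional defaults."""
    env = os.environ if environ is None else environ
    if comet_cfg is None:
        # Assumption: the PR only says a missing config is an error.
        raise ValueError("no `comet:` section found in the training config")

    project_name = comet_cfg.get("project_name")
    if not project_name:
        raise ValueError("comet.project_name is required")

    # Assumed naming scheme: derive a run name from the model name.
    auto_name = model_name.replace("/", "_")
    return {
        "project_name": project_name,
        "workspace": comet_cfg.get("workspace") or env.get("COMET_WORKSPACE"),
        "api_key": comet_cfg.get("api_key") or env.get("COMET_API_KEY"),
        "experiment_name": comet_cfg.get("experiment_name") or auto_name,
    }
```

Passing `environ` explicitly keeps the function easy to test without touching the real process environment.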
Apply ruff formatting to long MetricsSample lines in the mlflow tests.
Signed-off-by: Hemil Desai <hemild@nvidia.com>
Signed-off-by: hemildesai <hemild@nvidia.com>
/ok to test 95d8842
hemildesai approved these changes Mar 13, 2026
Summary
- Add `CometLogger` class in `nemo_automodel/components/loggers/comet_utils.py`, mirroring the existing `MLflowLogger` pattern
- Integrate it into `TrainFinetuneRecipeForNextTokenPrediction`: `log_train_metrics()`, `log_val_metrics()`, `_log_moe_metrics()`
- Enable by adding a `comet:` block to the training YAML config
Motivation
Currently, Comet ML users must rely on Comet's wandb auto-patcher, which does not reliably intercept `wandb.log()` calls because NeMo gates all wandb logging behind `if wandb.run is not None`, requiring a valid wandb API key and active run even when wandb is only used as a bridge to Comet. Native Comet support (matching the existing MLflow integration pattern) eliminates this fragile dependency.
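Test plan
- Run training with a `comet:` config and verify metrics appear in Comet dashboard
- Verify metrics are logged at the `log_remote_every_steps` frequency
- Verify that omitting the `comet:` config has no effect (backward compatible)
- Verify that running with `comet_ml` not installed raises clear ImportError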
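The last item, a clear ImportError when `comet_ml` is missing, could be implemented with a guarded import along these lines. This is only a sketch: the helper name `require_module` and the error wording are hypothetical, and the PR may structure its import check differently.

```python
import importlib
from types import ModuleType


def require_module(name: str, install_hint: str) -> ModuleType:
    """Import an optional dependency or fail with an actionable message."""
    try:
        return importlib.import_module(name)
    except ImportError as exc:
        # Re-raise with guidance while keeping the original error chained.
        raise ImportError(
            f"`{name}` is required for this logger but is not installed. "
            f"Install it with: {install_hint}"
        ) from exc


def load_comet_sdk() -> ModuleType:
    # Deferred import: only runs when a `comet:` config is actually used,
    # so runs without Comet are unaffected by the missing package.
    return require_module("comet_ml", "pip install comet_ml")
```

Deferring the import to the point of use is what keeps runs without a `comet:` config backward compatible.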