feat: Add native Comet ML experiment tracking by LoganVegnaSHOP · Pull Request #1411 · NVIDIA-NeMo/Automodel

LoganVegnaSHOP · 2026-02-27T22:01:12Z

Summary

- Add CometLogger class in nemo_automodel/components/loggers/comet_utils.py, mirroring the existing MLflowLogger pattern
- Wire Comet logging into TrainFinetuneRecipeForNextTokenPrediction:
  - Training metrics (loss, TPS, grad_norm, lr, mem) via log_train_metrics()
  - Validation metrics (val_loss) via log_val_metrics()
  - MoE load balance metrics via _log_moe_metrics()
- Enable by adding a comet: block to the training YAML config:

comet:
  project_name: "my-project"
  experiment_name: "my-run"      # optional, auto-generated from model name
  workspace: "my-workspace"      # optional, uses COMET_WORKSPACE env var
  api_key: null                  # optional, uses COMET_API_KEY env var
  tags: ["finetune", "llama"]    # optional
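
The shape of such a logger can be sketched as follows. This is a minimal illustration, not the code from this PR: the class name `CometLoggerSketch`, its constructor arguments, and the injected `experiment` object are assumptions made for the example. The only Comet calls it relies on are `log_parameters`, `log_metrics`, and `end`, which exist on `comet_ml.Experiment`.

```python
from typing import Any, Dict, Optional


class CometLoggerSketch:
    """Hypothetical rank-0-only wrapper around a Comet experiment object."""

    def __init__(self, experiment: Optional[Any], rank: int = 0) -> None:
        # Only rank 0 keeps the experiment; other ranks hold None, so every
        # method below becomes a no-op on them.
        self.experiment = experiment if rank == 0 else None

    def log_params(self, params: Dict[str, Any]) -> None:
        if self.experiment is not None:
            self.experiment.log_parameters(params)

    def log_metrics(self, metrics: Dict[str, float], step: Optional[int] = None) -> None:
        if self.experiment is not None:
            self.experiment.log_metrics(metrics, step=step)

    def end(self) -> None:
        # Close the experiment once; later calls are no-ops.
        if self.experiment is not None:
            self.experiment.end()
            self.experiment = None

    def __enter__(self) -> "CometLoggerSketch":
        return self

    def __exit__(self, exc_type, exc, tb) -> None:
        self.end()
```

In a real run, rank 0 would pass in something like `comet_ml.Experiment(project_name="my-project")`. Injecting the experiment keeps the sketch runnable without the SDK installed.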

Motivation

Currently, Comet ML users must rely on Comet's wandb auto-patcher, which does not
reliably intercept wandb.log() calls because NeMo gates all wandb logging behind
`if wandb.run is not None`. That gate requires a valid wandb API key and an active run
even when wandb is only used as a bridge to Comet. Native Comet support (matching the
existing MLflow integration pattern) eliminates this fragile dependency.

Test plan

- Run a fine-tuning job with a comet: config and verify that metrics appear in the Comet dashboard
- Verify training metrics (loss, TPS, grad_norm, lr, mem) are logged at log_remote_every_steps frequency
- Verify validation metrics (val_loss) are logged on each validation step
- Verify MoE load balance metrics are logged (on MoE models)
- Verify that omitting the comet: config has no effect (backward compatible)
- Verify that a missing comet_ml installation raises a clear ImportError
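
The last check in the test plan can be illustrated with a small import guard. This is a sketch of one common pattern, not the PR's actual implementation: the helper name `require_module` and the wording of the error message are hypothetical.

```python
import importlib
from types import ModuleType


def require_module(name: str, install_hint: str) -> ModuleType:
    """Import `name`, or raise ImportError with an actionable message."""
    try:
        return importlib.import_module(name)
    except ImportError as err:
        # Re-raise with a hint so the user knows how to fix the environment.
        raise ImportError(
            f"'{name}' is required for this logger but is not installed. {install_hint}"
        ) from err
```

A logger could then call `require_module("comet_ml", "Install it with `pip install comet_ml`.")` at construction time, so that jobs without a comet: block never trigger the import.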

copy-pr-bot · 2026-02-27T22:01:15Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

hemildesai · 2026-02-28T00:57:26Z

/ok to test b24a082

akoumpa · 2026-03-03T09:50:33Z

/ok to test 1f47ddc

akoumpa · 2026-03-04T19:42:58Z

/ok to test 497adc8

akoumpa · 2026-03-06T23:10:51Z

/ok to test 88f0443

Add CometLogger alongside the existing WandB and MLflow integrations. This logs training metrics (loss, TPS, grad_norm, lr, mem), validation metrics, and MoE load balance metrics directly to Comet via the comet_ml SDK.

Enable by adding a `comet:` section to the training YAML config:

comet:
  project_name: "my-project"
  experiment_name: "my-run"

The API key is read from the COMET_API_KEY environment variable.

Signed-off-by: Logan Vegna <logan.vegna@shopify.com>
Signed-off-by: hemildesai <hemild@nvidia.com>

12 tests covering:
- build_comet factory (config parsing, auto-naming, missing config error)
- log_params delegation to experiment.log_parameters
- log_metrics type conversion (int, float, tensor scalar, tensor vector)
- log_metrics with and without step parameter
- Rank guards (non-rank-0 NO-OPs for all methods)
- Experiment-None guards (NO-OPs when experiment is None)
- end() and context manager lifecycle
- No experiment created on non-rank-0

Signed-off-by: Logan Vegna <logan.vegna@shopify.com>
Signed-off-by: hemildesai <hemild@nvidia.com>
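
The metric type conversion exercised by these tests might look roughly like the sketch below. The function names are hypothetical, and reducing a multi-element tensor to its mean is an assumption: the commit message does not say how vector tensors are handled. The tensor check is duck-typed (`numel`, `item`, `float`, `mean`, all of which exist on `torch.Tensor`) so the example runs without torch installed.

```python
from typing import Any, Dict


def to_float(value: Any) -> float:
    """Convert an int, float, or tensor-like value to a Python float."""
    # Duck-typed tensor check so the sketch does not need to import torch.
    if hasattr(value, "numel") and hasattr(value, "item"):
        if value.numel() == 1:
            return float(value.item())
        # Assumption: multi-element tensors are reduced to their mean.
        return float(value.float().mean().item())
    return float(value)


def to_float_metrics(metrics: Dict[str, Any]) -> Dict[str, float]:
    """Apply to_float to every metric value, keeping the metric names."""
    return {name: to_float(value) for name, value in metrics.items()}
```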

project_name is required and raises ValueError if missing. workspace, api_key, and experiment_name remain optional with sensible defaults (env vars / auto-generated from model name).

Signed-off-by: Logan Vegna <logan.vegna@shopify.com>
Signed-off-by: hemildesai <hemild@nvidia.com>
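
The validation and defaulting rules described in this commit can be sketched as a pure function. The name `resolve_comet_settings` and the returned dictionary layout are hypothetical, and deriving the run name from the last path segment of the model identifier is an assumption: the PR only says the name is auto-generated from the model name.

```python
import os
from typing import Any, Dict, Mapping, Optional


def resolve_comet_settings(
    cfg: Mapping[str, Any], model_name: Optional[str] = None
) -> Dict[str, Any]:
    """Validate a comet: config block and fill in optional fields."""
    project_name = cfg.get("project_name")
    if not project_name:
        raise ValueError("comet.project_name is required")

    experiment_name = cfg.get("experiment_name")
    if not experiment_name and model_name:
        # Assumption: use the last path segment of the model id as the run name.
        experiment_name = model_name.rstrip("/").split("/")[-1]

    return {
        "project_name": project_name,
        "experiment_name": experiment_name,
        # Fall back to the environment variables named in the PR description.
        "workspace": cfg.get("workspace") or os.environ.get("COMET_WORKSPACE"),
        "api_key": cfg.get("api_key") or os.environ.get("COMET_API_KEY"),
        "tags": list(cfg.get("tags") or []),
    }
```

Keeping this step free of SDK calls makes the missing-`project_name` case easy to unit test without a Comet account.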

Apply ruff formatting to long MetricsSample lines in the mlflow tests.

Signed-off-by: Hemil Desai <hemild@nvidia.com>
Signed-off-by: hemildesai <hemild@nvidia.com>

hemildesai · 2026-03-11T22:38:24Z

/ok to test 95d8842

LoganVegnaSHOP requested review from HuiyingLi, ZhiyuLi-Nvidia, adil-a, akoumpa and hemildesai as code owners February 27, 2026 22:01

github-actions Bot added the community-request label Feb 27, 2026

LoganVegnaSHOP marked this pull request as draft February 27, 2026 23:05

LoganVegnaSHOP marked this pull request as ready for review February 27, 2026 23:38

hemildesai reviewed Feb 27, 2026

Comment thread nemo_automodel/components/loggers/comet_utils.py Outdated

copy-pr-bot Bot temporarily deployed to test February 28, 2026 00:57 Inactive

copy-pr-bot Bot temporarily deployed to nemo-ci February 28, 2026 00:57 Inactive

copy-pr-bot Bot temporarily deployed to nemo-ci February 28, 2026 01:16 Inactive

copy-pr-bot Bot had a problem deploying to nemo-ci February 28, 2026 02:59 Failure

LoganVegnaSHOP force-pushed the feat/comet-logger branch from b24a082 to b69710c Compare March 3, 2026 00:11

copy-pr-bot Bot temporarily deployed to nemo-ci March 3, 2026 09:50 Inactive

copy-pr-bot Bot temporarily deployed to test March 3, 2026 09:50 Inactive

copy-pr-bot Bot temporarily deployed to nemo-ci March 4, 2026 19:43 Inactive

copy-pr-bot Bot temporarily deployed to test March 4, 2026 19:43 Inactive

copy-pr-bot Bot temporarily deployed to nemo-ci March 4, 2026 19:47 Inactive

copy-pr-bot Bot temporarily deployed to nemo-ci March 4, 2026 19:55 Inactive

copy-pr-bot Bot temporarily deployed to nemo-ci March 4, 2026 20:14 Inactive

This was referenced Mar 4, 2026

feat: Add first-class Comet ML experiment tracking NVIDIA-NeMo/Megatron-Bridge#2652

Closed

feat: Add first-class Comet ML experiment tracking NVIDIA-NeMo/Megatron-Bridge#2653

Closed

copy-pr-bot Bot temporarily deployed to nemo-ci March 6, 2026 23:11 Inactive

LoganVegnaSHOP and others added 4 commits March 11, 2026 15:37

fix: add missing comet_logger attribute to mlflow logging test fixtures

95d8842

hemildesai approved these changes Mar 13, 2026

feat: Add native Comet ML experiment tracking #1411
hemildesai merged 4 commits into NVIDIA-NeMo:main from LoganVegnaSHOP:feat/comet-logger
