feat: Add native Comet ML experiment tracking #1411

Merged: hemildesai merged 4 commits into NVIDIA-NeMo:main from LoganVegnaSHOP:feat/comet-logger on Mar 13, 2026
@LoganVegnaSHOP

Contributor

Summary

  • Add CometLogger class in nemo_automodel/components/loggers/comet_utils.py, mirroring the existing MLflowLogger pattern
  • Wire Comet logging into TrainFinetuneRecipeForNextTokenPrediction:
    • Training metrics (loss, TPS, grad_norm, lr, mem) via log_train_metrics()
    • Validation metrics (val_loss) via log_val_metrics()
    • MoE load balance metrics via _log_moe_metrics()
  • Enable by adding a comet: block to the training YAML config:
comet:
  project_name: "my-project"
  experiment_name: "my-run"      # optional; auto-generated from the model name if omitted
  workspace: "my-workspace"      # optional; falls back to the COMET_WORKSPACE env var
  api_key: null                  # optional; falls back to the COMET_API_KEY env var
  tags: ["finetune", "llama"]    # optional
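
As a rough illustration of how a comet: block like the one above could be resolved into concrete settings, here is a minimal sketch. It is not the code from this PR: the helper name resolve_comet_settings and the auto-naming scheme (replacing "/" in the model name) are assumptions made for the example. Only the required project_name, the optional keys, and the two environment variables come from the description.

```python
import os
from typing import Any, Dict, Optional


def resolve_comet_settings(
    comet_cfg: Optional[Dict[str, Any]], model_name: str
) -> Dict[str, Any]:
    """Hypothetical helper: turn a `comet:` config block into concrete settings."""
    if not comet_cfg:
        raise ValueError("No `comet:` block found in the training config")

    project_name = comet_cfg.get("project_name")
    if not project_name:
        # project_name is the only required key.
        raise ValueError("comet.project_name is required")

    # Assumed naming scheme for the example; the PR's actual scheme may differ.
    default_name = model_name.replace("/", "_")

    return {
        "project_name": project_name,
        "experiment_name": comet_cfg.get("experiment_name") or default_name,
        # Explicit config values win; otherwise fall back to the environment.
        "workspace": comet_cfg.get("workspace") or os.environ.get("COMET_WORKSPACE"),
        "api_key": comet_cfg.get("api_key") or os.environ.get("COMET_API_KEY"),
        "tags": list(comet_cfg.get("tags") or []),
    }
```

With api_key: null in the YAML, the sketch reads the key from COMET_API_KEY, so the secret can stay out of the config file.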

Motivation

Currently, Comet ML users must rely on Comet's wandb auto-patcher. The patcher does
not reliably intercept wandb.log() calls because NeMo gates all wandb logging behind
if wandb.run is not None, which requires a valid wandb API key and an active run even
when wandb serves only as a bridge to Comet. Native Comet support (matching the
existing MLflow integration pattern) eliminates this fragile dependency.
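
To make the gating problem concrete, the following sketch uses a stand-in object instead of the real wandb module (fake_wandb and gated_log are hypothetical names for this example). It shows that when no run is active, the guarded call never reaches log(), so anything patched onto log() has nothing to intercept.

```python
from types import SimpleNamespace

# Records every call that actually reaches the (possibly patched) log function.
calls = []

# Stand-in for the wandb module: `run` is None until a run has been started.
fake_wandb = SimpleNamespace(
    run=None,
    log=lambda metrics, step=None: calls.append((metrics, step)),
)


def gated_log(wandb_module, metrics, step):
    """Mimics the guard pattern quoted above: log only when a run is active."""
    if wandb_module.run is not None:
        wandb_module.log(metrics, step=step)


# No active run: the guard short-circuits and log() is never called.
gated_log(fake_wandb, {"loss": 1.0}, step=1)

# With an active run, the call goes through.
fake_wandb.run = object()
gated_log(fake_wandb, {"loss": 0.9}, step=2)
```

Only the second call ends up in calls, which is why a bridge built on patching wandb.log() still needs a real wandb run.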

Test plan

  • Run a fine-tuning job with comet: config and verify metrics appear in Comet dashboard
  • Verify training metrics (loss, TPS, grad_norm, lr, mem) are logged at log_remote_every_steps frequency
  • Verify validation metrics (val_loss) are logged on each validation step
  • Verify MoE load balance metrics are logged (on MoE models)
  • Verify that omitting comet: config has no effect (backward compatible)
  • Verify that a missing comet_ml installation raises a clear ImportError
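
The last item in the test plan can be pictured with a small lazy-import helper. This is a sketch under the assumption that the optional dependency is imported only when Comet logging is requested. The helper name import_optional is hypothetical and is not taken from the PR.

```python
import importlib
from types import ModuleType


def import_optional(module_name: str, install_hint: str) -> ModuleType:
    """Import an optional dependency, raising an actionable error if it is missing."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        # Re-raise with a message that tells the user what to do next.
        raise ImportError(
            f"`{module_name}` is required for this logger but is not installed. "
            f"{install_hint}"
        ) from exc
```

A caller would invoke it as import_optional("comet_ml", "Install it with `pip install comet_ml`.") at the point where the logger is built, so users who omit the comet: block never trigger the import.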

@copy-pr-bot

copy-pr-bot Bot commented Feb 27, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@LoganVegnaSHOP LoganVegnaSHOP marked this pull request as draft February 27, 2026 23:05
@LoganVegnaSHOP LoganVegnaSHOP marked this pull request as ready for review February 27, 2026 23:38
Comment thread nemo_automodel/components/loggers/comet_utils.py Outdated
@hemildesai

Contributor

/ok to test b24a082

@akoumpa

akoumpa commented Mar 3, 2026

Contributor

/ok to test 1f47ddc

@akoumpa

akoumpa commented Mar 4, 2026

Contributor

/ok to test 497adc8

@akoumpa

akoumpa commented Mar 6, 2026

Contributor

/ok to test 88f0443

LoganVegnaSHOP and others added 4 commits March 11, 2026 15:37
Add CometLogger alongside the existing WandB and MLflow integrations.
This logs training metrics (loss, TPS, grad_norm, lr, mem),
validation metrics, and MoE load balance metrics directly to Comet
via the comet_ml SDK.

Enable by adding a `comet:` section to the training YAML config:

  comet:
    project_name: "my-project"
    experiment_name: "my-run"

The API key is read from the COMET_API_KEY environment variable.

Signed-off-by: Logan Vegna <logan.vegna@shopify.com>
Signed-off-by: hemildesai <hemild@nvidia.com>

12 tests covering:
- build_comet factory (config parsing, auto-naming, missing config error)
- log_params delegation to experiment.log_parameters
- log_metrics type conversion (int, float, tensor scalar, tensor vector)
- log_metrics with and without step parameter
- Rank guards (non-rank-0 no-ops for all methods)
- Experiment-None guards (no-ops when experiment is None)
- end() and context manager lifecycle
- No experiment created on non-rank-0

Signed-off-by: Logan Vegna <logan.vegna@shopify.com>
Signed-off-by: hemildesai <hemild@nvidia.com>

project_name is required and raises ValueError if missing.
workspace, api_key, and experiment_name remain optional with
sensible defaults (env vars / auto-generated from model name).

Signed-off-by: Logan Vegna <logan.vegna@shopify.com>
Signed-off-by: hemildesai <hemild@nvidia.com>

Apply ruff formatting to long MetricsSample lines in the mlflow tests.

Signed-off-by: Hemil Desai <hemild@nvidia.com>
Signed-off-by: hemildesai <hemild@nvidia.com>
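
The behavior listed in the commit messages above (type conversion, rank guards, experiment-None guards, and the end()/context-manager lifecycle) can be sketched roughly as follows. This is an illustrative stand-in and not the PR's CometLogger: the class name RankGuardedLogger, the helper to_loggable, and the choice to turn multi-element tensors into lists are assumptions. The experiment object is expected to provide log_parameters, log_metrics, and end, as the comet_ml Experiment does.

```python
from typing import Any, Dict, Optional


def to_loggable(value: Any) -> Any:
    """Convert a metric value to plain Python data where possible."""
    if isinstance(value, (int, float)):
        return value
    # Tensor-like objects (e.g. torch.Tensor) expose numel(), item(), and tolist().
    if hasattr(value, "numel") and hasattr(value, "item"):
        # Assumption: scalars become floats, vectors become lists.
        return float(value.item()) if value.numel() == 1 else value.tolist()
    return value


class RankGuardedLogger:
    """Illustrative logger: only rank 0 with a live experiment logs anything."""

    def __init__(self, experiment: Optional[Any], rank: int) -> None:
        self.rank = rank
        # Non-zero ranks never hold an experiment handle.
        self.experiment = experiment if rank == 0 else None

    def _inactive(self) -> bool:
        return self.rank != 0 or self.experiment is None

    def log_params(self, params: Dict[str, Any]) -> None:
        if self._inactive():
            return
        self.experiment.log_parameters(params)

    def log_metrics(self, metrics: Dict[str, Any], step: Optional[int] = None) -> None:
        if self._inactive():
            return
        clean = {name: to_loggable(value) for name, value in metrics.items()}
        self.experiment.log_metrics(clean, step=step)

    def end(self) -> None:
        if self._inactive():
            return
        self.experiment.end()

    def __enter__(self) -> "RankGuardedLogger":
        return self

    def __exit__(self, *exc_info: Any) -> None:
        self.end()
```

Keeping both guards in one place means every public method becomes a no-op on non-zero ranks without repeating the check.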
@hemildesai

Contributor

/ok to test 95d8842


3 participants