[training] feat: forward MoE/MTP metrics to MLFlow and Comet #3647
Merged
cuichenx merged 2 commits on May 5, 2026
MCore's `track_moe_metrics` and `track_mtp_metrics` only forward metrics to TensorBoard and W&B. Users wiring up Comet (or MLFlow) never see MoE auxiliary losses (load balancing, sequence aux loss, global aux loss, z_loss) or MTP per-layer losses on those backends; see issue NVIDIA-NeMo#2989.

Per maintainer guidance ("Megatron-Bridge side can monkey-patch track_moe_metrics to avoid a cross-repo dependency"), this change wraps the TB writer with a small SummaryWriter-shaped adapter that fans out every `add_scalar(name, value, iteration)` call to MLFlow and Comet using the same per-step value. W&B is left untouched: the underlying MCore functions still receive `wandb_writer` directly, so their dict-based per-layer logging stays unchanged.

When neither MLFlow nor Comet is configured, the helper returns the real TB writer unchanged, with zero overhead and no behavior change. When at least one of MLFlow / Comet is configured, the wrapper is returned even if TB itself is None. This is intentional: it surfaces MoE / MTP metrics in Comet / MLFlow on rank N-1 even when the user hasn't enabled TensorBoard.

Tensors are sanitized with `.item()` before being handed to MLFlow / Comet (TB tolerates 0-d tensors; MLFlow / Comet do not). Per-layer logging naturally fans out one `add_scalar` call per layer.

Adds 8 unit tests covering: bypass when no fanout targets, wrapping when only Comet or only MLFlow is present, fan-out across all sinks, operation when TB is None, tensor sanitation, plain-scalar passthrough, and per-layer loop fan-out.

Refs issue NVIDIA-NeMo#2989.

Signed-off-by: lonexreb <reach2shubhankar@gmail.com>
Contributor comment: /ok to test 6de6750
gautham-kollu pushed a commit that referenced this pull request on May 12, 2026
Signed-off-by: lonexreb <reach2shubhankar@gmail.com> Co-authored-by: Chen Cui <chcui@nvidia.com>
vasunvidia pushed a commit to vasunvidia/Megatron-Bridge that referenced this pull request on Jun 10, 2026
…NeMo#3647) Signed-off-by: lonexreb <reach2shubhankar@gmail.com> Co-authored-by: Chen Cui <chcui@nvidia.com> Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com>
Summary

MCore's `track_moe_metrics` and `track_mtp_metrics` only forward metrics to TensorBoard and W&B. Users wiring up Comet (or MLFlow) never see MoE auxiliary losses (`load_balancing_loss`, `seq_load_balancing_loss`, `global_load_balancing_loss`, `z_loss`) or MTP per-layer losses on those backends.

Per maintainer guidance on the issue ("Megatron-Bridge side can monkey-patch `track_moe_metrics` to avoid a cross-repo dependency"), wrap the TB writer with a small `SummaryWriter`-shaped adapter that fans out every `add_scalar(name, value, iteration)` call to MLFlow and Comet using the same per-step value TB receives. W&B is unaffected: the underlying MCore functions still receive `wandb_writer` directly.

Refs #2989.
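
The adapter idea summarized above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: the names `FanoutWriterSketch`, `build_metric_writer_sketch`, and `RecordingLogger` are hypothetical, and passing a one-entry dict to `log_metrics(..., step=iteration)` is an assumption about how the loggers are called.

```python
from typing import Any


def _to_python_scalar(value: Any) -> Any:
    """Turn 0-d tensor-like values into plain Python scalars via .item()."""
    return value.item() if hasattr(value, "item") else value


class FanoutWriterSketch:
    """Hypothetical SummaryWriter-shaped adapter: one add_scalar call, several sinks."""

    def __init__(self, tb_writer=None, comet_logger=None, mlflow_logger=None):
        self._tb = tb_writer
        self._comet = comet_logger
        self._mlflow = mlflow_logger

    def add_scalar(self, name: str, value: Any, iteration: int) -> None:
        if self._tb is not None:
            # TensorBoard receives the original value untouched.
            self._tb.add_scalar(name, value, iteration)
        scalar = _to_python_scalar(value)
        # Assumption: both loggers expose log_metrics(dict, step=...).
        if self._mlflow is not None:
            self._mlflow.log_metrics({name: scalar}, step=iteration)
        if self._comet is not None:
            self._comet.log_metrics({name: scalar}, step=iteration)


def build_metric_writer_sketch(tb_writer, comet_logger=None, mlflow_logger=None):
    """Return tb_writer as-is when there is nothing to fan out to (bypass)."""
    if comet_logger is None and mlflow_logger is None:
        return tb_writer
    # Wrap even when tb_writer is None so Comet / MLFlow still get metrics.
    return FanoutWriterSketch(tb_writer, comet_logger, mlflow_logger)


class RecordingLogger:
    """Hypothetical stand-in for a Comet or MLFlow logger; records calls."""

    def __init__(self):
        self.calls = []

    def log_metrics(self, metrics, step=None):
        self.calls.append((metrics, step))


comet = RecordingLogger()
writer = build_metric_writer_sketch(None, comet_logger=comet)  # no TensorBoard
writer.add_scalar("load_balancing_loss", 1.24, 10)
print(comet.calls)
```

Because the factory returns the original writer object in the bypass case, callers that never configure Comet or MLFlow see no wrapper at all.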
Why a wrapper, not `total_loss_dict`

The original report tried reading from `total_loss_dict` for the Comet path. That's wrong because `total_loss_dict` accumulates with `+=` across iterations, so the logged values grow monotonically (the issue records 1.24 → 6.14 instead of the correct 1.20–1.24 per-step range). The wrapper approach captures the exact per-step averaged value that TB receives (`loss_list.sum() / num_moe_layers`) without any further bookkeeping.

Implementation
`src/megatron/bridge/training/utils/train_utils.py`:
- `_MoeMetricFanoutWriter` adapter: `add_scalar(name, value, iteration)` forwards to TB (if any), MLFlow (`log_metrics(..., step=iteration)`), and Comet (`log_metrics(..., step=iteration)`).
- `_build_moe_metric_writer(tb_writer, comet_logger, mlflow_logger)` factory:
  - Returns the original TB writer unchanged when neither MLFlow nor Comet is configured.
  - Returns the wrapper when at least one of them is configured, even if `tb_writer is None`, which is required to surface MoE / MTP metrics in Comet / MLFlow when the user hasn't enabled TensorBoard.
- The resulting writer is passed to the MCore calls: `track_moe_metrics(..., writer=moe_metric_writer, ...)` and `MTPLossLoggingHelper.track_mtp_metrics(..., mtp_metric_writer, ...)`.

Tensor sanitation: 0-d torch tensors are converted to Python scalars with `.item()` before being handed to MLFlow / Comet. TB tolerates 0-d tensors; MLFlow / Comet do not. The TB writer still receives the original value untouched.

Per-layer logging (`per_layer_logging=True`) is naturally handled: each `writer.add_scalar(f"moe/{name}_layer_{i}", ...)` call inside `track_moe_metrics` fans out individually.

Test plan
- `python3 -m ast` parse of changed files
- `ruff check` clean on changed files (auto-fixed import order)
- `ruff format --check` clean on changed files
- `TestMoeMetricFanoutWriter`:
  - `_build_moe_metric_writer` returns the original writer when no Comet / MLFlow → bypass contract
  - `add_scalar` fans out to TB + Comet + MLFlow when all three are present
  - `add_scalar` works when TB writer is None
- Existing `test_moe_logging*` tests continue to pass: those tests do not provide a `comet_logger` / `mlflow_logger`, so the helper returns the original writer unchanged (the existing assertions on `writer.add_scalar` are unaffected).
- With `mlflow_experiment` or `comet_experiment` configured and a MoE model, observe `load_balancing_loss` (and friends) appearing in MLFlow / Comet at the same per-step values as TensorBoard.

Risk
Low.
- `_build_moe_metric_writer` returns the original writer object when no fanout target is configured, which is bit-for-bit equivalent to the previous code path.
- The only added work is `log_metrics(...)` calls on the rank that already has those loggers configured (rank N-1 only).
- W&B is unaffected: `wandb_writer` is still passed directly to MCore.
- No change to `total_loss_dict` semantics.

Notes for reviewers
- An MCore-side change (for example, a `comet_logger` parameter) would be the alternative, but the maintainer explicitly requested a Bridge-side workaround to avoid the cross-repo dependency.
- The adapter intercepts at the `add_scalar(name, value, iteration)` shape rather than monkey-patching `track_moe_metrics`, so future MCore changes that add new metric emissions through the same writer interface get fanned out automatically.
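
To illustrate why intercepting at the writer interface picks up new emitters for free, here is a small sketch using `unittest.mock`. `FanoutSketch` and `emit_per_layer` are hypothetical stand-ins (the real emitters live in MCore); the only thing the emitter knows about is `add_scalar`, and the metric name follows the `moe/{name}_layer_{i}` pattern mentioned above.

```python
from unittest.mock import MagicMock


class FanoutSketch:
    """Hypothetical adapter: forwards each add_scalar call to every sink."""

    def __init__(self, tb_writer, extra_loggers):
        self._tb = tb_writer
        self._extra = extra_loggers

    def add_scalar(self, name, value, iteration):
        if self._tb is not None:
            self._tb.add_scalar(name, value, iteration)
        for logger in self._extra:
            # Assumption: each extra logger exposes log_metrics(dict, step=...).
            logger.log_metrics({name: value}, step=iteration)


def emit_per_layer(writer, losses, iteration):
    """Hypothetical emitter: writes one scalar per layer through add_scalar."""
    for i, loss in enumerate(losses):
        writer.add_scalar(f"moe/load_balancing_loss_layer_{i}", loss, iteration)


tb, comet = MagicMock(), MagicMock()
emit_per_layer(FanoutSketch(tb, [comet]), [1.20, 1.24], iteration=7)

# Each per-layer call reached both sinks without the emitter knowing about Comet.
print(tb.add_scalar.call_count, comet.log_metrics.call_count)
```

The emitter needs no changes to reach a new backend; only the object handed to it differs.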