track_moe_metrics() does not forward MoE metrics to Comet ML #2989

@LoganVegnaSHOP

Summary

PR #2910 added excellent Comet ML support to Megatron-Bridge with 18+ metric call sites in training_log() (thank you for that!). However, MoE-specific metrics (load balancing loss, z-loss, etc.) are not forwarded to Comet ML because track_moe_metrics() in Megatron-LM's moe_utils.py only writes to TensorBoard and W&B writers.

Current Behavior

In train_utils.py, training_log() calls track_moe_metrics(), which accepts writer (TensorBoard) and wandb_writer parameters but has no comet_logger parameter. The MoE metrics are computed and reduced correctly, but they are written only to TensorBoard and W&B:

track_moe_metrics(
    loss_scale=moe_loss_scale,
    iteration=iteration,
    writer=writer,
    wandb_writer=wandb_writer,
    total_loss_dict=total_loss_dict,
    ...
)

The comet_logger is available in the same scope but not used.

Expected Behavior

MoE metrics (load_balancing_loss, seq_load_balancing_loss, global_load_balancing_loss, z_loss) should be forwarded to Comet ML alongside TB/W&B, matching how all other metrics in training_log() are dispatched to all three logging backends.
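One possible workaround on the Megatron-Bridge side, sketched below, is to forward the values after track_moe_metrics() returns, without changing Megatron-LM. This is a minimal sketch under assumptions, not the project's implementation: the helper name forward_moe_metrics_to_comet is hypothetical, it assumes the reduced MoE values end up in total_loss_dict under the metric names listed above, and it assumes comet_logger exposes a log_metrics(dict, step=...) method as a Comet ML Experiment does. RecordingLogger is a stand-in so the example runs without Comet ML installed.

```python
from typing import Any, Dict, Iterable, Mapping, Optional

# Metric names taken from this issue; the exact keys used at runtime are an assumption.
MOE_METRIC_NAMES = (
    "load_balancing_loss",
    "seq_load_balancing_loss",
    "global_load_balancing_loss",
    "z_loss",
)


def forward_moe_metrics_to_comet(
    comet_logger: Optional[Any],
    total_loss_dict: Mapping[str, Any],
    iteration: int,
    names: Iterable[str] = MOE_METRIC_NAMES,
) -> Dict[str, float]:
    """Hypothetical helper: log any MoE metrics found in total_loss_dict to Comet."""
    if comet_logger is None:
        return {}
    metrics: Dict[str, float] = {}
    for name in names:
        if name in total_loss_dict:
            value = total_loss_dict[name]
            # Tensors expose .item(); plain Python numbers are converted directly.
            metrics[name] = float(value.item()) if hasattr(value, "item") else float(value)
    if metrics:
        # Assumes a Comet-Experiment-like log_metrics(dict, step=...) method.
        comet_logger.log_metrics(metrics, step=iteration)
    return metrics


class RecordingLogger:
    """Stand-in for the Comet logger that records calls instead of sending them."""

    def __init__(self) -> None:
        self.calls = []

    def log_metrics(self, dic: Mapping[str, float], step: Optional[int] = None) -> None:
        self.calls.append((dict(dic), step))
```

In training_log(), such a helper would be called right after track_moe_metrics() with the comet_logger already in scope. The cleaner long-term fix is probably a comet_logger parameter on track_moe_metrics() itself, but that requires a change in Megatron-LM.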

Environment
