fp8-param-gather for mxfp8

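Training on B200, I hit the assertion and the warning below, and they seem to contradict each other: the assertion only allows `--fp8-param-gather` with optimizer CPU offload when `--fp8-recipe delayed` is used, while the warning recommends `--fp8-param-gather` for mxfp8 training.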


AssertionError: When `--fp8-param-gather` is enabled, the optimizer cpu offload must be used in conjunction with `--fp8-recipe delayed`.


(MegatronTrainRayActor pid=336553, ip=10.142.0.5) /root/Megatron-LM/megatron/core/optimizer/optimizer_config.py:212: UserWarning: mxfp8 without using reuse_grad_buf_for_mxfp8_param_ag and fp8_param_gatherwill use significant amount additional GPU memory.Setting --reuse-grad-buf-for-mxfp8-param-ag and --fp8-param-gather is recommended for mxfp8 training.
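
To make the apparent conflict concrete, here is a minimal sketch that models the two checks using only the wording of the messages above. It is not Megatron-LM code: the `Flags` class and `check` function are hypothetical, and the exact conditions (in particular that the assertion fires only when optimizer CPU offload is enabled) are an assumption read off the message text.

```python
import warnings
from dataclasses import dataclass


@dataclass
class Flags:
    """Hypothetical stand-in for the relevant CLI options (not a Megatron-LM class)."""
    fp8_recipe: str = "mxfp8"
    fp8_param_gather: bool = False
    reuse_grad_buf_for_mxfp8_param_ag: bool = False
    optimizer_cpu_offload: bool = False


def check(flags: Flags) -> None:
    # Assumed condition, paraphrased from the AssertionError text:
    # fp8-param-gather plus optimizer CPU offload requires the delayed recipe.
    if flags.fp8_param_gather and flags.optimizer_cpu_offload:
        assert flags.fp8_recipe == "delayed", (
            "fp8-param-gather with optimizer cpu offload requires --fp8-recipe delayed"
        )
    # Assumed condition, paraphrased from the UserWarning text:
    # mxfp8 without both options set triggers the memory warning.
    if flags.fp8_recipe == "mxfp8" and not (
        flags.fp8_param_gather and flags.reuse_grad_buf_for_mxfp8_param_ag
    ):
        warnings.warn("mxfp8 without fp8_param_gather uses additional GPU memory")


# mxfp8 + CPU offload + the recommended options: the assertion fires.
try:
    check(Flags(fp8_param_gather=True,
                reuse_grad_buf_for_mxfp8_param_ag=True,
                optimizer_cpu_offload=True))
except AssertionError as err:
    print("assertion:", err)

# mxfp8 + CPU offload without fp8-param-gather: only the warning fires.
check(Flags(optimizer_cpu_offload=True))
```

If this reading is right, the two messages only collide when optimizer CPU offload is enabled together with mxfp8: following the warning's recommendation trips the assertion, and avoiding the assertion leaves the warning in place.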

fp8-param-gather for mxfp8 #2582
