fp8-param-gather for mxfp8 #2582

@bhaktatejas922

Training on B200:
This error and this warning seem to contradict each other:

AssertionError: When --fp8-param-gather is enabled, the optimizer cpu offload must be used in conjunction with --fp8-recipe delayed.

(MegatronTrainRayActor pid=336553, ip=10.142.0.5) /root/Megatron-LM/megatron/core/optimizer/optimizer_config.py:212: UserWarning: mxfp8 without using reuse_grad_buf_for_mxfp8_param_ag and fp8_param_gatherwill use significant amount additional GPU memory.Setting --reuse-grad-buf-for-mxfp8-param-ag and --fp8-param-gather is recommended for mxfp8 training.
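
To make the apparent conflict concrete, here is a minimal Python sketch of one literal reading of the two messages above. It is not Megatron-LM's actual validation code: `HypotheticalConfig`, `check`, and the `optimizer_cpu_offload` field name are assumptions made for illustration, and the conditions are inferred only from the message text.

```python
from dataclasses import dataclass


@dataclass
class HypotheticalConfig:
    # Illustrative only: field names mirror the flags quoted in the messages.
    fp8_recipe: str = "mxfp8"
    fp8_param_gather: bool = False
    reuse_grad_buf_for_mxfp8_param_ag: bool = False
    optimizer_cpu_offload: bool = False  # assumed name for "optimizer cpu offload"


def check(cfg):
    """Return (error, warning) under a literal reading of the two messages."""
    error = None
    warning = None
    # Assumed reading of the AssertionError: fp8_param_gather combined with
    # optimizer CPU offload is only accepted with the "delayed" recipe.
    if cfg.fp8_param_gather and cfg.optimizer_cpu_offload and cfg.fp8_recipe != "delayed":
        error = "fp8_param_gather + optimizer cpu offload requires fp8_recipe=delayed"
    # Assumed reading of the UserWarning: mxfp8 without both flags set
    # is flagged as using significant additional GPU memory.
    if cfg.fp8_recipe == "mxfp8" and not (
        cfg.reuse_grad_buf_for_mxfp8_param_ag and cfg.fp8_param_gather
    ):
        warning = "mxfp8 without both flags uses additional GPU memory"
    return error, warning


# mxfp8 with CPU offload: following the warning's advice triggers the assertion.
print(check(HypotheticalConfig(fp8_param_gather=True,
                               reuse_grad_buf_for_mxfp8_param_ag=True,
                               optimizer_cpu_offload=True)))
# mxfp8 with CPU offload: dropping fp8_param_gather brings the warning back.
print(check(HypotheticalConfig(optimizer_cpu_offload=True)))
```

Under this reading, the two messages only collide when optimizer CPU offload is enabled together with the mxfp8 recipe; without offload, setting both recommended flags would satisfy both checks. Whether that matches the real code path would need to be confirmed against `optimizer_config.py` and the argument validation.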
