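Training on B200:
The assertion error and the warning below seem to contradict each other: the assertion requires --fp8-recipe delayed when --fp8-param-gather is used together with optimizer CPU offload, while the warning recommends enabling --fp8-param-gather for mxfp8 training.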
AssertionError: When --fp8-param-gather is enabled, the optimizer cpu offload must be used in conjunction with --fp8-recipe delayed.
(MegatronTrainRayActor pid=336553, ip=10.142.0.5) /root/Megatron-LM/megatron/core/optimizer/optimizer_config.py:212: UserWarning: mxfp8 without using reuse_grad_buf_for_mxfp8_param_ag and fp8_param_gatherwill use significant amount additional GPU memory.Setting --reuse-grad-buf-for-mxfp8-param-ag and --fp8-param-gather is recommended for mxfp8 training.
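
To make the conflict concrete, here is a minimal sketch of how the two checks appear to interact. It is reconstructed only from the message text above, not from the Megatron-LM source: the `Fp8Flags` class, the `check_fp8_flags` function, the `optimizer_cpu_offload` field name, and the exact conditions are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Fp8Flags:
    """Hypothetical container; field names mirror the flags named in the messages."""
    fp8_recipe: str                                   # e.g. "delayed" or "mxfp8"
    fp8_param_gather: bool = False
    reuse_grad_buf_for_mxfp8_param_ag: bool = False
    optimizer_cpu_offload: bool = False               # assumed name for the offload setting


def check_fp8_flags(flags: Fp8Flags) -> List[str]:
    """Return warning texts; raise AssertionError on the hard constraint.

    Both conditions are guesses based on the wording of the two messages.
    """
    # Assumed hard constraint: param gather + CPU offload only with the delayed recipe.
    if flags.fp8_param_gather and flags.optimizer_cpu_offload:
        assert flags.fp8_recipe == "delayed", (
            "fp8-param-gather with optimizer cpu offload requires --fp8-recipe delayed"
        )

    found = []
    # Assumed soft recommendation: mxfp8 should set both memory-saving flags.
    if flags.fp8_recipe == "mxfp8" and not (
        flags.reuse_grad_buf_for_mxfp8_param_ag and flags.fp8_param_gather
    ):
        found.append("mxfp8 without both flags uses additional GPU memory")
    return found


# Try mxfp8 + CPU offload with and without param gather.
for gather in (True, False):
    flags = Fp8Flags(
        fp8_recipe="mxfp8",
        fp8_param_gather=gather,
        reuse_grad_buf_for_mxfp8_param_ag=gather,
        optimizer_cpu_offload=True,
    )
    try:
        print(gather, "warnings:", check_fp8_flags(flags))
    except AssertionError as err:
        print(gather, "assertion:", err)
```

If this reading is right, mxfp8 with optimizer CPU offload has no clean configuration: enabling --fp8-param-gather trips the assertion, and leaving it off triggers the memory warning.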