[mxfp8 moe training] wrap 3d quantize tensor in custom ops and integrate it by danielvegamyhre · Pull Request #3004 · pytorch/ao

danielvegamyhre · 2025-09-15T02:19:42Z

Stacked PRs:

[mxfp8 moe training] wrap 3d quantize tensor in custom ops and integrate it

Torchtitan Llama4 e2e training benchmarks

Llama4 debug model

dim=5120 (standard)
num_layers=2, num_experts=2 (to allow for higher seq len and avoid OOM)
FSDP=2
compile=True

Note: there are typically 1-8 experts per device depending on EP degree (for Llama4 and DSV3), so these tests simulate 2 and 8 experts per device. We can run real tests using EP once pytorch/torchtitan#1651 is resolved.
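The relationship between EP degree and experts per device is a simple division. A minimal sketch, assuming experts are sharded evenly across expert-parallel ranks; the function name and the example expert count are illustrative and not taken from this PR:

```python
def experts_per_device(num_experts: int, ep_degree: int) -> int:
    """Experts held by each rank when experts are split evenly across EP ranks."""
    if ep_degree <= 0 or num_experts % ep_degree != 0:
        raise ValueError("num_experts must be a positive multiple of ep_degree")
    return num_experts // ep_degree


# Hypothetical model with 16 experts at two different EP degrees.
print(experts_per_device(16, 8))  # 2 experts per device
print(experts_per_device(16, 2))  # 8 experts per device
```

Since the benchmarks below run without EP, setting `num_experts` to 2 or 8 in the debug config directly gives 2 or 8 experts per device.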

seq_len=8192, experts per device = 2

tl;dr:

mxfp8 dense only: 1.05x over bf16
mxfp8 moe + dense: 1.27x speedup over bf16

Config:

    "debugmodel": TransformerModelArgs(
        dim=5120,
        n_layers=2,
        n_heads=40,
        n_kv_heads=8,
        ffn_dim_multiplier=1.2,
        multiple_of=2048,
        rope_theta=500000,
        max_seq_len=10485760,
        moe_args=MoEArgs(num_experts=2),
        interleave_moe_layer_step=1,
    ),

BF16

rm -rf /tmp/torchinductor_danvm; TORCHTITAN_ROOT=/home/danvm/torchtitan CUDA_VISIBLE_DEVICES="2,3,4,5" NGPU=2 EXTRA_ARGS="--parallelism.data_parallel_shard_degree=2 --parallelism.tensor_parallel_degree=1 --model.print-after-conversion --metrics.log_freq=10 --training.steps=100 --compile.enable --training.seq_len=8192" ./llama4.sh
Median Tokens/Second (excluding step 1): 66828.5
Max Memory Usage: 113.41 GiB
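For readers reproducing the "Median Tokens/Second (excluding step 1)" metric, here is a minimal sketch of how it could be computed from per-step throughput logs. The helper name and the sample values are hypothetical; the benchmark script's actual parsing is not shown in this PR. Step 1 is presumably dropped because it includes compilation and warmup time.

```python
from statistics import median
from typing import Mapping


def median_tps_excluding_first(tps_by_step: Mapping[int, float]) -> float:
    """Median of logged tokens/second values, ignoring the step-1 entry."""
    steady = [tps for step, tps in tps_by_step.items() if step != 1]
    if not steady:
        raise ValueError("no samples left after dropping step 1")
    return median(steady)


# Hypothetical log: step 1 is slow, later steps are steady.
logged = {1: 1200.0, 10: 66000.0, 20: 67000.0, 30: 66500.0, 40: 66900.0}
print(median_tps_excluding_first(logged))  # 66700.0
```

With an even number of remaining samples the median is the mean of the two middle values, which is consistent with reported medians ending in `.5`.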

=========================================================================

MXFP8 DENSE ONLY

rm -rf /tmp/torchinductor_danvm; TORCHTITAN_ROOT=/home/danvm/torchtitan CUDA_VISIBLE_DEVICES="2,3,4,5" NGPU=2 EXTRA_ARGS="--parallelism.data_parallel_shard_degree=2 --parallelism.tensor_parallel_degree=1 --model.converters="mx" --mx.recipe_name="mxfp8_cublas" --mx.filter_fqns="output,router.gate,wk,wv" --model.print-after-conversion --metrics.log_freq=10 --training.steps=100 --compile.enable --training.seq_len=8192" ./llama4.sh 
Median Tokens/Second (excluding step 1): 70277.0
Max Memory Usage: 113.35 GiB

=========================================================================

MXFP8 MOE + DENSE

seq_len=8192 -> total_M=65600

rm -rf /tmp/torchinductor_danvm; TORCHTITAN_ROOT=/home/danvm/torchtitan CUDA_VISIBLE_DEVICES="2,3,4,5" NGPU=2 EXTRA_ARGS="--parallelism.data_parallel_shard_degree=2 --parallelism.tensor_parallel_degree=1 --model.converters="mx" --mx.recipe_name="mxfp8_cublas" --mx.filter_fqns="output,router.gate,wk,wv" --mx.moe_fqns_prototype="experts" --model.print-after-conversion --metrics.log_freq=10 --training.steps=100 --compile.enable --training.seq_len=8192" ./llama4.sh
Median Tokens/Second (excluding step 1): 85169.0
Max Memory Usage: 112.35 GiB

seq_len=2048, experts per device = 8

tl;dr:

mxfp8 dense only: 1.08x over bf16
mxfp8 moe + dense: 1.19x speedup over bf16

Llama4 debug model

dim=5120 (standard)
num_layers=8, num_experts=8 (to avoid OOM)
FSDP=4
compile=True

Config:

    "debugmodel": TransformerModelArgs(
        dim=5120,
        n_layers=8,
        n_heads=40,
        n_kv_heads=8,
        ffn_dim_multiplier=1.2,
        multiple_of=2048,
        rope_theta=500000,
        max_seq_len=10485760,
        moe_args=MoEArgs(num_experts=8),
        interleave_moe_layer_step=1,
    ),

BF16

rm -rf /tmp/torchinductor_danvm; TORCHTITAN_ROOT=/home/danvm/torchtitan CUDA_VISIBLE_DEVICES="2,3,4,5" NGPU=4 EXTRA_ARGS="--parallelism.data_parallel_shard_degree=4 --parallelism.tensor_parallel_degree=1 --model.print-after-conversion --metrics.log_freq=10 --training.steps=100 --compile.enable" ./llama4.sh 

Median Tokens/Second (excluding step 1): 24753.0
Max Memory Usage: 89.77 GiB

======================================================================

MX dense only

rm -rf /tmp/torchinductor_danvm; TORCHTITAN_ROOT=/home/danvm/torchtitan CUDA_VISIBLE_DEVICES="2,3,4,5" NGPU=4 EXTRA_ARGS="--parallelism.data_parallel_shard_degree=4 --parallelism.tensor_parallel_degree=1 --model.converters="mx" --mx.recipe_name="mxfp8_cublas" --mx.filter_fqns="output,router.gate,wk,wv" --model.print-after-conversion --metrics.log_freq=10 --training.steps=100 --compile.enable" ./llama4.sh 

Median Tokens/Second (excluding step 1): 26778.5
Max Memory Usage: 89.47 GiB

Speedup: ~1.08x over bf16

======================================================================

MX MoE + dense

rm -rf /tmp/torchinductor_danvm; TORCHTITAN_ROOT=/home/danvm/torchtitan CUDA_VISIBLE_DEVICES="2,3,4,5" NGPU=4 EXTRA_ARGS="--parallelism.data_parallel_shard_degree=4 --parallelism.tensor_parallel_degree=1 --model.converters="mx" --mx.recipe_name="mxfp8_cublas" --mx.filter_fqns="output,router.gate,wk,wv" --mx.moe_fqns_prototype="experts" --model.print-after-conversion --metrics.log_freq=10 --training.steps=100 --compile.enable" ./llama4.sh 

Median Tokens/Second (excluding step 1): 29373.5
Max Memory Usage: 82.39 GiB
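The tl;dr speedups for both configurations can be checked directly from the median throughput figures reported above. A small sketch; the dictionary layout and function name are just for illustration:

```python
# Median tokens/second copied from the benchmark runs above.
RESULTS = {
    "seq_len=8192, experts per device = 2": {
        "bf16": 66828.5,
        "mxfp8 dense only": 70277.0,
        "mxfp8 moe + dense": 85169.0,
    },
    "seq_len=2048, experts per device = 8": {
        "bf16": 24753.0,
        "mxfp8 dense only": 26778.5,
        "mxfp8 moe + dense": 29373.5,
    },
}


def speedups(runs):
    """Throughput of each non-baseline run relative to bf16, rounded to 2 places."""
    base = runs["bf16"]
    return {name: round(tps / base, 2) for name, tps in runs.items() if name != "bf16"}


for config, runs in RESULTS.items():
    print(config, speedups(runs))
```

This reproduces 1.05x and 1.27x for the first configuration and 1.08x and 1.19x for the second.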

…ate it stack-info: PR: #3004, branch: danielvegamyhre/stack/71

pytorch-bot · 2025-09-15T02:19:45Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/3004

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

[mxfp8 moe training] wrap 3d quantize tensor in custom ops and integr…

a9d2068

…ate it stack-info: PR: #3004, branch: danielvegamyhre/stack/71

danielvegamyhre added a commit that referenced this pull request Sep 15, 2025

[mxfp8 moe training] wrap 3d quantize tensor in custom ops and integr…

785a54a

…ate it stack-info: PR: #3004, branch: danielvegamyhre/stack/71

danielvegamyhre force-pushed the danielvegamyhre/stack/71 branch from 0ebe1b4 to 785a54a September 15, 2025 02:19

danielvegamyhre mentioned this pull request Sep 15, 2025

[mxfp8 moe training] add compile support #2990

Merged

meta-cla Bot added the CLA Signed label Sep 15, 2025

danielvegamyhre added mx moe module: not user facing labels Sep 15, 2025

danielvegamyhre changed the base branch from danielvegamyhre/stack/69 to main September 15, 2025 20:38

danielvegamyhre added a commit that referenced this pull request Sep 15, 2025

[mxfp8 moe training] wrap 3d quantize tensor in custom ops and integr…

25ddd79

…ate it stack-info: PR: #3004, branch: danielvegamyhre/stack/71

danielvegamyhre force-pushed the danielvegamyhre/stack/71 branch from 785a54a to 25ddd79 September 15, 2025 20:38

danielvegamyhre changed the base branch from main to danielvegamyhre/stack/69 September 15, 2025 20:38

danielvegamyhre changed the base branch from danielvegamyhre/stack/69 to main September 15, 2025 21:02

danielvegamyhre force-pushed the danielvegamyhre/stack/71 branch from 25ddd79 to 2c2f7ff September 15, 2025 21:02

danielvegamyhre added a commit that referenced this pull request Sep 15, 2025

[mxfp8 moe training] wrap 3d quantize tensor in custom ops and integr…

2c2f7ff

…ate it stack-info: PR: #3004, branch: danielvegamyhre/stack/71

danielvegamyhre changed the base branch from main to danielvegamyhre/stack/69 September 15, 2025 21:02

danielvegamyhre changed the base branch from danielvegamyhre/stack/69 to main September 15, 2025 21:20

danielvegamyhre added a commit that referenced this pull request Sep 15, 2025

[mxfp8 moe training] wrap 3d quantize tensor in custom ops and integr…

dc26161

…ate it stack-info: PR: #3004, branch: danielvegamyhre/stack/71

danielvegamyhre force-pushed the danielvegamyhre/stack/71 branch from 2c2f7ff to dc26161 September 15, 2025 21:20

danielvegamyhre changed the base branch from main to danielvegamyhre/stack/69 September 15, 2025 21:21

danielvegamyhre changed the base branch from danielvegamyhre/stack/69 to main September 16, 2025 02:51

danielvegamyhre force-pushed the danielvegamyhre/stack/71 branch from dc26161 to e597e40 September 16, 2025 02:51

danielvegamyhre changed the base branch from main to danielvegamyhre/stack/69 September 16, 2025 02:51

danielvegamyhre mentioned this pull request Sep 16, 2025

[mxfp8 moe training] fix kernel test for per group blocked format conversion #3008

Closed

danielvegamyhre changed the base branch from danielvegamyhre/stack/69 to main September 16, 2025 02:59

danielvegamyhre added a commit that referenced this pull request Sep 16, 2025

[mxfp8 moe training] wrap 3d quantize tensor in custom ops and integr…

893167c

…ate it stack-info: PR: #3004, branch: danielvegamyhre/stack/71

danielvegamyhre changed the base branch from main to danielvegamyhre/stack/69 September 17, 2025 15:48

danielvegamyhre force-pushed the danielvegamyhre/stack/69 branch from 6403a25 to a60ee11 September 17, 2025 16:14

danielvegamyhre added a commit that referenced this pull request Sep 17, 2025

[mxfp8 moe training] wrap 3d quantize tensor in custom ops and integr…

2e9feb5

…ate it stack-info: PR: #3004, branch: danielvegamyhre/stack/71

danielvegamyhre force-pushed the danielvegamyhre/stack/71 branch from 92144cb to 2e9feb5 September 17, 2025 16:14

danielvegamyhre changed the base branch from danielvegamyhre/stack/69 to main September 17, 2025 16:19

danielvegamyhre added a commit that referenced this pull request Sep 17, 2025

[mxfp8 moe training] wrap 3d quantize tensor in custom ops and integr…

830da32

…ate it stack-info: PR: #3004, branch: danielvegamyhre/stack/71

danielvegamyhre force-pushed the danielvegamyhre/stack/71 branch from 2e9feb5 to 830da32 September 17, 2025 16:19

danielvegamyhre changed the base branch from main to danielvegamyhre/stack/69 September 17, 2025 16:19

danielvegamyhre force-pushed the danielvegamyhre/stack/69 branch from 6593572 to 367b67c September 17, 2025 16:24

danielvegamyhre added a commit that referenced this pull request Sep 17, 2025

[mxfp8 moe training] wrap 3d quantize tensor in custom ops and integr…

57ce18f

…ate it stack-info: PR: #3004, branch: danielvegamyhre/stack/71

danielvegamyhre force-pushed the danielvegamyhre/stack/71 branch from 830da32 to 57ce18f September 17, 2025 16:24

danielvegamyhre force-pushed the danielvegamyhre/stack/69 branch from 367b67c to bb8b07f September 17, 2025 16:25

danielvegamyhre added a commit that referenced this pull request Sep 17, 2025

[mxfp8 moe training] wrap 3d quantize tensor in custom ops and integr…

61833ea

…ate it stack-info: PR: #3004, branch: danielvegamyhre/stack/71

danielvegamyhre force-pushed the danielvegamyhre/stack/71 branch from 57ce18f to 61833ea September 17, 2025 16:25

xmfan mentioned this pull request Sep 18, 2025

[inductor] as_strided lowering throws away .view(dtype) pytorch/pytorch#163286

Closed

danielvegamyhre changed the base branch from danielvegamyhre/stack/69 to main September 19, 2025 03:35

danielvegamyhre force-pushed the danielvegamyhre/stack/71 branch from 61833ea to a68b868 September 19, 2025 03:35

danielvegamyhre changed the base branch from main to danielvegamyhre/stack/69 September 19, 2025 03:35

This was referenced Sep 19, 2025

[mxfp8 moe training] remove mxfp8_gemms.py #3033

Merged

[mxfp8 moe training] update 3d quant colwise scaling kernel to use single input/output TMA descriptors #3034

Merged

danielvegamyhre changed the base branch from danielvegamyhre/stack/69 to main September 19, 2025 03:54

danielvegamyhre changed the base branch from main to danielvegamyhre/stack/69 September 19, 2025 03:55

danielvegamyhre changed the base branch from danielvegamyhre/stack/69 to main September 19, 2025 18:00

danielvegamyhre changed the base branch from main to danielvegamyhre/stack/69 September 19, 2025 18:00

danielvegamyhre changed the base branch from danielvegamyhre/stack/69 to main September 19, 2025 18:10

danielvegamyhre changed the base branch from main to danielvegamyhre/stack/69 September 19, 2025 18:10

danielvegamyhre changed the base branch from danielvegamyhre/stack/69 to main September 19, 2025 18:28

danielvegamyhre changed the base branch from main to danielvegamyhre/stack/69 September 19, 2025 18:29

danielvegamyhre changed the base branch from danielvegamyhre/stack/69 to main September 19, 2025 18:36

drisspg approved these changes Sep 19, 2025

danielvegamyhre merged 1 commit into main from danielvegamyhre/stack/71
