FP8 Blockwise Training Tracker #3290

We want to support DeepSeekV3-style FP8 blockwise training in torchao for both dense and MoE models. 

# Support for dense models (linears)
We can extend the fp8 blockwise training prototype for dense models [here](https://github.com/pytorch/ao/blob/main/torchao/prototype/blockwise_fp8_training/linear.py), which is functionally complete but not yet optimized for performance.

The work has been broken down into the following tasks, which anyone is free to work on:

- [x] Functionality
    - [x] 1x128 quantization for LHS activations, write to row major layout
    - [x] 128x1 quantization for RHS activations, write to col major layout
    - [x] 128x128 quantization for weights, write to col major layout
    - [x] 1x128 @ 128x128 gemm, use for:
         - [x] `output = input @ weight.t()`
         - [x] `dgrad = grad_output @ weight`
    - [x] 1x128 @ 128x1 gemm, use for:
         - [x] `wgrad = grad_output.t() @ input`
    - [x] Autograd function implementing forward and backward
    - [x] DTensor handling for TP support
    - [x] Custom ops around all custom kernels for `torch.compile` composability
    - [x] Tests for FSDP, TP
    - [x] `quantize_` model conversion API performing module swap of nn.Linear to FP8BlockwiseLinear (wraps autograd func)
    - [ ] [P1] fp8 blockwise all-gather for FSDP (would need to ensure weight-shards are divisible by 128x128 blocks, design TBD)
- [ ] Performance
    - [x] all quantization kernels run at 80%+ of peak achievable memory bandwidth on Hopper
         - [x] benchmark scripts for each quantization kernel 
    - [ ] all gemm kernels run at 60%+ of peak achievable TFLOPs/sec on Hopper 
         - [x] benchmark scripts for each gemm
- [ ] Integration into torchtitan
    - [ ] Validate loss convergence virtually identical to bf16 for 3k+ steps on full size Llama3 8b/70b 
    - [ ] Validate e2e throughput (TPS) improvement in same training run as above 
- [ ] Documentation
    - [ ] README
    - [ ] torchao docsite  
- [ ] Migrate out of prototype directory, integrate into `torchao.float8` module 
- [ ] **High level goal and completion criteria**: 
    - [ ] Virtually identical convergence training DSV3 16b/671b on H100s (length of training run depends on infra availability, global batch size and other hyper params - loosely speaking let's run to a validation loss of ~2.7). See [long term training stability section](https://pytorch.org/blog/accelerating-large-scale-training-and-convergence-with-pytorch-float8-rowwise-on-crusoe-2k-h200s/) of this blog for reference.
    - [x] 80%+ of roofline speedup for all linears in the dense FFN (first 3 layers of dsv3). These are big beefy FFNs, so should be achievable. 
        - Calculate the roofline speedup of each linear layer using DSV3 16b and 671b shapes and the training configs from the paper (seq_len=4096, with 1 microbatch propagating through each layer at a time, so the "M" dim of the GEMM will always be 4096)
        - Use these [specs](https://github.com/pytorch/ao/blob/67a78e5814ceac00fa2a82514511b2f728fca3bd/torchao/testing/training/roofline_utils.py#L18) for roofline estimates
        - Use these model [dim and inter_dim](https://github.com/pytorch/torchtitan/blob/ea4989e0dfc15d214e69c9569b358da322087f46/torchtitan/models/deepseek_v3/__init__.py#L135-L136) values for K and N, and also with the two swapped.
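
To make the roofline criterion above concrete, here is a rough sketch of how the forward-GEMM speedup could be estimated. It is not the calculation in `roofline_utils.py`: the function names are hypothetical, the hardware numbers are placeholders to be replaced with the linked specs, and the K/N values are illustrative assumptions to be checked against the linked torchtitan config. It models only `output = input @ weight.t()` and ignores scale tensors and kernel launch overhead.

```python
# Placeholder hardware numbers (assumptions, NOT measured or vendor-confirmed here).
BF16_PEAK_FLOPS = 1.0e15   # hypothetical bf16 peak, FLOPs/sec
FP8_PEAK_FLOPS = 2.0e15    # hypothetical fp8 peak, FLOPs/sec
MEM_BW_BYTES = 3.0e12      # hypothetical memory bandwidth, bytes/sec


def gemm_time_s(m, k, n, peak_flops, mem_bw, in_bytes, out_bytes=2):
    """Roofline time of an (m,k) @ (k,n) GEMM: max of compute time and memory time."""
    flops = 2 * m * k * n
    bytes_moved = in_bytes * (m * k + k * n) + out_bytes * m * n
    return max(flops / peak_flops, bytes_moved / mem_bw)


def quant_time_s(numel, mem_bw):
    """Memory-bound quantization: read bf16 (2 bytes), write fp8 (1 byte) per element."""
    return numel * 3 / mem_bw


def fp8_fwd_speedup(m, k, n, bf16_peak, fp8_peak, mem_bw):
    """Estimated bf16 time divided by fp8 time (GEMM plus quantizing both operands)."""
    bf16 = gemm_time_s(m, k, n, bf16_peak, mem_bw, in_bytes=2)
    fp8 = (
        gemm_time_s(m, k, n, fp8_peak, mem_bw, in_bytes=1)
        + quant_time_s(m * k, mem_bw)   # 1x128 activation quantization
        + quant_time_s(k * n, mem_bw)   # 128x128 weight quantization
    )
    return bf16 / fp8


# M is fixed at 4096 per the training config above; K and N are illustrative.
M, K, N = 4096, 7168, 18432
speedup = fp8_fwd_speedup(M, K, N, BF16_PEAK_FLOPS, FP8_PEAK_FLOPS, MEM_BW_BYTES)
print(f"estimated roofline speedup: {speedup:.2f}x")
```

The "80%+ of roofline speedup" target would then compare a measured speedup against this estimate, ideally extended to the dgrad and wgrad GEMMs as well.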

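As a reference for the three scaling granularities listed under Functionality (1x128, 128x1, 128x128), the sketch below computes one scale per block in plain Python. It is an illustration only, not the prototype's Triton kernels: `block_scales` is a hypothetical helper, the scale convention (`amax / 448`, where 448 is the largest finite float8 e4m3 value) and the `EPS` clamp are assumptions, and the actual cast to fp8 and the memory layout are omitted.

```python
FP8_E4M3_MAX = 448.0  # largest finite value representable in float8 e4m3
EPS = 1e-12           # assumed clamp to avoid a zero scale for all-zero blocks


def block_scales(x, block_rows, block_cols):
    """Return a grid of scales, one per (block_rows x block_cols) tile of x.

    x is a list of equal-length rows whose dims are multiples of the block
    dims. Dividing a tile by its scale maps it into [-448, 448].
    """
    rows, cols = len(x), len(x[0])
    assert rows % block_rows == 0 and cols % block_cols == 0
    scales = []
    for r0 in range(0, rows, block_rows):
        scale_row = []
        for c0 in range(0, cols, block_cols):
            amax = max(
                abs(x[r][c])
                for r in range(r0, r0 + block_rows)
                for c in range(c0, c0 + block_cols)
            )
            scale_row.append(max(amax, EPS) / FP8_E4M3_MAX)
        scales.append(scale_row)
    return scales


# Small stand-in blocks (2 instead of 128) so the result is easy to inspect.
x = [[1.0, -2.0, 3.0, 4.0],
     [0.0, 0.0, 8.0, -16.0]]
row_group_scales = block_scales(x, 1, 2)   # analogous to 1x128
col_group_scales = block_scales(x, 2, 1)   # analogous to 128x1
tile_scales = block_scales(x, 2, 2)        # analogous to 128x128
```

With real shapes, `block_scales(x, 1, 128)` yields an (M, K/128) scale grid for activations and `block_scales(w, 128, 128)` yields an (N/128, K/128) grid for weights.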

# Support for MoE layers (grouped GEMMs)
We can extend the low precision MoE training code [here](https://github.com/pytorch/ao/tree/main/torchao/prototype/moe_training) to support fp8 blockwise by doing the following: 
- [ ] Functionality
    - [ ] Quantization 
        - [ ] 128x128 quantization compatible with 3d expert weights, write to per-expert col major layout (e.g. shape (E,N,K) with strides (N*K,1,N)) 
        - [ ] Per-token group 1x128 scale conversion where group boundaries are along M
        - [ ] Per-token group 128x1 scale conversion where group boundaries are along K/contracting dim
    - [ ] GEMMs 
         - [ ] 1x128 @ 128x128 scaled grouped gemm
             - [ ] `output = input @ weight.transpose(-2,-1)`
             - [ ] `dgrad = grad_output @ weight`
         - [ ] 1x128 @ 128x1 scaled grouped gemm
             - [ ] `wgrad = grad_output.transpose(-2,-1) @ input`
    - [ ] Autograd function implementing forward and backward with dynamic quant on inputs (see [mxfp8 example](https://github.com/pytorch/ao/blob/01374eb58a4f9ba52780efbe0ef4e056e36d338c/torchao/prototype/moe_training/scaled_grouped_mm.py#L284))
    - [ ] DTensor handling for TP support
    - [ ] Custom ops around all custom kernels for `torch.compile` composability
    - [ ] Tests for FSDP, TP
    - [ ] `quantize_` model conversion API performing module swap of nn.Linear to FP8BlockwiseLinear (wraps autograd func)
- [ ] Performance
    - [ ] all quantization kernels run at 80%+ of peak achievable memory bandwidth on Hopper
         - [ ] benchmark scripts for each quantization kernel 
    - [x] all gemm kernels run at 60%+ of peak achievable TFLOPs/sec on Hopper 
         - [x] benchmark scripts for each gemm
- [ ] Integration into torchtitan
    - [ ] Validate loss convergence virtually identical to bf16 for 3k+ steps on full size DeepSeekV3 671b 
    - [ ] Validate e2e throughput (TPS) improvement in same training run as above 
- [ ] Documentation
    - [ ] README
    - [ ] torchao docsite
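
For the per-expert col major layout mentioned in the quantization task (shape (E,N,K) with strides (N*K,1,N)), this small sketch shows how those strides map an index to a flat memory offset. The helper names are hypothetical and exist only for illustration.

```python
def expert_col_major_strides(n, k):
    """Strides for an (E, N, K) tensor where each expert's (N, K) matrix is column major."""
    return (n * k, 1, n)


def flat_offset(index, strides):
    """Flat memory offset of a multi-dimensional index under the given strides."""
    return sum(i * s for i, s in zip(index, strides))


E, N, K = 2, 3, 4
strides = expert_col_major_strides(N, K)

# Stepping along N moves by 1 element (contiguous), stepping along K moves by N,
# and stepping to the next expert moves by a whole N*K matrix.
offsets = sorted(
    flat_offset((e, n, k), strides)
    for e in range(E)
    for n in range(N)
    for k in range(K)
)
print(strides, offsets == list(range(E * N * K)))
```

Every element lands at a unique offset in `[0, E*N*K)`, so the layout is dense, and each expert's matrix occupies its own contiguous N*K slab.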

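The grouped GEMMs listed above can be described with an unquantized pure-Python reference, which may be handy when writing numerics tests. This is a sketch under assumptions: token rows are taken to be sorted by expert, `offsets` holds the cumulative end row of each expert's group along M, and all function names are hypothetical. Scaling and fp8 casting are left out.

```python
def matmul(a, b):
    """Plain (rows x inner) @ (inner x cols) matrix multiply on nested lists."""
    return [
        [sum(a[i][p] * b[p][j] for p in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def transpose(a):
    return [list(col) for col in zip(*a)]


def grouped_mm_fwd(x, weights, offsets):
    """output = input @ weight.transpose(-2,-1), applied per expert group along M.

    x: (M, K) token rows sorted by expert; weights: E matrices of shape (N, K).
    """
    out, start = [], 0
    for w, end in zip(weights, offsets):
        out.extend(matmul(x[start:end], transpose(w)))
        start = end
    return out


def grouped_mm_wgrad(grad_out, x, offsets):
    """wgrad = grad_output.transpose(-2,-1) @ input, contracting over each group's tokens.

    grad_out: (M, N); x: (M, K). Returns E matrices of shape (N, K).
    """
    n, k = len(grad_out[0]), len(x[0])
    grads, start = [], 0
    for end in offsets:
        if start == end:  # expert received no tokens: its gradient is zero
            grads.append([[0] * k for _ in range(n)])
        else:
            grads.append(matmul(transpose(grad_out[start:end]), x[start:end]))
        start = end
    return grads


x = [[1, 2], [3, 4], [5, 6]]
weights = [[[1, 0], [0, 1]], [[2, 0], [0, 2]]]
offsets = [1, 3]  # expert 0 gets row 0, expert 1 gets rows 1-2
out = grouped_mm_fwd(x, weights, offsets)
wgrads = grouped_mm_wgrad([[1, 1], [1, 0], [0, 1]], x, offsets)
```

In the forward and dgrad GEMMs the group boundaries fall along M, so the 1x128 scales never straddle experts. In wgrad the groups split the contracting dim, which is why that case needs its own 128x1 scale handling.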