Partitioner stores fp8 copy of all weights between fwd and bwd, causing OOM

### 🐛 Describe the bug

We have [some code](https://github.com/facebookresearch/lingua/blob/main/lingua/float8.py) to convert linear layers to fp8. The weights are still stored in high precision, but an `autograd.Function` converts them to fp8 in the forward and then again in the backward, in two slightly different ways. The `autograd.Function` does _not_ save the fp8 weight for the backward.

However, when we compile our code, the partitioner chooses to fuse the two conversions into a single one and to save one of its results for the backward. Concretely, this means the partitioner stores an additional copy of the entire model in fp8 between forward and backward! This occupies multiple GBs of extra memory and prevents us from training large models.
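As a back-of-the-envelope check of the "multiple GBs" figure: fp8 uses one byte per element, so the extra copy costs one byte per converted weight. The parameter count below is a hypothetical value chosen for illustration, not the size of our model.

```python
def extra_fp8_bytes(num_weight_params: int) -> int:
    """Bytes held by one fp8 copy of the weights (1 byte per element)."""
    bytes_per_fp8_element = 1
    return num_weight_params * bytes_per_fp8_element


# Hypothetical number of linear-layer weights, for illustration only.
params = 7_000_000_000
extra = extra_fp8_bytes(params)
print(f"{extra / 2**30:.1f} GiB")  # prints "6.5 GiB"
```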

I don't have a strong opinion on how this should be fixed. I do _not_ think that the partitioner should be constrained to honor exactly what each `autograd.Function` chooses to keep or drop, but I do believe that it should treat the amount of memory used by eager as an upper bound.

### Versions

Installed from the `pytorch-nightly` conda channel, v2.6.0.dev20241107, build py3.12_cuda12.4_cudnn9.1.0_0.

cc @ezyang @gchanan @zou3519 @kadeng @msaroufim @chauhang @penguinwu

Partitioner stores fp8 copy of all weights between fwd and bwd, causing OOM #141881