Static cache + torch.compile: better documentation for prefill static sequence length  #29151

@fxmarty

Feature request

When using torch.compile, the prefill is recompiled for every new sequence length, which is slow. It would be nice to compile only for a fixed set of sequence lengths (say 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, etc.), selected on the fly from the input length, with the input padded up to the selected length.
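
To make the idea concrete, here is a minimal sketch of the bucketing and padding step in plain Python. It is only an illustration, not anything transformers provides: `BUCKETS`, `bucket_length` and `pad_to_bucket` are hypothetical names, the bucket set (powers of two from 1 to 4096) is an assumption, and left padding is assumed because that is the usual choice for decoder-only generation.

```python
from bisect import bisect_left

# Hypothetical bucket sizes (an assumption): powers of two, 1 .. 4096.
BUCKETS = [2 ** i for i in range(13)]


def bucket_length(seq_len, buckets=BUCKETS):
    """Return the smallest bucket that is >= seq_len."""
    if seq_len < 1:
        raise ValueError("seq_len must be at least 1")
    idx = bisect_left(buckets, seq_len)
    if idx == len(buckets):
        raise ValueError(f"seq_len {seq_len} exceeds the largest bucket {buckets[-1]}")
    return buckets[idx]


def pad_to_bucket(input_ids, pad_token_id, buckets=BUCKETS):
    """Left-pad one prompt to its bucket length.

    Returns (padded_ids, attention_mask), where the mask is 0 on padding
    positions and 1 on real tokens.
    """
    target = bucket_length(len(input_ids), buckets)
    n_pad = target - len(input_ids)
    padded_ids = [pad_token_id] * n_pad + list(input_ids)
    attention_mask = [0] * n_pad + [1] * len(input_ids)
    return padded_ids, attention_mask
```

With this scheme a prompt of 5 tokens is padded to length 8, so the compiled prefill would only ever see at most 13 distinct sequence lengths. How the padding interacts with the static cache and position handling is not covered by this sketch.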

Motivation

torch.compile compilation is prohibitively slow, even with #29114.

If people want to use transformers + static cache + torch.compile, it should be FAST to run generate on new sequence lengths.

Your contribution

None for now
