[docs] Refactor optimizing inference section #26723

@stevhliu

Description

Hey y’all! I’m planning on simplifying the Optimizing Inference section to make it more concise. The inference on many GPUs doc is essentially the same as the inference on one GPU doc. For example:

  1. The Flash Attention section links back to the single GPU doc.
  2. BetterTransformer decoder model and encoder model sections are more or less identical.
  3. The Advanced usage section is also the same.

With so much content duplicated, I think combining the GPU inference docs into a single page makes more sense than adding more noise to the docs. We can add a callout at the beginning of the doc saying these optimizations work on both single- and multi-GPU setups, and clearly document where usage differs between single and multiple GPUs (such as using max_memory to allocate memory to each GPU).
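To illustrate the single- vs multi-GPU difference mentioned above, here is a minimal sketch of how `max_memory` is typically passed alongside `device_map="auto"` when loading a model. The memory limits and checkpoint name below are illustrative assumptions, not taken from this issue:

```python
# Per-device memory caps: one entry per GPU index, plus an optional
# "cpu" budget for offloading. Values here are illustrative.
max_memory = {0: "10GiB", 1: "10GiB", "cpu": "30GiB"}

# Loading with these caps (requires transformers + accelerate and two GPUs,
# so it is shown as a comment; the checkpoint name is a placeholder):
#   from transformers import AutoModelForCausalLM
#   model = AutoModelForCausalLM.from_pretrained(
#       "facebook/opt-1.3b",      # illustrative checkpoint
#       device_map="auto",        # let accelerate place the weights
#       max_memory=max_memory,    # multi-GPU only; omit on a single GPU
#   )
```

On a single GPU the `max_memory` argument is simply omitted, which is exactly the kind of usage difference the proposed callout would document.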

Finally, I’m also considering removing the inference on specialized hardware doc, since it has been empty for a while with no updates, and it also seems to be more of an Optimum topic.

Let me know what you think! cc @LysandreJik @MKhalusova @younesbelkada
