Hey y’all! I’m planning on simplifying the Optimizing Inference section to make it more concise. The inference on many GPUs doc is essentially the same as the inference on one GPU doc. For example:
- The Flash Attention section links back to the single GPU doc.
- The BetterTransformer decoder model and encoder model sections are more or less identical.
- The Advanced usage section is also the same.
With so much of the same content being reused, I think combining the GPU inference docs into a single page makes more sense than adding more noise to the docs. We can add a callout at the beginning of the doc saying these optimizations work on single and multi-GPU setups, and clearly document where usage differs between single and multi-GPU setups (such as using max_memory to allocate memory to each GPU).
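As a rough sketch of the multi-GPU difference mentioned above, here is what passing max_memory alongside device_map="auto" could look like. The model checkpoint and memory values are placeholders for illustration, not recommendations:

```python
# Hypothetical sketch: cap how much memory Accelerate's auto device map
# may place on each GPU (and the CPU) when loading a model.
# Keys are GPU indices or "cpu"; values are human-readable size strings.
max_memory = {0: "10GiB", 1: "10GiB", "cpu": "30GiB"}

# With Transformers installed, the call would look something like:
# from transformers import AutoModelForCausalLM
# model = AutoModelForCausalLM.from_pretrained(
#     "bigscience/bloom-560m",      # placeholder checkpoint
#     device_map="auto",
#     max_memory=max_memory,
# )
```

On a single GPU, max_memory can simply be omitted, which is the kind of usage difference the callout could spell out.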
Finally, I’m also considering removing the inference on specialized hardware doc, since it has been empty for a while with no updates, and it seems to be more of an Optimum topic anyway.
Let me know what you think! cc @LysandreJik @MKhalusova @younesbelkada