[docs] Refactor optimizing inference section #26723

@stevhliu

Description

Hey y’all! I’m planning on simplifying the Optimizing Inference section to make it more concise. The inference on many GPUs doc is essentially the same as the inference on one GPU doc. For example:

  1. The Flash Attention section links back to the single GPU doc.
  2. BetterTransformer decoder model and encoder model sections are more or less identical.
  3. The Advanced usage section is also the same.

With so much content duplicated, I think combining the GPU inference docs into a single page makes more sense than adding more noise to the docs. We can add a callout at the beginning of the doc saying these optimizations work on both single- and multi-GPU setups, and clearly document where usage differs between single and multiple GPUs (such as using max_memory to allocate memory to each GPU).
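To illustrate the single- vs multi-GPU difference mentioned above, here is a minimal sketch of how `max_memory` is typically passed alongside `device_map="auto"` when loading a model. The memory limits and checkpoint name below are illustrative assumptions, not taken from this issue:

```python
# Per-device memory caps: one entry per GPU index, plus an optional
# "cpu" budget for offloading. Values here are illustrative.
max_memory = {0: "10GiB", 1: "10GiB", "cpu": "30GiB"}

# Loading with these caps (requires transformers + accelerate and two GPUs,
# so it is shown as a comment; the checkpoint name is a placeholder):
#   from transformers import AutoModelForCausalLM
#   model = AutoModelForCausalLM.from_pretrained(
#       "facebook/opt-1.3b",      # illustrative checkpoint
#       device_map="auto",        # let accelerate place the weights
#       max_memory=max_memory,    # multi-GPU only; omit on a single GPU
#   )
```

On a single GPU the `max_memory` argument is simply omitted, which is exactly the kind of usage difference the proposed callout would document.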

Finally, I’m also considering removing the inference on specialized hardware doc, since it has been empty for a while with no updates, and it also seems to be more of an Optimum topic.

Let me know what you think! cc @LysandreJik @MKhalusova @younesbelkada
