The only reference that clarifies the exact meaning of the reported results is the linked article introducing InferenceMAX, which states that "Throughput is the rate at which each GPU can generate tokens (tok/s/gpu)".
Consistent with this, the performance-per-MW section of the article frames its figures throughout as tokens generated by the GPU (i.e. output tokens):
- "...the MI355X is able to generate 2,550,000 token/s per all in provisioned MW"
- "...H100 can generate 900,000 token/s per MW while a B200 can generate 2.8M token/s per MW"
- "...GB200 NVL72 delivers an ~8x improvement in token/s generated per all-in provisioned MW"
These all indicate that the reported throughput corresponds to token generation alone, as opposed to the sum of token generation and prompt processing. However, cross-referencing the JSON from a workflow run against the presented results, the figure being displayed appears to be `tput_per_gpu`, i.e. input and output throughput combined. This conflicts with what is documented. Personally I think updating the graphs to show the documented metric (generated tokens) would be the most helpful and intuitive fix, but updating the documentation is of course an alternative.
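For concreteness, here is a minimal sketch of the cross-check I did, assuming the workflow JSON exposes the combined rate as `tput_per_gpu` (per the observation above); the other field names (`output_tput_per_gpu`, `input_tput_per_gpu`) and the file path are hypothetical and may not match the actual schema:

```python
import json

# Load one benchmark result from a workflow run (path is illustrative).
with open("benchmark_result.json") as f:
    result = json.load(f)

# Combined input + output tokens/s/gpu -- this appears to be what the
# graphs currently display.
combined = result["tput_per_gpu"]

# If the JSON exposes an output-only rate, or an input rate we can
# subtract, recover the documented "generated tokens" metric.
# Both field names here are assumptions.
output_only = result.get("output_tput_per_gpu")
if output_only is None and "input_tput_per_gpu" in result:
    output_only = combined - result["input_tput_per_gpu"]

print(f"combined (displayed):      {combined:.0f} tok/s/gpu")
if output_only is not None:
    print(f"output-only (documented):  {output_only:.0f} tok/s/gpu")
```

For long-prompt, short-completion workloads the two numbers diverge substantially, which is why the discrepancy matters for interpreting the graphs.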
In addition, I think it would be worth adding a glossary of terms with a short description of each metric to the page, both to avoid confusion and to keep the information needed to interpret the graphs close to where they are presented.
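For example, something along these lines (wording inferred from the article and the workflow JSON, so treat it as a sketch rather than proposed final text):

- **Throughput (tok/s/gpu)** — the rate at which each GPU generates output tokens.
- **`tput_per_gpu`** — per-GPU throughput as reported in the workflow JSON; per the observation above, this appears to combine input and output tokens.
- **token/s per provisioned MW** — tokens generated per second, normalised by all-in provisioned datacenter power.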