
Token throughput per MW is described as reflecting generated tokens but is actually processed+generated tokens #293

@asb

Description


The only reference that provides clarity on the exact meaning of the reported results is the linked article introducing InferenceMAX.

The article states "Throughput is the rate at which each GPU can generate tokens (tok/s/gpu)".

In line with this, the performance-per-MW section consistently describes figures in terms of the GPU generating that number of tokens (i.e. output):

  • "...the MI355X is able to generate 2,550,000 token/s per all in provisioned MW"
  • "...H100 can generate 900,000 token/s per MW while a B200 can generate 2.8M token/s per MW"
  • "...GB200 NVL72 delivers an ~8x improvement in token/s generated per all-in provisioned MW"

These all indicate that the throughput results correspond to token generation alone, as opposed to the sum of token generation and prompt processing. But cross-referencing the JSON from a workflow run against the presented results, the figure being displayed appears to be tput_per_gpu (i.e. combined input and output throughput), which conflicts with what is documented. Personally I think updating the graphs to show the documented metric (generated tokens) would be most helpful and intuitive, but fixing the documentation is of course an alternative fix.
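For concreteness, here is a minimal Python sketch of the two possible readings of "token/s per provisioned MW". The field names and numbers are hypothetical, not InferenceMAX's actual schema; the only identifier taken from the real output is tput_per_gpu, which appears to be the combined figure:

```python
# Minimal sketch (not InferenceMAX code) of the two metrics at issue.
# Field names below are hypothetical; the real workflow JSON exposes
# tput_per_gpu, which appears to combine input and output throughput.

def tokens_per_mw(result: dict, gpus_per_mw: float) -> dict:
    """Compute both readings of 'token/s per all-in provisioned MW'."""
    generated = result["output_tok_per_s_per_gpu"]  # decode (generation) only
    processed = result["input_tok_per_s_per_gpu"]   # prefill (prompt processing)
    combined = generated + processed                # analogous to tput_per_gpu
    return {
        # What the article's wording describes:
        "generated_tok_per_s_per_mw": generated * gpus_per_mw,
        # What the graphs appear to actually plot:
        "combined_tok_per_s_per_mw": combined * gpus_per_mw,
    }

# Example with made-up numbers: 300 GPUs per provisioned MW.
print(tokens_per_mw(
    {"output_tok_per_s_per_gpu": 3000.0, "input_tok_per_s_per_gpu": 9000.0},
    gpus_per_mw=300.0,
))
```

With a long prompt and short completion, the two numbers can differ by several multiples, which is why the ambiguity matters when comparing hardware.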

In addition, I think it would be worth adding a glossary of terms with a short description of each metric on the page, both to avoid confusion and to keep the information needed to interpret the graphs close to where they are presented.
