
Presented input/output token throughput per GPU for disaggregated setups not usefully comparable to standard multi-gpu #299

@asb

Description


Describe the bug
The input and output token throughput per GPU figures for disaggregated setups aren't wrong as such, but I don't think they are the right figures to plot alongside the same metric for standard multi-GPU setups. For disaggregated setups, the reported input/output throughput appears to be per prefill GPU and per decode GPU respectively, which is why the total throughput per GPU isn't the sum of the input and output figures. The graph tooltip doesn't provide the context of how many GPUs are dedicated to prefill vs decode. The value that would be appropriate to plot alongside the non-disaggregated setups is the per-GPU input/output throughput obtained by summing the throughput of the respective workers and dividing by the total number of GPUs, i.e. the average across the whole cluster rather than just across the GPUs dedicated to prefill or decode. If I'm trying to compare setups and see which is better (and by how much), it's the per-GPU figure averaged across the whole GPU count that matters.
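To make the proposal concrete, here is a minimal Python sketch of the normalization I have in mind (the function name and parameters are illustrative, not anything from the site's code):

```python
def cluster_avg_throughput(prefill_gpus, decode_gpus,
                           input_tput_per_prefill_gpu,
                           output_tput_per_decode_gpu):
    """Average prefill/decode worker throughput over the whole cluster,
    not just over the GPUs dedicated to each role."""
    total_gpus = prefill_gpus + decode_gpus
    # Total tokens/s handled by each worker pool...
    total_input_tput = input_tput_per_prefill_gpu * prefill_gpus
    total_output_tput = output_tput_per_decode_gpu * decode_gpus
    # ...divided by the full GPU count, so the numbers line up with the
    # per-GPU metrics reported for non-disaggregated setups.
    return (total_input_tput / total_gpus,
            total_output_tput / total_gpus)
```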

To Reproduce
Steps to reproduce the behavior:

  1. Go to inferencemax.semianalysis.com
  2. View results including a disaggregated config, e.g. R1 FP8 1k/1k
  3. Switch the y-axis metric between the different token throughput metrics

Expected behavior
See issue introduction.

Looking specifically at, e.g., the R1 FP8 GB200 NVL72 conc=4096 1k/1k results, we currently get:

  • total per GPU: 3822.27
  • output per GPU: 2867.344
  • input per GPU: 5732.173

I believe the more comparable metric would be:

  • total per GPU: 3822.27
  • output per GPU: 2867.344 * (48/72) = 1911.56
  • input per GPU: 5732.173 * (24/72) = 1910.72

Now, just as for the non-disaggregated setups, the total throughput per GPU is the sum of the output and input per-GPU figures, and it is representative of the average across the whole cluster. Ideally, the tooltip for the input/output per-GPU figures on disaggregated data points would carry a footnote clarifying how they were averaged, such as:

Input token throughput per GPU*: 1911.56

*: Averaged across the whole 72-GPU cluster
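As a quick sanity check, here is a standalone snippet using the figures quoted above; the 24 prefill / 48 decode GPU split is inferred from the 24/72 and 48/72 ratios, not something I've confirmed:

```python
# R1 FP8 GB200 NVL72 conc=4096 1k/1k figures quoted in this issue.
prefill_gpus, decode_gpus = 24, 48   # inferred split; 72 GPUs total
total_gpus = prefill_gpus + decode_gpus

output_per_decode_gpu = 2867.344     # currently reported "output per gpu"
input_per_prefill_gpu = 5732.173     # currently reported "input per gpu"

output_per_gpu = output_per_decode_gpu * decode_gpus / total_gpus  # ~1911.56
input_per_gpu = input_per_prefill_gpu * prefill_gpus / total_gpus  # ~1910.72

# With cluster-wide averaging, output + input again matches the reported
# total per GPU (3822.27), as it does for non-disaggregated setups.
assert abs((output_per_gpu + input_per_gpu) - 3822.27) < 0.05
print(round(output_per_gpu, 2), round(input_per_gpu, 2))
```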

For completeness, let's look at a data point from a non-disaggregated setup, taking the MI325X conc=64 figures as an example:

  • MI325X conc=64
    • total per GPU: 412.874 (= sum of output and input per GPU)
    • output per GPU: 206.483
    • input per GPU: 206.39

Screenshots

(Three screenshots attached in the original issue.)

Desktop (please complete the following information):
N/A
