Describe the bug
The input/output token throughput per GPU reported for disaggregated setups isn't wrong as such, but I don't think it's the right figure to plot alongside the same metric for standard multi-GPU setups. For disaggregated setups, the reported input and output throughput appear to be per prefill GPU and per decode GPU respectively, which is why the total throughput per GPU isn't the sum of the output and input figures. The graph tooltip also doesn't say how many GPUs are dedicated to prefill vs. decode.

The value that would be appropriate to plot alongside the non-disaggregated setups is the per-GPU output/input throughput obtained by summing the throughput of the respective workers and dividing by the total GPU count, i.e. the output/input throughput per GPU averaged across the whole cluster rather than only over the GPUs dedicated to prefill or decode. If I'm trying to compare which setup is better (and by how much), it's the per-GPU figure averaged over the whole GPU count that matters.
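To make that concrete, here is a minimal sketch of the calculation I have in mind. The function name is hypothetical, and it assumes (per the behaviour described above) that the currently reported figures are per decode GPU for output and per prefill GPU for input:

```python
def cluster_avg_per_gpu(output_per_decode_gpu: float,
                        input_per_prefill_gpu: float,
                        n_decode_gpus: int,
                        n_prefill_gpus: int) -> dict:
    """Re-average disaggregated per-GPU throughput over the whole cluster."""
    total_gpus = n_decode_gpus + n_prefill_gpus
    # Recover each worker pool's total throughput, then divide by the full GPU
    # count so the metric is comparable to non-disaggregated setups.
    output_per_gpu = output_per_decode_gpu * n_decode_gpus / total_gpus
    input_per_gpu = input_per_prefill_gpu * n_prefill_gpus / total_gpus
    return {
        "output_per_gpu": output_per_gpu,
        "input_per_gpu": input_per_gpu,
        "total_per_gpu": output_per_gpu + input_per_gpu,
    }
```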
To Reproduce
Steps to reproduce the behavior:
- Go to inferencemax.semianalysis.com
- View results including a disaggregated config, e.g. R1 FP8 1k/1k
- Switch the y-axis metric between the different token throughput metrics
Expected behavior
See the issue description above.
Specifically, looking at the R1 FP8 GB200 NVL72 conc=4096 1k/1k results, we currently get:
- total per GPU: 3822.27
- output per GPU: 2867.344
- input per GPU: 5732.173
I believe the more comparable metric would be:
- total per GPU: 3822.27
- output per GPU: 2867.344 * (48/72) = 1911.56
- input per GPU: 5732.173 * (24/72) = 1910.72
Now, just as with the non-disaggregated setups, the total throughput per GPU is the sum of the output and input per-GPU figures, and it is representative of the average across the whole cluster. Ideally, the tooltip for the input/output per GPU figures on disaggregated datapoints would include a footnote clarifying how the value was averaged, such as:
Input token throughput per GPU*: 1911.56
*: Averaged across the whole 72-GPU cluster
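Plugging the GB200 NVL72 numbers above into the sketch from the bug description (assuming 48 decode GPUs and 24 prefill GPUs out of 72 total, which is how I read the 48/72 and 24/72 factors):

```python
metrics = cluster_avg_per_gpu(output_per_decode_gpu=2867.344,
                              input_per_prefill_gpu=5732.173,
                              n_decode_gpus=48,
                              n_prefill_gpus=24)
# metrics["output_per_gpu"] ≈ 1911.56
# metrics["input_per_gpu"]  ≈ 1910.72
# metrics["total_per_gpu"]  ≈ 3822.29, matching the reported total per GPU up to rounding
```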
For completeness, let's look at a datapoint from a non-disaggregated setup. Taking the mi325x conc=64 figures for example:
- mi325x conc=64
  - total per GPU: 412.874 (= sum of output and input per GPU)
  - output per GPU: 206.483
  - input per GPU: 206.39
Screenshots
Desktop (please complete the following information):
N/A