Description
One of the arguments always passed to the underlying `bench_serving` script is `--random-range-ratio`, which the wrapper in `benchmark_lib.sh` supplies. It defaults to 0.8 in `benchmark-tmpl.yml` and `benchmark-multinode-tmpl.yml` and is not overridden elsewhere. The argument is ultimately used in `sample_random_requests` to sample input and output lengths between `range_ratio * {input,output}_len` and `{input,output}_len`.
As a result, the average lengths will be ~90% of the advertised figure (the midpoint of the 0.8 to 1.0 range, assuming lengths are drawn uniformly). For example, a workload advertised as having 8k (8192) input or output tokens will actually average ~7373 tokens. The throughput figures are calculated from the actual input and output sequence lengths, so they do match what was observed. But the overall consequence is:
- The reported end-to-end latency is misleading, as it represents the latency for shorter sequence lengths than advertised.
- As the cost of serving a query doesn't necessarily scale linearly (O(n)) with sequence length, the tokens/second and throughput figures will also be slightly better than if the input and output sequence lengths had averaged the advertised number of tokens for the workload.
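
The ~90% figure can be checked with a minimal sketch of the sampling. This is not the actual `sample_random_requests` code: it assumes lengths are drawn as uniform integers over the inclusive range from `int(range_ratio * len)` to `len`, and the helper names `sample_lengths` and `expected_mean` are hypothetical.

```python
import random


def sample_lengths(target_len, range_ratio, num_requests, seed=0):
    """Draw request lengths uniformly from [int(range_ratio * target_len), target_len].

    Assumption: uniform integer sampling over an inclusive range; the real
    benchmark script may differ in detail.
    """
    rng = random.Random(seed)
    low = max(int(target_len * range_ratio), 1)
    return [rng.randint(low, target_len) for _ in range(num_requests)]


def expected_mean(target_len, range_ratio):
    """Mean of a uniform integer distribution over the inclusive range."""
    low = max(int(target_len * range_ratio), 1)
    return (low + target_len) / 2


advertised = 8192
lengths = sample_lengths(advertised, 0.8, 100_000)
empirical_mean = sum(lengths) / len(lengths)

print(f"analytic mean:  {expected_mean(advertised, 0.8):.1f}")
print(f"empirical mean: {empirical_mean:.1f}")
print(f"fraction of advertised length: {empirical_mean / advertised:.3f}")
```

Under these assumptions the analytic mean for an 8192-token workload is 7372.5 tokens. Setting the ratio to 1.0 collapses the range to a single value, so every request would have exactly the advertised length.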