
[feature suggestion for vllm/vllm benchmark_serving] #357

@asb

Description


The various request functions in backend_request_func.py set output.success = False if they don't get an HTTP 200 status code back for a request. There is no logic to retry a refused request, and metrics are calculated skipping any failed requests. This means an overloaded server will score better on this benchmark for metrics like E2E latency and TTFT if it refuses requests rather than accepting them and serving them slowly. As the number of failed requests isn't included in the results JSON, it isn't easy to tell whether this is a factor in any given benchmark run.
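A minimal sketch of the bias described above (illustrative only; the field names mirror the `success`/`latency` attributes on the benchmark's request outputs, but this is not vLLM's actual code):

```python
def mean_e2e_latency(outputs):
    # Failed requests are skipped, mirroring the behavior described above.
    latencies = [o["latency"] for o in outputs if o["success"]]
    return sum(latencies) / len(latencies)

# A server that accepts everything but serves slowly under load:
accepting = [{"success": True, "latency": 5.0}] * 10

# A server that refuses half its requests and serves the rest quickly:
refusing = (
    [{"success": True, "latency": 1.0}] * 5
    + [{"success": False, "latency": 0.0}] * 5
)

# The refusing server reports a far better mean E2E latency (1.0 vs 5.0)
# despite completing half as many requests.
```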

If one setup refuses requests under load and another accepts them, there doesn't seem to be a fair way to compare these metrics directly. Hopefully this isn't actually happening; adding the failure rate to the results output would make it possible to check, and to investigate if it does happen.
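The suggested change could look something like the following (a hypothetical sketch; the `summarize` helper and the result keys are illustrative, not vLLM's existing API):

```python
def summarize(outputs):
    """Count successes and failures and report the failure rate
    alongside the other benchmark results."""
    completed = sum(1 for o in outputs if o["success"])
    failed = len(outputs) - completed
    return {
        "completed": completed,
        "failed": failed,
        "failure_rate": failed / len(outputs) if outputs else 0.0,
    }

results = summarize(
    [{"success": True}, {"success": True}, {"success": False}]
)
```

With `failed` and `failure_rate` in the results JSON, anyone comparing two runs can immediately see whether one of them is benefiting from refused requests.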
