
[Question] GPT-OSS-120B benchmark environment requirements - Driver/CUDA version clarification needed #393

@k-l-lambda

Summary

I am attempting to reproduce the InferenceMAX GPT-OSS-120B benchmarks on a RunPod B200, but my vLLM results show a significant performance gap compared to the published SemiAnalysis numbers. I would appreciate clarification on the environment setup and configuration used.

My Environment

| Component | Version |
| --- | --- |
| GPU | NVIDIA B200 (183 GB VRAM, SM100) |
| Driver | 570.195.03 |
| CUDA | 12.8.93 |
| Platform | RunPod |
| vLLM | 0.13.0 |
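For reference, here is a minimal sketch of how the versions above can be captured on the instance. The use of `pynvml` and `torch` is my own assumption for illustration; the InferenceMAX containers may use different tooling:

```python
# Environment-capture sketch; pynvml/torch usage here is an assumption,
# not part of the InferenceMAX benchmark scripts.
import pynvml
import torch
import vllm

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
print("GPU:   ", pynvml.nvmlDeviceGetName(handle))            # e.g. NVIDIA B200
print("Driver:", pynvml.nvmlSystemGetDriverVersion())         # e.g. 570.195.03
pynvml.nvmlShutdown()

print("CUDA (torch build):", torch.version.cuda)              # e.g. 12.8
print("Compute capability:", torch.cuda.get_device_capability(0))  # (10, 0) = SM100
print("vLLM:", vllm.__version__)                              # e.g. 0.13.0
```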

Performance Gap

Comparing at similar throughput levels shows a large latency gap:

| Source | Output Throughput | E2E Latency | Concurrency |
| --- | --- | --- | --- |
| SemiAnalysis vLLM | ~4,666 tok/s | ~10s | C=128 |
| Our vLLM | ~3,663 tok/s | ~27s | C=100 |
| Our vLLM | ~5,051 tok/s | ~40s | C=200 |

At comparable latency (~10s), SemiAnalysis achieves ~4,666 tok/s while our setup would reach only around ~2,000 tok/s, roughly a 2x performance gap.

Our Full Results

| Concurrency | Output Throughput | E2E Latency |
| --- | --- | --- |
| C=1 | 215 tok/s | 4.6s |
| C=20 | 1,370 tok/s | 14.6s |
| C=50 | 2,427 tok/s | 20.6s |
| C=100 | 3,663 tok/s | 27.3s |
| C=200 | 5,051 tok/s | 39.6s |
| C=300 | 5,725 tok/s | 52.4s |
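For readers interpreting these columns: output throughput is aggregate generated tokens per second across all requests, and E2E latency is per-request completion time. Below is a simplified sketch of one way to measure both at a fixed concurrency against a vLLM OpenAI-compatible server; the endpoint, model name, prompt, and `max_tokens` are illustrative placeholders, not the actual InferenceMAX harness configuration:

```python
# Simplified concurrency-benchmark sketch (illustrative only, not the
# InferenceMAX harness). Assumes a vLLM OpenAI-compatible server is
# already running at the placeholder endpoint below.
import time
from concurrent.futures import ThreadPoolExecutor

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
CONCURRENCY = 100   # matches the C=100 row above
MAX_TOKENS = 1000   # placeholder output length

def one_request(_):
    start = time.perf_counter()
    resp = client.completions.create(
        model="openai/gpt-oss-120b",
        prompt="Explain the history of GPUs in one paragraph.",  # placeholder
        max_tokens=MAX_TOKENS,
    )
    return resp.usage.completion_tokens, time.perf_counter() - start

wall_start = time.perf_counter()
with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    results = list(pool.map(one_request, range(CONCURRENCY)))
wall = time.perf_counter() - wall_start

total_tokens = sum(tokens for tokens, _ in results)
mean_latency = sum(lat for _, lat in results) / len(results)
print(f"Output throughput: {total_tokens / wall:.0f} tok/s")
print(f"Mean E2E latency:  {mean_latency:.1f}s")
```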

Questions

  1. What driver version was used? RunPod B200 instances come with driver 570.x (CUDA 12.8). Is driver 575+ (CUDA 13) required for optimal performance?

  2. What cloud platform was used? Different platforms may have different driver/software stacks.

  3. Is Docker required? The benchmark scripts reference Docker containers.

Any guidance on configuration or environment requirements would be appreciated. Thank you!
