
[Question] GPT-OSS-120B benchmark environment requirements - Driver/CUDA version clarification needed #393

@k-l-lambda

Summary

I am attempting to reproduce the InferenceMAX GPT-OSS-120B benchmarks on a RunPod B200, but my vLLM results show a significant performance gap compared to the published SemiAnalysis numbers. I would appreciate clarification on the environment setup and configuration used.

My Environment

| Component | Version |
| --- | --- |
| GPU | NVIDIA B200 (183 GB VRAM, SM100) |
| Driver | 570.195.03 |
| CUDA | 12.8.93 |
| Platform | RunPod |
| vLLM | 0.13.0 |
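For reference, here is a minimal sketch of how the versions above can be captured on the instance. The use of `pynvml` and `torch` is my own assumption for illustration; the InferenceMAX containers may use different tooling:

```python
# Environment-capture sketch; pynvml/torch usage here is an assumption,
# not part of the InferenceMAX benchmark scripts.
import pynvml
import torch
import vllm

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
print("GPU:   ", pynvml.nvmlDeviceGetName(handle))            # e.g. NVIDIA B200
print("Driver:", pynvml.nvmlSystemGetDriverVersion())         # e.g. 570.195.03
pynvml.nvmlShutdown()

print("CUDA (torch build):", torch.version.cuda)              # e.g. 12.8
print("Compute capability:", torch.cuda.get_device_capability(0))  # (10, 0) = SM100
print("vLLM:", vllm.__version__)                              # e.g. 0.13.0
```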

Performance Gap

Comparing at similar throughput levels shows a large latency gap:

| Source | Output Throughput | E2E Latency | Concurrency |
| --- | --- | --- | --- |
| SemiAnalysis vLLM | ~4,666 tok/s | ~10s | C=128 |
| Our vLLM | ~3,663 tok/s | ~27s | C=100 |
| Our vLLM | ~5,051 tok/s | ~40s | C=200 |

At comparable latency (~10s), SemiAnalysis achieves ~4,666 tok/s while our setup would reach only around ~2,000 tok/s, roughly a 2x performance gap.

Our Full Results

| Concurrency | Output Throughput | E2E Latency |
| --- | --- | --- |
| C=1 | 215 tok/s | 4.6s |
| C=20 | 1,370 tok/s | 14.6s |
| C=50 | 2,427 tok/s | 20.6s |
| C=100 | 3,663 tok/s | 27.3s |
| C=200 | 5,051 tok/s | 39.6s |
| C=300 | 5,725 tok/s | 52.4s |
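For readers interpreting these columns: output throughput is aggregate generated tokens per second across all requests, and E2E latency is per-request completion time. Below is a simplified sketch of one way to measure both at a fixed concurrency against a vLLM OpenAI-compatible server; the endpoint, model name, prompt, and `max_tokens` are illustrative placeholders, not the actual InferenceMAX harness configuration:

```python
# Simplified concurrency-benchmark sketch (illustrative only, not the
# InferenceMAX harness). Assumes a vLLM OpenAI-compatible server is
# already running at the placeholder endpoint below.
import time
from concurrent.futures import ThreadPoolExecutor

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
CONCURRENCY = 100   # matches the C=100 row above
MAX_TOKENS = 1000   # placeholder output length

def one_request(_):
    start = time.perf_counter()
    resp = client.completions.create(
        model="openai/gpt-oss-120b",
        prompt="Explain the history of GPUs in one paragraph.",  # placeholder
        max_tokens=MAX_TOKENS,
    )
    return resp.usage.completion_tokens, time.perf_counter() - start

wall_start = time.perf_counter()
with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    results = list(pool.map(one_request, range(CONCURRENCY)))
wall = time.perf_counter() - wall_start

total_tokens = sum(tokens for tokens, _ in results)
mean_latency = sum(lat for _, lat in results) / len(results)
print(f"Output throughput: {total_tokens / wall:.0f} tok/s")
print(f"Mean E2E latency:  {mean_latency:.1f}s")
```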

Questions

  1. What driver version was used? RunPod B200 instances come with driver 570.x (CUDA 12.8). Is driver 575+ (CUDA 13) required for optimal performance?

  2. What cloud platform was used? Different platforms may have different driver/software stacks.

  3. Is Docker required? The benchmark scripts reference Docker containers.

Any guidance on configuration or environment requirements would be appreciated. Thank you!
