
[Model] Add OpenVLA model support#32390

Open
PalmDr wants to merge 6 commits into vllm-project:main from PalmDr:openvla-tests

Conversation

PalmDr commented Jan 15, 2026

Summary

This PR adds OpenVLA (Open Vision-Language-Action) model support to vLLM for robotic manipulation tasks. This addresses issue #14739.

Relationship to Previous Work

This PR builds on and completes the earlier work in PR #29738 by @yongming-qin. While that PR implemented the initial model architecture but noted that "the results of vllm and Transformers are different," this PR provides a complete, validated implementation.

Validation Results

We validated vLLM's OpenVLA outputs against the HuggingFace reference implementation (externally, in a separate test repo):

| Check | Status |
|-------|--------|
| Action tokens in correct range [31744, 31999] | ✅ PASS |
| Deterministic outputs with temperature=0 | ✅ PASS |
| HF comparison (4/5 exact token match) | ✅ PASS |

Note: HuggingFace's OpenVLA requires transformers==4.40.1. Newer versions may produce degenerate outputs.

Architecture

OpenVLA-7B Architecture:
┌─────────────┐   ┌─────────────┐
│   DINOv2    │   │   SigLIP    │   Vision Encoders
│  (1024-dim) │   │  (1152-dim) │
└──────┬──────┘   └──────┬──────┘
       └────────┬────────┘
                ▼
        ┌───────────────┐
        │ Concat (2176) │   Fused Features
        └───────┬───────┘
                ▼
        ┌───────────────┐
        │  MLP Projector│   3-layer, GELU
        │  → 4096-dim   │
        └───────┬───────┘
                ▼
        ┌───────────────┐
        │  Llama-2-7B   │   Language Model
        └───────┬───────┘
                ▼
        ┌───────────────┐
        │ 7 Action Tkns │   Robot Control
        └───────────────┘
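
The dimension flow in the diagram can be checked with a few lines of arithmetic. This is a shape-only sketch, not the PR's implementation: the hidden widths (2176 -> 8704 -> 4096 -> 4096) are taken from the commit message later in this thread, and the assumption that every projector layer has a bias term is mine.

```python
# Feature widths from the architecture diagram above.
DINOV2_DIM = 1024
SIGLIP_DIM = 1152

# Projector layer widths as listed in the commit message in this thread.
PROJECTOR_DIMS = [2176, 8704, 4096, 4096]


def fused_dim() -> int:
    """Width of the concatenated DINOv2 + SigLIP patch features."""
    return DINOV2_DIM + SIGLIP_DIM


def projector_param_count(dims: list[int]) -> int:
    """Parameter count of an MLP with the given widths.

    Assumes one weight matrix plus one bias vector per layer
    (an assumption; the source does not state whether biases are used).
    """
    return sum(d_in * d_out + d_out for d_in, d_out in zip(dims, dims[1:]))


# The fused width must equal the projector's input width.
assert fused_dim() == PROJECTOR_DIMS[0]
print(projector_param_count(PROJECTOR_DIMS))
```

Under these assumptions the projector is small next to the 7B language model, so most of the weight-loading work sits in the Llama-2 and timm backbones.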

Action Token Encoding

OpenVLA uses 256 bins mapped to vocabulary positions [31744, 31999]:

# Encoding (action → token)
bin_index = min(int((action + 1) * 128), 255)  # action ∈ [-1, 1] → bin ∈ [0, 255]; clamp so action = 1 stays in range
token_id = 32000 - bin_index - 1     # bin → token ∈ [31744, 31999]

# Decoding (token → action)
bin_index = 32000 - token_id - 1
action = (2 * bin_index + 1) / 256 - 1
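
As a runnable illustration of the mapping above, here is a minimal round-trip sketch. It follows the simplified uniform binning in this description, with the top bin clamped so that action = 1.0 stays in range; the function names are hypothetical, and the reference OpenVLA tokenizer may bin values differently.

```python
VOCAB_SIZE = 32000  # vocabulary size used in the formulas above
N_BINS = 256        # number of action bins


def encode_action(action: float) -> int:
    """Map a normalized action in [-1, 1] to an action token ID."""
    if not -1.0 <= action <= 1.0:
        raise ValueError(f"action {action} is outside [-1, 1]")
    # Clamp so that action == 1.0 lands in the last bin instead of bin 256.
    bin_index = min(int((action + 1) * 128), N_BINS - 1)
    return VOCAB_SIZE - bin_index - 1


def decode_token(token_id: int) -> float:
    """Map an action token ID back to the center of its bin."""
    bin_index = VOCAB_SIZE - token_id - 1
    if not 0 <= bin_index < N_BINS:
        raise ValueError(f"token {token_id} is not an action token")
    return (2 * bin_index + 1) / N_BINS - 1


if __name__ == "__main__":
    for a in (-1.0, -0.25, 0.0, 0.5, 1.0):
        t = encode_action(a)
        print(a, t, decode_token(t))
```

Because decoding returns bin centers, a round trip is lossy: the recovered value differs from the input by at most half a bin width (1/256).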

Changes

| File | Description |
|------|-------------|
| vllm/model_executor/models/openvla.py | Full model implementation (~1000 lines) |
| vllm/transformers_utils/configs/openvla.py | Config class |
| vllm/model_executor/models/registry.py | Model registration |
| vllm/transformers_utils/configs/__init__.py | Config registration |

Test Plan

  • Validate model loads correctly with vllm.LLM("openvla/openvla-7b")
  • Verify action tokens are in expected range [31744, 31999]
  • Confirm deterministic outputs with temperature=0
  • Compare outputs against HuggingFace reference (external validation)
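
The range and determinism checks in this plan can be written as plain predicates over generated token IDs. The sketch below is illustrative only: the helper names and the idea of passing in pre-generated token lists are hypothetical and are not taken from the PR's test file.

```python
ACTION_TOKEN_MIN = 31744
ACTION_TOKEN_MAX = 31999
NUM_ACTION_TOKENS = 7  # one token per action dimension (xyz, rpy, gripper)


def is_valid_action_sequence(token_ids: list[int]) -> bool:
    """True if the output is exactly 7 tokens, all inside the action range."""
    return len(token_ids) == NUM_ACTION_TOKENS and all(
        ACTION_TOKEN_MIN <= t <= ACTION_TOKEN_MAX for t in token_ids
    )


def is_deterministic(runs: list[list[int]]) -> bool:
    """True if repeated greedy (temperature=0) runs produced identical tokens."""
    return all(run == runs[0] for run in runs)
```

In a real test the token lists would come from repeated generations on the same prompt and image; here they are left as inputs so the checks stay independent of any GPU setup.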

Fixes #14739

🤖 Generated with Claude Code

@github-actions


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run fastcheck CI, a small and essential subset of CI tests that catches errors quickly.

You can ask your reviewers to trigger select CI tests on top of fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

🚀

mergify Bot added the multi-modality label Jan 15, 2026
@mergify

mergify Bot commented Jan 15, 2026

Contributor

Hi @PalmDr, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?
mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request introduces a comprehensive testing framework for Vision-Language-Action (VLA) models, which is a valuable addition for ensuring correctness and preventing future regressions. The tests for consistency against the HuggingFace implementation, action token range validation, and determinism are well-structured and thorough. I've identified one high-severity issue in the dependency version checking logic that should be addressed to ensure tests run in a compatible environment.

Comment thread tests/models/multimodal/action/test_vla_models.py Outdated

@cursor cursor Bot left a comment


Comment @cursor review or bugbot run to trigger another review on this PR

Comment thread tests/models/multimodal/action/vla_utils/runners.py Outdated
Comment thread tests/models/multimodal/action/test_vla_models.py Outdated
PalmDr changed the title from "[Model][Test] Add VLA model tests for OpenVLA consistency validation" to "[Model] Add OpenVLA model support" Jan 15, 2026
mergify Bot added the new-model label Jan 15, 2026

@DarkLight1337 DarkLight1337 left a comment

Member

Some initial comments

Comment thread vllm/model_executor/models/openvla.py Outdated
return num_vision_tokens

@staticmethod
def decode_action_tokens(

Member

Looks unused

Author

Done - removed this unused method.

Comment thread vllm/model_executor/models/openvla.py Outdated


@MULTIMODAL_REGISTRY.register_processor(
_build_openvla_processor,

Member

Just directly pass OpenVLAMultiModalProcessor here

Author

Done - now passing OpenVLAMultiModalProcessor directly.

Comment thread vllm/model_executor/models/openvla.py Outdated
This overrides the base implementation to use our custom preprocessing
that precomputes both DINOv2 and SigLIP normalizations in float32.
"""
import PIL.Image

Member

Move imports to the top

Author

Done - moved PIL.Image and numpy imports to the top of the file.

Comment thread vllm/model_executor/models/openvla.py Outdated
img_size = self.image_sizes[0] if self.image_sizes else 224

# DINOv2 encoder
self.featurizer = timm.create_model(

Member

Only the timm models need to be moved to device. I suggest following the code structure of the other models that do this

Author

Done - now using set_default_torch_dtype(torch.float16) context manager when creating timm models (following minicpmv.py pattern), then converting to default dtype. Added _move_timm_to_device() using current_platform.device_type.

Comment thread vllm/model_executor/models/openvla.py Outdated
return self.fc2(self.act(self.fc1(x)))


class ViTAttention(nn.Module):

Member

Use MMEncoderAttention to apply the attention

Author

Done - removed all the unused ViT encoder classes (LayerScale, ViTMLP, ViTAttention, DINOv2Block, SigLIPBlock, AttentionPooling, DINOv2Encoder, SigLIPEncoder) since we use timm models directly which have their own optimized attention implementations.

Comment thread vllm/model_executor/models/openvla.py Outdated
# Precompute normalization conversion (ImageNet -> SigLIP)
self._init_norm_conversion()

def _init_norm_conversion(self):

Member

This should be done in the multi-modal processor rather than in the model

Author

Done - removed _init_norm_conversion() and _convert_imagenet_to_siglip() from the model. All normalization is now handled in OpenVLAMultiModalProcessor._preprocess_image_6channel(). The vision backbone forward method now requires 6-channel input and will raise an error if 3-channel input is provided.


Comment thread vllm/model_executor/models/openvla.py Outdated
loader = AutoWeightsLoader(self)
loaded = loader.load_weights(transform_weights(), mapper=self.hf_to_vllm_mapper)

# Move timm models to device after weights are loaded

Member

Why can't we move the timm models to the device immediately when they are loaded?

Comment thread vllm/model_executor/models/openvla.py Outdated
Comment on lines +165 to +166
dinov2_pixels = dinov2_pixels.to(device=device, dtype=model_dtype)
siglip_pixels = siglip_pixels.to(device=device, dtype=model_dtype)

Member

This is likely not necessary since vLLM should handle it already

@DarkLight1337

Member

Btw, have you tried using the Transformers backend to run the model in vLLM? If it works, then you don't need to add a custom implementation to the repo.

@PalmDr

PalmDr commented Jan 16, 2026

Author

Thanks for the suggestion @DarkLight1337! I investigated the Transformers backend but it won't work for OpenVLA because:

  1. Custom vision architecture: OpenVLA uses Prismatic VLM which fuses DINOv2 + SigLIP encoders. These are loaded via timm models, not standard HuggingFace vision encoders.

  2. Non-standard processor: OpenVLA's PrismaticProcessor doesn't implement the _get_num_multimodal_tokens method that vLLM's Transformers backend expects for multimodal models.

  3. Special preprocessing: OpenVLA requires 6-channel preprocessing (3 channels for DINOv2 normalization + 3 for SigLIP normalization) which needs custom handling.
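
To make the 6-channel idea in point 3 concrete, here is a minimal pure-Python sketch. It assumes DINOv2 uses the standard ImageNet mean/std and SigLIP uses 0.5/0.5 normalization; those constants, the channel-last layout, and the function names are assumptions for illustration and are not copied from the PR's processor.

```python
# Assumed normalization constants (ImageNet statistics for DINOv2,
# symmetric 0.5/0.5 scaling for SigLIP).
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)
SIGLIP_MEAN = (0.5, 0.5, 0.5)
SIGLIP_STD = (0.5, 0.5, 0.5)


def normalize(pixel, mean, std):
    """Scale one 8-bit RGB pixel to [0, 1], then apply per-channel mean/std."""
    return [(c / 255.0 - m) / s for c, m, s in zip(pixel, mean, std)]


def to_six_channels(image):
    """Turn an H x W grid of (r, g, b) pixels into H x W x 6 floats.

    Channels 0-2 hold the DINOv2-normalized copy, channels 3-5 the
    SigLIP-normalized copy of the same pixel.
    """
    return [
        [
            normalize(px, IMAGENET_MEAN, IMAGENET_STD)
            + normalize(px, SIGLIP_MEAN, SIGLIP_STD)
            for px in row
        ]
        for row in image
    ]


if __name__ == "__main__":
    tiny = [[(255, 0, 128), (0, 0, 0)]]  # a 1 x 2 test image
    print(to_six_channels(tiny)[0][0])
```

A real processor would also resize the image and would typically emit a channel-first tensor; the point here is only that both normalizations are computed once, up front, from the same source pixels.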

I've also pushed a fix addressing your feedback about moving timm models to device immediately during initialization.

@yongming-qin

Contributor

Thanks for collaborating on the OpenVLA integration. I noticed that this PR initially had some test files, but they were removed later. Comparing the vLLM and HF results for OpenVLA can prove the correctness of the vLLM OpenVLA implementation.

How about setting up an external github repo for the test files? @DarkLight1337 @PalmDr

mm_counts = mm_items.get_all_counts()
num_images = mm_counts.get("image", 0)

if num_images == 0:

Member

This method shouldn't be called with count = 0 so we don't need to handle this case

Member

And even if it's called, we should return empty pixel values instead of making a dummy image

@DarkLight1337

Member

> I noticed that this PR initially had some test files, but they were removed later. Comparing the vLLM and HF results for OpenVLA can prove the correctness of the vLLM OpenVLA implementation.

The original test code had too many wrapper functions that are specific to this model. I prefer a test that compares the direct output of vLLM with the equivalent code in HF, so we don't need to install additional repositories.
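
A direct comparison of this kind needs very little machinery once both backends have produced token IDs. The following is a minimal sketch under that assumption; `exact_match_rate` is a hypothetical helper, and the token lists would come from vLLM and HF generation on the same prompts and images.

```python
def exact_match_rate(
    vllm_outputs: list[list[int]], hf_outputs: list[list[int]]
) -> float:
    """Fraction of samples whose full token sequences are identical."""
    if len(vllm_outputs) != len(hf_outputs):
        raise ValueError("both backends must produce the same number of samples")
    if not vllm_outputs:
        raise ValueError("need at least one sample to compare")
    matches = sum(v == h for v, h in zip(vllm_outputs, hf_outputs))
    return matches / len(vllm_outputs)
```

A test could then assert that the rate meets a chosen threshold, such as the 4/5 exact-match result reported in the PR description.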

@@ -387,6 +387,7 @@
"MolmoForCausalLM": ("molmo", "MolmoForCausalLM"),
"Molmo2ForConditionalGeneration": ("molmo2", "Molmo2ForConditionalGeneration"),
"NVLM_D": ("nvlm_d", "NVLM_D_Model"),
"OpenVLAForActionPrediction": ("openvla", "OpenVLAForActionPrediction"),

Member

You need to add the model to the test registry as well (the one with _HfExamplesInfo)


PalmDr and others added 6 commits January 24, 2026 18:44
OpenVLA is a 7B VLA (Vision-Language-Action) model for robotic manipulation
that outputs discretized robot action tokens (7D: xyz, rpy, gripper).

Architecture:
- Vision: DINOv2 + SigLIP fused backbone via timm
- Projector: 3-layer MLP (2176 -> 8704 -> 4096 -> 4096)
- LLM: Llama-2-7B generating 7 action tokens (256 bins each)

Key implementation details:
- Custom 6-channel preprocessing in processor for exact HF matching
- Uses timm models with proper dtype/device handling
- Action tokens use vocabulary positions [31744, 31999]

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Signed-off-by: Jiafan Yu <joyjfy@gmail.com>

- Add 'from err' to ImportError exception chain (B904)
- Remove unused imports (ClassVar, Final, MultiModalConfig, etc.)
- Sort imports with isort (I001)
- Fix line length violations by restructuring code
- Use X | None instead of Optional[X] (UP045)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Signed-off-by: Jiafan Yu <joyjfy@gmail.com>

- BaseDummyInputsBuilder: vllm.multimodal.processing, not profiling
- set_default_torch_dtype: vllm.utils.torch_utils, not vllm.utils

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Signed-off-by: Jiafan Yu <joyjfy@gmail.com>

Address reviewer feedback:
- Move device placement to _init_timm_models() instead of separate method
- Remove _move_timm_to_device() method
- Simplify load_weights() by removing separate device movement step

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Signed-off-by: Jiafan Yu <joyjfy@gmail.com>

vLLM handles tensor placement automatically, so explicit .to() calls
are not needed in the forward method.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Signed-off-by: Jiafan Yu <joyjfy@gmail.com>

Test compares vLLM output tokens against HuggingFace transformers reference
for 5 robot manipulation instructions. Expected result: 4/5 exact match (80%),
matching SGLang's achievement. Sample 3 fails due to low model confidence.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Signed-off-by: Jiafan Yu <joyjfy@gmail.com>

@mergify

mergify Bot commented Jan 30, 2026

Contributor

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @PalmDr.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@punkiestrahul


Hi @PalmDr,

I'm trying to run OpenVLA using your implementation from this PR, specifically your test file (tests/models/multimodal/test_openvla_consistency.py), but I'm encountering issues.

Environment

  • GPU: AMD Ryzen AI MAX (gfx1151, iGPU)
  • ROCm: 7.2
  • Python: 3.12
  • Transformers: 4.52.4

Files Added from Your PR

  • vllm/model_executor/models/openvla.py
  • vllm/transformers_utils/configs/openvla.py
  • Updated registry.py and configs/__init__.py

Issue 1: Build Failure

Tried to build your branch directly but got CUDA_HOME errors on ROCm. So I copied your OpenVLA files to latest vLLM main instead.

Issue 2: Runtime Error

python tests/models/multimodal/test_openvla_consistency.py

TypeError: _LazyConfigMapping.__init__() missing 1 required positional argument: 'mapping'

Root cause: vLLM V1 engine uses spawn multiprocessing, and transformers' _LazyConfigMapping fails during pickle deserialization.

What I Tried

  1. Moved the import `from transformers.models.auto import CONFIG_MAPPING` inside the function (as suggested in vLLM issue [Bug, V1]: Service launch failed with v1 code and custom models #11055)

  2. After fixing pickle issue, got config type mismatch:
    Expected: vllm.transformers_utils.configs.openvla.OpenVLAConfig
    Found: transformers_modules...configuration_prismatic.OpenVLAConfig

Questions

  1. Which vLLM version/commit did you test on?
  2. How did you build vLLM? Any specific build flags or environment setup?
  3. Did you use V1 or V0 engine? Any specific env flags like VLLM_USE_V1=0?
  4. Are there additional patches not included in the PR?
  5. Could you share the exact working setup and build steps?

Any guidance would be really helpful!

Thanks

mgehre-amd added a commit to mgehre-amd/vllm that referenced this pull request Mar 27, 2026
# Add OpenVLA (Vision-Language-Action) model support

## Summary

There are 3 separate commits, and I'd recommend reviewing each commit
independently.
1. Add OpenVLA (`openvla/openvla-7b`) model support to vLLM, based on
upstream PR by Jiafan Yu:
vllm-project#32390
2. Fix offline inference: dtype cast for pixel_values
(float32→bfloat16), use `PromptIndexTargets.start()` for image token
insertion (BOS may be absent in chat API path), register `OpenVLAConfig`
3. Support OpenVLA in HTTP serving and benchmarks: chat template
fallback, non-streaming JSON response fallback, TTFT/ITL inference for
non-streaming models, SSE `\r\n` normalization


## Test Results (Strix Halo / gfx1151, ROCm 7.13 nightly)

All tests run on a single Radeon 8060S (gfx1151, 32 GiB VRAM), ROCm 7.13
nightly, PyTorch 2.10.0+rocm7.13.0a20260317.

### OpenVLA performance tests

Each test processes unique 224x224 images one at a time
(`max_num_seqs=1`, `enable_prefix_caching=False`,
`mm_processor_cache_gb=0`). Warmup runs are excluded from measurement.
MSE is computed against ground-truth action values from the BridgeData
V2 dataset.

| Test | Mode | Throughput | TTFT | TPOT | E2E Latency | Avg MSE |
|------|------|-----------|------|------|-------------|---------|
| 1a: Offline (eager) | `--enforce-eager` | 11.88 tok/s | 194 ms | 65.9 ms | 589 ms | 0.048 |
| 1b: Offline (CUDAGraph) | default | 11.81 tok/s | 196 ms | 66.2 ms | 593 ms | 0.048 |
| 2: HTTP serve (vllm-bench) | `vllm serve` + `vllm bench serve` | n/a | 197 ms | 65.7 ms | 591 ms | n/a |

Notes:
- Test 1a/1b throughput is aggregate (total output tokens / total wall
time across all samples). Per-request E2E latency is consistent across
all three tests (~590 ms for 7 action tokens).
- Test 2 uses unique synthetic 224x224 images, `--max-concurrency 1`,
`--no-enable-prefix-caching`. No batching or image pre-caching.
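
As a sanity check on the table above, the reported latencies are consistent with the usual relationship E2E ≈ TTFT + (n − 1) × TPOT for n = 7 output tokens. The sketch below assumes that definition (TPOT averaged over the tokens after the first); the helper names are hypothetical and the benchmark tool may compute these metrics differently.

```python
def e2e_latency_ms(ttft_ms: float, tpot_ms: float, num_tokens: int) -> float:
    """Estimate end-to-end latency from time-to-first-token and per-token time."""
    return ttft_ms + (num_tokens - 1) * tpot_ms


def throughput_tok_per_s(num_tokens: int, e2e_ms: float) -> float:
    """Output tokens per second for a single request."""
    return num_tokens / (e2e_ms / 1000.0)


if __name__ == "__main__":
    # Test 1a figures from the table: TTFT 194 ms, TPOT 65.9 ms, 7 tokens.
    print(e2e_latency_ms(194, 65.9, 7))
    print(throughput_tok_per_s(7, 589))
```

With the Test 1a figures, the estimate lands within a millisecond of the reported 589 ms, and dividing 7 tokens by that latency reproduces the reported throughput to two decimal places.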

### LLM / VLM performance regression tests

Baseline measured on `origin/matthias.awq_gemv` (commit `6dd38c17c`)
before any OpenVLA changes. Post-change measured on the final branch
(`daf4aed3f`). No performance regression detected.

| Model | Metric | Baseline | Post-change | Delta |
|-------|--------|----------|-------------|-------|
| Qwen/Qwen3-4B-AWQ (LLM) | Prefill (tok/s) | 1213.65 | 1209.28 | −0.4% |
| | Decode (tok/s) | 72.9 | 72.9 | 0.0% |
| | TTFT (ms) | 105 | 106 | +1% |
| | E2E latency (ms) | 1848 | 1847 | −0.1% |
| Qwen/Qwen2-VL-2B-Instruct (VLM, `--synthetic-mm`) | Prefill (tok/s) | 220.98 | 220.30 | −0.3% |
| | Decode (tok/s) | 67.8 | 67.8 | 0.0% |
| | TTFT (ms) | 579 | 581 | +0.3% |
| | E2E latency (ms) | 2451 | 2455 | +0.2% |

### Functional correctness

Generated output was verified to be **byte-for-byte identical** before
and after the changes using deterministic prompts (`temperature=0`).

**LLM (Qwen3-4B-AWQ):** Sanity check with math prompts — identical
responses.

| Prompt | Baseline | Post-change | Match |
|--------|----------|-------------|-------|
| `1+1=` | `2 is a` | `2 is a` | **YES** |
| `2+3=` | `5,` | `5,` | **YES** |

**VLM (Qwen2-VL-2B-Instruct):** Three image-description prompts using a
deterministic synthetic test image (320x240 PNG with red rectangle, blue
rectangle, yellow ellipse on gradient background). All responses
identical.

| Prompt | Baseline | Post-change | Match |
|--------|----------|-------------|-------|
| "Describe the shapes and colors you see in this image." | "The image consists of three shapes: a red square, a yellow circle, and a blue square. The background is a gradient of colors, transitioning from dark blue at the top to a lighter blue at the bottom." | *(identical)* | **YES** |
| "How many distinct colored shapes are in this image?" | "There are three distinct colored shapes in the image: a red square, a yellow circle, and a blue square." | *(identical)* | **YES** |
| "What color is the rectangle on the left side of the image?" | "The rectangle on the left side of the image is red." | *(identical)* | **YES** |
@yiwen101


Hi @DarkLight1337 @PalmDr, I’m Wang Yiwen (same as in issue #42100). I’d like to take over this PR, rebase it, fix the remaining review comments, and get OpenVLA merged. Working on it this week.


Labels

ci/build, deepseek, documentation, frontend, multi-modality, needs-rebase, new-model, nvidia, v1


Development

Successfully merging this pull request may close these issues.

[New Model]:Can you support the VLA series models? For example, openVLA.

5 participants