[Model] Add OpenVLA model support #32390
|
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run You ask your reviewers to trigger select CI tests on top of Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: Add If you have any questions, please reach out to us on Slack at https://slack.vllm.ai. 🚀 |
|
Hi @PalmDr, the pre-commit checks have failed. Please run:
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
Then, commit the changes and push to your branch.
|
Code Review
This pull request introduces a comprehensive testing framework for Vision-Language-Action (VLA) models, which is a valuable addition for ensuring correctness and preventing future regressions. The tests for consistency against the HuggingFace implementation, action token range validation, and determinism are well-structured and thorough. I've identified one high-severity issue in the dependency version checking logic that should be addressed to ensure tests run in a compatible environment.
|
| return num_vision_tokens | ||
|
|
||
| @staticmethod | ||
| def decode_action_tokens( |
Done - removed this unused method.
|
|
||
|
|
||
| @MULTIMODAL_REGISTRY.register_processor( | ||
| _build_openvla_processor, |
Just directly pass OpenVLAMultiModalProcessor here
Done - now passing OpenVLAMultiModalProcessor directly.
| This overrides the base implementation to use our custom preprocessing | ||
| that precomputes both DINOv2 and SigLIP normalizations in float32. | ||
| """ | ||
| import PIL.Image |
Done - moved PIL.Image and numpy imports to the top of the file.
| img_size = self.image_sizes[0] if self.image_sizes else 224 | ||
|
|
||
| # DINOv2 encoder | ||
| self.featurizer = timm.create_model( |
Only the timm models need to be moved to device. I suggest following the code structure of the other models that do this
Done - now using set_default_torch_dtype(torch.float16) context manager when creating timm models (following minicpmv.py pattern), then converting to default dtype. Added _move_timm_to_device() using current_platform.device_type.
| return self.fc2(self.act(self.fc1(x))) | ||
|
|
||
|
|
||
| class ViTAttention(nn.Module): |
Use MMEncoderAttention to apply the attention
Done - removed all the unused ViT encoder classes (LayerScale, ViTMLP, ViTAttention, DINOv2Block, SigLIPBlock, AttentionPooling, DINOv2Encoder, SigLIPEncoder) since we use timm models directly which have their own optimized attention implementations.
| # Precompute normalization conversion (ImageNet -> SigLIP) | ||
| self._init_norm_conversion() | ||
|
|
||
| def _init_norm_conversion(self): |
This should be done in the multi-modal processor rather than in the model
Done - removed _init_norm_conversion() and _convert_imagenet_to_siglip() from the model. All normalization is now handled in OpenVLAMultiModalProcessor._preprocess_image_6channel(). The vision backbone forward method now requires 6-channel input and will raise an error if 3-channel input is provided.
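As a rough illustration of the 6-channel preprocessing described in this thread, the sketch below normalizes one RGB pixel twice and concatenates the results: once with ImageNet statistics (assumed here for the DINOv2 branch) and once with mean/std 0.5 (assumed for the SigLIP branch). The constants, the channel order, and the helper names are assumptions for illustration only; they are not taken from the PR's `_preprocess_image_6channel()`.

```python
# Assumed normalization constants (commonly used defaults, not read from this PR).
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)
SIGLIP_MEAN = (0.5, 0.5, 0.5)
SIGLIP_STD = (0.5, 0.5, 0.5)


def normalize(pixel, mean, std):
    """Normalize one RGB pixel whose values are already scaled to [0, 1]."""
    return tuple((p - m) / s for p, m, s in zip(pixel, mean, std))


def six_channel_pixel(pixel):
    """Hypothetical helper: 3 ImageNet-normalized channels, then 3 SigLIP-normalized."""
    return normalize(pixel, IMAGENET_MEAN, IMAGENET_STD) + normalize(
        pixel, SIGLIP_MEAN, SIGLIP_STD
    )


def imagenet_to_siglip(normalized_pixel):
    """Convert an ImageNet-normalized pixel to SigLIP normalization.

    This is the affine conversion that the review asked to move out of the
    model: undo the ImageNet normalization, then apply the SigLIP one.
    """
    raw = tuple(
        x * s + m for x, m, s in zip(normalized_pixel, IMAGENET_MEAN, IMAGENET_STD)
    )
    return normalize(raw, SIGLIP_MEAN, SIGLIP_STD)


if __name__ == "__main__":
    print(six_channel_pixel((0.2, 0.5, 0.9)))
```

Computing both normalizations directly from the raw pixel, as `six_channel_pixel` does, avoids the extra rounding that a normalize, de-normalize, re-normalize chain would introduce in reduced precision.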
|
|
| loader = AutoWeightsLoader(self) | ||
| loaded = loader.load_weights(transform_weights(), mapper=self.hf_to_vllm_mapper) | ||
|
|
||
| # Move timm models to device after weights are loaded |
Why can't we move the timm models to the device immediately when they are loaded?
| dinov2_pixels = dinov2_pixels.to(device=device, dtype=model_dtype) | ||
| siglip_pixels = siglip_pixels.to(device=device, dtype=model_dtype) |
This is likely not necessary since vLLM should handle it already
|
Btw, have you tried using Transformers backend to run the model in vLLM? If it works then you don't need to add a custom implementation to the repo. |
|
Thanks for the suggestion @DarkLight1337! I investigated the Transformers backend but it won't work for OpenVLA because:
I've also pushed a fix addressing your feedback about moving timm models to device immediately during initialization. |
|
Thanks for collaborating on the OpenVLA integration. I noticed this PR initially had some test files, but they were removed later. Comparing the vLLM and HF results for OpenVLA would demonstrate the correctness of the vLLM implementation, so how about setting up an external GitHub repo for the test files? @DarkLight1337 @PalmDr |
| mm_counts = mm_items.get_all_counts() | ||
| num_images = mm_counts.get("image", 0) | ||
|
|
||
| if num_images == 0: |
This method shouldn't be called with count = 0 so we don't need to handle this case
And even if it's called, we should return empty pixel values instead of making a dummy image
The original test code had too many wrapper functions that are specific to this model. I prefer a test that compares the direct output of vLLM with the equivalent code in HF, so we don't need to install additional repositories. |
| @@ -387,6 +387,7 @@ | |||
| "MolmoForCausalLM": ("molmo", "MolmoForCausalLM"), | |||
| "Molmo2ForConditionalGeneration": ("molmo2", "Molmo2ForConditionalGeneration"), | |||
| "NVLM_D": ("nvlm_d", "NVLM_D_Model"), | |||
| "OpenVLAForActionPrediction": ("openvla", "OpenVLAForActionPrediction"), | |||
You need to add the model to the test registry as well (the one with _HfExamplesInfo)
|
OpenVLA is a 7B VLA (Vision-Language-Action) model for robotic manipulation that outputs discretized robot action tokens (7D: xyz, rpy, gripper).
Architecture:
- Vision: DINOv2 + SigLIP fused backbone via timm
- Projector: 3-layer MLP (2176 -> 8704 -> 4096 -> 4096)
- LLM: Llama-2-7B generating 7 action tokens (256 bins each)
Key implementation details:
- Custom 6-channel preprocessing in processor for exact HF matching
- Uses timm models with proper dtype/device handling
- Action tokens use vocabulary positions [32000, 32255]
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Signed-off-by: Jiafan Yu <joyjfy@gmail.com>

- Add 'from err' to ImportError exception chain (B904)
- Remove unused imports (ClassVar, Final, MultiModalConfig, etc.)
- Sort imports with isort (I001)
- Fix line length violations by restructuring code
- Use X | None instead of Optional[X] (UP045)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Signed-off-by: Jiafan Yu <joyjfy@gmail.com>

- BaseDummyInputsBuilder: vllm.multimodal.processing, not profiling
- set_default_torch_dtype: vllm.utils.torch_utils, not vllm.utils
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Signed-off-by: Jiafan Yu <joyjfy@gmail.com>

Address reviewer feedback:
- Move device placement to _init_timm_models() instead of separate method
- Remove _move_timm_to_device() method
- Simplify load_weights() by removing separate device movement step
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Signed-off-by: Jiafan Yu <joyjfy@gmail.com>

vLLM handles tensor placement automatically, so explicit .to() calls are not needed in the forward method.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Signed-off-by: Jiafan Yu <joyjfy@gmail.com>

Test compares vLLM output tokens against HuggingFace transformers reference for 5 robot manipulation instructions. Expected result: 4/5 exact match (80%), matching SGLang's achievement. Sample 3 fails due to low model confidence.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Signed-off-by: Jiafan Yu <joyjfy@gmail.com>
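The exact-match figure quoted in the last commit message could be computed along the lines of the sketch below. The function name is hypothetical and the token ids in the usage example are made up; only the "4 of 5 sequences match exactly" shape of the result comes from the commit message.

```python
def exact_match_rate(reference_outputs, candidate_outputs):
    """Fraction of samples whose token sequences are identical.

    Each argument is a list of token-id sequences, one per prompt, in the
    same order (for example HF reference outputs vs. vLLM outputs).
    """
    if len(reference_outputs) != len(candidate_outputs):
        raise ValueError("both runs must cover the same number of prompts")
    if not reference_outputs:
        raise ValueError("need at least one sample")
    matches = sum(
        list(ref) == list(cand)
        for ref, cand in zip(reference_outputs, candidate_outputs)
    )
    return matches / len(reference_outputs)


if __name__ == "__main__":
    # Made-up token ids: five samples of 7 action tokens, one mismatch.
    hf = [[31900 + i] * 7 for i in range(5)]
    vllm_out = [list(seq) for seq in hf]
    vllm_out[2][4] += 1  # a single differing token breaks the exact match
    print(exact_match_rate(hf, vllm_out))
```

Exact match is deliberately strict: one differing token in a 7-token action counts the whole sample as a failure, which is why a low-confidence sample can lower the score even when most tokens agree.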
|
|
This pull request has merge conflicts that must be resolved before it can be |
|
Hi @PalmDr, I'm trying to run OpenVLA using your implementation from this PR. Specifically, I'm running your test file (tests/models/multimodal/test_openvla_consistency.py) but encountering issues.
Environment
Files Added from Your PR
Issue 1: Build Failure
Tried to build your branch directly but got CUDA_HOME errors on ROCm. So I copied your OpenVLA files to latest vLLM main instead.
Issue 2: Runtime Error
python tests/models/multimodal/test_openvla_consistency.py
TypeError: _LazyConfigMapping.__init__() missing 1 required positional argument: 'mapping'
Root cause: vLLM V1 engine uses spawn multiprocessing, and transformers' _LazyConfigMapping fails during pickle deserialization.
What I Tried
Questions
Any guidance would be really helpful! Thanks |
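One plausible mechanism for the `TypeError` reported above can be reproduced without transformers or vLLM. `collections.OrderedDict` reduces to `(cls, ())` for pickling, so unpickling starts by calling the class with no arguments; a subclass whose `__init__` requires an argument then fails with this kind of message. `LazyMapping` below is a hypothetical stand-in, and the assumption that `_LazyConfigMapping` behaves the same way is not verified here.

```python
from collections import OrderedDict


class LazyMapping(OrderedDict):
    """Hypothetical stand-in: an OrderedDict subclass with a required __init__ argument."""

    def __init__(self, mapping):
        super().__init__()
        self._mapping = mapping


def rebuild_like_pickle(obj):
    """Mimic the first step of unpickling: call the class with the reduce args."""
    cls, args = obj.__reduce__()[:2]
    # OrderedDict reports an empty args tuple, so this is effectively cls().
    return cls(*args)


def demo():
    """Return the error message raised when the object is rebuilt, if any."""
    try:
        rebuild_like_pickle(LazyMapping({"llama": "LlamaConfig"}))
    except TypeError as err:
        return str(err)
    return None


if __name__ == "__main__":
    print(demo())
```

If this is indeed the cause, the usual remedies are to avoid sending such an object across the process boundary, or to give the class a `__reduce__` that passes the required argument; which of these applies here is an open question.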
# Add OpenVLA (Vision-Language-Action) model support

## Summary

There are 3 separate commits, and I'd recommend reviewing each commit independently.

1. Add OpenVLA (`openvla/openvla-7b`) model support to vLLM, based on upstream PR by Jiafan Yu: vllm-project#32390
2. Fix offline inference: dtype cast for pixel_values (float32→bfloat16), use `PromptIndexTargets.start()` for image token insertion (BOS may be absent in chat API path), register `OpenVLAConfig`
3. Support OpenVLA in HTTP serving and benchmarks: chat template fallback, non-streaming JSON response fallback, TTFT/ITL inference for non-streaming models, SSE `\r\n` normalization

## Test Results (Strix Halo / gfx1151, ROCm 7.13 nightly)

All tests run on a single Radeon 8060S (gfx1151, 32 GiB VRAM), ROCm 7.13 nightly, PyTorch 2.10.0+rocm7.13.0a20260317.

### OpenVLA performance tests

Each test processes unique 224x224 images one at a time (`max_num_seqs=1`, `enable_prefix_caching=False`, `mm_processor_cache_gb=0`). Warmup runs are excluded from measurement. MSE is computed against ground-truth action values from the BridgeData V2 dataset.

| Test | Mode | Throughput | TTFT | TPOT | E2E Latency | Avg MSE |
|------|------|-----------|------|------|-------------|---------|
| 1a: Offline (eager) | `--enforce-eager` | 11.88 tok/s | 194 ms | 65.9 ms | 589 ms | 0.048 |
| 1b: Offline (CUDAGraph) | default | 11.81 tok/s | 196 ms | 66.2 ms | 593 ms | 0.048 |
| 2: HTTP serve (vllm-bench) | `vllm serve` + `vllm bench serve` | n/a | 197 ms | 65.7 ms | 591 ms | n/a |

Notes:
- Test 1a/1b throughput is aggregate (total output tokens / total wall time across all samples). Per-request E2E latency is consistent across all three tests (~590 ms for 7 action tokens).
- Test 2 uses unique synthetic 224x224 images, `--max-concurrency 1`, `--no-enable-prefix-caching`. No batching or image pre-caching.

### LLM / VLM performance regression tests

Baseline measured on `origin/matthias.awq_gemv` (commit `6dd38c17c`) before any OpenVLA changes. Post-change measured on the final branch (`daf4aed3f`). No performance regression detected.

| Model | Metric | Baseline | Post-change | Delta |
|-------|--------|----------|-------------|-------|
| Qwen/Qwen3-4B-AWQ (LLM) | Prefill (tok/s) | 1213.65 | 1209.28 | −0.4% |
| | Decode (tok/s) | 72.9 | 72.9 | 0.0% |
| | TTFT (ms) | 105 | 106 | +1% |
| | E2E latency (ms) | 1848 | 1847 | −0.1% |
| Qwen/Qwen2-VL-2B-Instruct (VLM, `--synthetic-mm`) | Prefill (tok/s) | 220.98 | 220.30 | −0.3% |
| | Decode (tok/s) | 67.8 | 67.8 | 0.0% |
| | TTFT (ms) | 579 | 581 | +0.3% |
| | E2E latency (ms) | 2451 | 2455 | +0.2% |

### Functional correctness

Generated output was verified to be **byte-for-byte identical** before and after the changes using deterministic prompts (`temperature=0`).

**LLM (Qwen3-4B-AWQ):** Sanity check with math prompts: identical responses.

| Prompt | Baseline | Post-change | Match |
|--------|----------|-------------|-------|
| `1+1=` | `2 is a` | `2 is a` | **YES** |
| `2+3=` | `5,` | `5,` | **YES** |

**VLM (Qwen2-VL-2B-Instruct):** Three image-description prompts using a deterministic synthetic test image (320x240 PNG with red rectangle, blue rectangle, yellow ellipse on gradient background). All responses identical.

| Prompt | Baseline | Post-change | Match |
|--------|----------|-------------|-------|
| "Describe the shapes and colors you see in this image." | "The image consists of three shapes: a red square, a yellow circle, and a blue square. The background is a gradient of colors, transitioning from dark blue at the top to a lighter blue at the bottom." | *(identical)* | **YES** |
| "How many distinct colored shapes are in this image?" | "There are three distinct colored shapes in the image: a red square, a yellow circle, and a blue square." | *(identical)* | **YES** |
| "What color is the rectangle on the left side of the image?" | "The rectangle on the left side of the image is red." | *(identical)* | **YES** |
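The OpenVLA latency figures above are consistent with the usual relation E2E ≈ TTFT + (n − 1) × TPOT for n = 7 output tokens. The sketch below checks that arithmetic against the reported numbers. The relation itself is a common convention assumed here rather than something stated in the description, and the function names are hypothetical.

```python
def estimate_e2e_ms(ttft_ms, tpot_ms, num_output_tokens):
    """First token costs TTFT; each remaining token costs one TPOT."""
    return ttft_ms + (num_output_tokens - 1) * tpot_ms


def throughput_tok_per_s(num_output_tokens, e2e_ms):
    """Single-request throughput implied by an end-to-end latency."""
    return num_output_tokens / (e2e_ms / 1000.0)


if __name__ == "__main__":
    # (test name, TTFT ms, TPOT ms, reported E2E ms) copied from the table above.
    rows = [("1a", 194, 65.9, 589), ("1b", 196, 66.2, 593), ("2", 197, 65.7, 591)]
    for name, ttft, tpot, reported in rows:
        estimate = estimate_e2e_ms(ttft, tpot, 7)
        print(name, round(estimate, 1), reported, round(throughput_tok_per_s(7, reported), 2))
```

With one request in flight at a time, the implied per-request throughput lands close to the reported aggregate figures, which fits the note that no batching was used.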
|
Hi @DarkLight1337 @PalmDr, I’m Wang Yiwen (same as in issue #42100). I’d like to take over this PR, rebase it, fix the remaining review comments, and get OpenVLA merged. Working on it this week. |
Summary
This PR adds OpenVLA (Open Vision-Language-Action) model support to vLLM for robotic manipulation tasks. This addresses issue #14739.
Relationship to Previous Work
This PR builds on and completes the earlier work in PR #29738 by @yongming-qin. That PR implemented the initial model architecture but noted that "the results of vllm and Transformers are different"; this PR provides a complete, validated implementation.
Validation Results
We validated vLLM's OpenVLA outputs against the HuggingFace reference implementation (externally, in a separate test repo):
[31744, 31999]
temperature=0
Architecture
Action Token Encoding
OpenVLA uses 256 bins mapped to vocabulary positions [31744, 31999]:
Changes
vllm/model_executor/models/openvla.py
vllm/transformers_utils/configs/openvla.py
vllm/model_executor/models/registry.py
vllm/transformers_utils/configs/__init__.py
Test Plan
vllm.LLM("openvla/openvla-7b")
Fixes #14739
🤖 Generated with Claude Code
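To make the action token encoding in the description concrete, here is a minimal sketch of turning token ids in [31744, 31999] back into normalized action values. Only the range and the bin count come from the description. The mapping direction (highest id to lowest bin), the [-1, 1] range, and the use of 255 bin centers are assumptions modeled on a common discretization scheme, and `decode_action_token` is a hypothetical name.

```python
ACTION_TOKEN_MIN = 31744  # lowest action token id, from the range quoted above
ACTION_TOKEN_MAX = 31999  # highest action token id
N_BINS = ACTION_TOKEN_MAX - ACTION_TOKEN_MIN + 1  # 256


def decode_action_token(token_id, low=-1.0, high=1.0):
    """Map one action token id to a normalized value (assumed scheme)."""
    if not ACTION_TOKEN_MIN <= token_id <= ACTION_TOKEN_MAX:
        raise ValueError(f"{token_id} is outside the action token range")
    # Assumption: the highest token id corresponds to the lowest bin.
    bin_number = ACTION_TOKEN_MAX + 1 - token_id  # 1..256
    # Assumption: 256 bin edges give 255 intervals; use the interval centers.
    n_centers = N_BINS - 1
    index = min(max(bin_number - 1, 0), n_centers - 1)
    width = (high - low) / n_centers
    return low + (index + 0.5) * width


def decode_action(token_ids):
    """Decode a full action, e.g. 7 tokens for xyz, rpy, gripper."""
    return [decode_action_token(t) for t in token_ids]


if __name__ == "__main__":
    print(decode_action([31999, 31900, 31872, 31871, 31850, 31800, 31744]))
```

The values produced this way are still normalized; mapping them to physical robot commands would need dataset-specific statistics that are outside the scope of this sketch.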