Parakeet nemotron encoder #23568
Conversation
Code Review
This pull request introduces audio processing and dynamic resolution support for the Nemotron-VL model, including the integration of the Parakeet audio encoder and a utility for extracting audio from video. Key enhancements include temporal video compression via tubelet grouping and ragged packing for variable-sized images. Feedback identifies a critical bug in the extract_feature method where a missing projection layer causes a dimension mismatch, and a style violation regarding an inline import that should be moved to the top of the file.
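For readers unfamiliar with tubelet grouping, the temporal compression mentioned above can be sketched in miniature: consecutive frames are grouped into "tubelets" and pooled, shrinking the token count by the tubelet size. The function and parameter names below are illustrative, not the PR's actual API.

```python
def group_into_tubelets(frame_embeds, tubelet_size=2):
    """Sketch of tubelet grouping (hypothetical helper, not the PR's code).

    frame_embeds: list of per-frame embedding vectors (T x D, as lists).
    Returns one mean-pooled embedding per tubelet (ceil(T / tubelet_size) x D).
    """
    tubelets = []
    for start in range(0, len(frame_embeds), tubelet_size):
        group = frame_embeds[start : start + tubelet_size]
        dim = len(group[0])
        # Mean-pool the frames belonging to this tubelet along the time axis.
        pooled = [sum(frame[d] for frame in group) / len(group) for d in range(dim)]
        tubelets.append(pooled)
    return tubelets
```

With a tubelet size of 2, a 3-frame clip collapses to 2 tokens per spatial position, which is the kind of temporal saving the PR description refers to.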
def extract_feature(self, pixel_values):
    # Process images in a micro-batch of at most 128 frames per call
    # This is done on purpose to ensure peak GPU ram usage of huge batch
    # (namely for really long videos with EVS ON) won't cause any problems
    # as we don't support chunked prefill for video media
    micro_batch_size = 128
    n = pixel_values.shape[0]
    patch_size = self.config.patch_size
    h_patches = pixel_values.shape[-2] // patch_size
    w_patches = pixel_values.shape[-1] // patch_size
    vit_embeds_list = []
    for i in range(0, n, micro_batch_size):
-       vit_embeds = self.vision_model(pixel_values[i : i + micro_batch_size])
-       vit_embeds = vit_embeds.to(dtype=torch.bfloat16)
-       h = w = int(vit_embeds.shape[1] ** 0.5)
-       vit_embeds = vit_embeds.reshape(vit_embeds.shape[0], h, w, -1)
+       chunk = pixel_values[i : i + micro_batch_size]
+       batch_size = chunk.shape[0]
+       vit_embeds = self.vision_model(chunk)
+       vit_embeds = vit_embeds.to(dtype=self.model_dtype)
+       vit_embeds = vit_embeds.reshape(batch_size, h_patches, w_patches, -1)
        vit_embeds = self.pixel_shuffle(
            vit_embeds, scale_factor=self.downsample_ratio
        )
        vit_embeds = vit_embeds.view(-1, self.rmsnorm_hidden_size)
        vit_embeds = self.mlp1(vit_embeds)
-       vit_embeds = vit_embeds.view(n, -1, self.rmsnorm_hidden_size)
+       vit_embeds = vit_embeds.view(batch_size, -1, self.llm_hidden_size)
        vit_embeds_list.append(vit_embeds)
    vit_embeds = torch.cat(vit_embeds_list, dim=0)
    return vit_embeds
The extract_feature method is missing the application of the mlp1 projection layer. This will result in features with incorrect dimensions (rmsnorm_hidden_size instead of llm_hidden_size), which could lead to runtime errors or incorrect model behavior. Other feature extraction methods in this file, like extract_feature_dynamic and extract_video_feature_temporal, correctly apply this projection. This method should be updated to include the mlp1 projection to ensure feature dimensions are correct.
def extract_feature(self, pixel_values):
    micro_batch_size = 128
    n = pixel_values.shape[0]
    patch_size = self.config.patch_size
    h_patches = pixel_values.shape[-2] // patch_size
    w_patches = pixel_values.shape[-1] // patch_size
    vit_embeds_list = []
    for i in range(0, n, micro_batch_size):
        chunk = pixel_values[i : i + micro_batch_size]
        batch_size = chunk.shape[0]
        vit_embeds = self.vision_model(chunk)
        vit_embeds = vit_embeds.to(dtype=self.model_dtype)
        vit_embeds = vit_embeds.reshape(batch_size, h_patches, w_patches, -1)
        vit_embeds = self.pixel_shuffle(
            vit_embeds, scale_factor=self.downsample_ratio
        )
        vit_embeds = vit_embeds.view(-1, self.rmsnorm_hidden_size)
        vit_embeds = self.mlp1(vit_embeds)
        vit_embeds = vit_embeds.view(batch_size, -1, self.llm_hidden_size)
        vit_embeds_list.append(vit_embeds)
    vit_embeds = torch.cat(vit_embeds_list, dim=0)
    return vit_embeds
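As a quick sanity check on the shape flow in this method, the per-image token count can be worked out arithmetically. The numbers below are illustrative defaults, not the PR's actual config values.

```python
def num_vision_tokens(height, width, patch_size=14, downsample_ratio=0.5):
    """Hedged sketch of the token-count arithmetic (illustrative defaults).

    A patch_size-14 ViT on a 448x448 image yields 32x32 patches; pixel
    shuffle with downsample_ratio 0.5 trades spatial resolution for
    channels, leaving 16x16 = 256 tokens per image.
    """
    h_patches = height // patch_size
    w_patches = width // patch_size
    # Pixel shuffle scales each spatial dimension by downsample_ratio and
    # folds the removed positions into the channel dimension.
    return int(h_patches * downsample_ratio) * int(w_patches * downsample_ratio)
```

This is why the final `view(batch_size, -1, self.llm_hidden_size)` is well defined: the flattened token dimension is fully determined by the patch grid and the downsample ratio.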
def _subsampling_output_length(self, length: int) -> int:
    import math
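On the inline-import style point flagged in the review summary: a conv-subsampling length helper typically looks roughly like the sketch below, with `math` imported at module level rather than inside the function. The kernel, stride, and padding values here are assumptions for illustration; the Parakeet encoder's actual parameters are not shown in this diff.

```python
import math


def subsampling_output_length(length, num_layers=2, kernel_size=3, stride=2, padding=1):
    """Hedged sketch of a conv-subsampling output-length formula
    (illustrative parameters, not the PR's actual encoder config)."""
    for _ in range(num_layers):
        # Each strided conv layer maps L -> floor((L + 2*pad - kernel) / stride) + 1.
        length = math.floor((length + 2 * padding - kernel_size) / stride) + 1
    return length
```

With two stride-2 layers, a 100-frame feature sequence subsamples by roughly 4x, which is the usual reason audio encoders need such a helper to size their output buffers.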
The CI error is unrelated to this PR. Can we merge it? @mickqian
Motivation
Modifications
Accuracy Tests
Speed Tests and Profiling
Checklist
Review and Merge Process