Parakeet nemotron encoder #23568
Conversation
Code Review
This pull request introduces audio processing and dynamic resolution support for the Nemotron-VL model, including the integration of the Parakeet audio encoder and a utility for extracting audio from video. Key enhancements include temporal video compression via tubelet grouping and ragged packing for variable-sized images. Feedback identifies a critical bug in the extract_feature method where a missing projection layer causes a dimension mismatch, and a style violation regarding an inline import that should be moved to the top of the file.
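For readers unfamiliar with tubelet grouping, the temporal compression mentioned above can be sketched in miniature: consecutive frames are grouped into "tubelets" and pooled, shrinking the token count by the tubelet size. The function and parameter names below are illustrative, not the PR's actual API.

```python
def group_into_tubelets(frame_embeds, tubelet_size=2):
    """Sketch of tubelet grouping (hypothetical helper, not the PR's code).

    frame_embeds: list of per-frame embedding vectors (T x D, as lists).
    Returns one mean-pooled embedding per tubelet (ceil(T / tubelet_size) x D).
    """
    tubelets = []
    for start in range(0, len(frame_embeds), tubelet_size):
        group = frame_embeds[start : start + tubelet_size]
        dim = len(group[0])
        # Mean-pool the frames belonging to this tubelet along the time axis.
        pooled = [sum(frame[d] for frame in group) / len(group) for d in range(dim)]
        tubelets.append(pooled)
    return tubelets
```

With a tubelet size of 2, a 3-frame clip collapses to 2 tokens per spatial position, which is the kind of temporal saving the PR description refers to.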
def extract_feature(self, pixel_values):
    # Process images in a micro-batch of at most 128 frames per call
    # This is done on purpose to ensure peak GPU ram usage of huge batch
    # (namely for really long videos with EVS ON) won't cause any problems
    # as we don't support chunked prefill for video media
    micro_batch_size = 128
    n = pixel_values.shape[0]
    patch_size = self.config.patch_size
    h_patches = pixel_values.shape[-2] // patch_size
    w_patches = pixel_values.shape[-1] // patch_size
    vit_embeds_list = []
    for i in range(0, n, micro_batch_size):
-       vit_embeds = self.vision_model(pixel_values[i : i + micro_batch_size])
-       vit_embeds = vit_embeds.to(dtype=torch.bfloat16)
-       h = w = int(vit_embeds.shape[1] ** 0.5)
-       vit_embeds = vit_embeds.reshape(vit_embeds.shape[0], h, w, -1)
+       chunk = pixel_values[i : i + micro_batch_size]
+       batch_size = chunk.shape[0]
+       vit_embeds = self.vision_model(chunk)
+       vit_embeds = vit_embeds.to(dtype=self.model_dtype)
+       vit_embeds = vit_embeds.reshape(batch_size, h_patches, w_patches, -1)
        vit_embeds = self.pixel_shuffle(
            vit_embeds, scale_factor=self.downsample_ratio
        )
        vit_embeds = vit_embeds.view(-1, self.rmsnorm_hidden_size)
        vit_embeds = self.mlp1(vit_embeds)
-       vit_embeds = vit_embeds.view(n, -1, self.rmsnorm_hidden_size)
+       vit_embeds = vit_embeds.view(batch_size, -1, self.llm_hidden_size)
        vit_embeds_list.append(vit_embeds)
    vit_embeds = torch.cat(vit_embeds_list, dim=0)
    return vit_embeds
The extract_feature method is missing the application of the mlp1 projection layer. This will result in features with incorrect dimensions (rmsnorm_hidden_size instead of llm_hidden_size), which could lead to runtime errors or incorrect model behavior. Other feature extraction methods in this file, like extract_feature_dynamic and extract_video_feature_temporal, correctly apply this projection. This method should be updated to include the mlp1 projection to ensure feature dimensions are correct.
def extract_feature(self, pixel_values):
    micro_batch_size = 128
    n = pixel_values.shape[0]
    patch_size = self.config.patch_size
    h_patches = pixel_values.shape[-2] // patch_size
    w_patches = pixel_values.shape[-1] // patch_size
    vit_embeds_list = []
    for i in range(0, n, micro_batch_size):
        chunk = pixel_values[i : i + micro_batch_size]
        batch_size = chunk.shape[0]
        vit_embeds = self.vision_model(chunk)
        vit_embeds = vit_embeds.to(dtype=self.model_dtype)
        vit_embeds = vit_embeds.reshape(batch_size, h_patches, w_patches, -1)
        vit_embeds = self.pixel_shuffle(
            vit_embeds, scale_factor=self.downsample_ratio
        )
        vit_embeds = vit_embeds.view(-1, self.rmsnorm_hidden_size)
        vit_embeds = self.mlp1(vit_embeds)
        vit_embeds = vit_embeds.view(batch_size, -1, self.llm_hidden_size)
        vit_embeds_list.append(vit_embeds)
    vit_embeds = torch.cat(vit_embeds_list, dim=0)
    return vit_embeds
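As a quick sanity check on the shape flow in this method, the per-image token count can be worked out arithmetically. The numbers below are illustrative defaults, not the PR's actual config values.

```python
def num_vision_tokens(height, width, patch_size=14, downsample_ratio=0.5):
    """Hedged sketch of the token-count arithmetic (illustrative defaults).

    A patch_size-14 ViT on a 448x448 image yields 32x32 patches; pixel
    shuffle with downsample_ratio 0.5 trades spatial resolution for
    channels, leaving 16x16 = 256 tokens per image.
    """
    h_patches = height // patch_size
    w_patches = width // patch_size
    # Pixel shuffle scales each spatial dimension by downsample_ratio and
    # folds the removed positions into the channel dimension.
    return int(h_patches * downsample_ratio) * int(w_patches * downsample_ratio)
```

This is why the final `view(batch_size, -1, self.llm_hidden_size)` is well defined: the flattened token dimension is fully determined by the patch grid and the downsample ratio.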
def _subsampling_output_length(self, length: int) -> int:
    import math
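On the inline-import style point flagged in the review summary: a conv-subsampling length helper typically looks roughly like the sketch below, with `math` imported at module level rather than inside the function. The kernel, stride, and padding values here are assumptions for illustration; the Parakeet encoder's actual parameters are not shown in this diff.

```python
import math


def subsampling_output_length(length, num_layers=2, kernel_size=3, stride=2, padding=1):
    """Hedged sketch of a conv-subsampling output-length formula
    (illustrative parameters, not the PR's actual encoder config)."""
    for _ in range(num_layers):
        # Each strided conv layer maps L -> floor((L + 2*pad - kernel) / stride) + 1.
        length = math.floor((length + 2 * padding - kernel_size) / stride) + 1
    return length
```

With two stride-2 layers, a 100-frame feature sequence subsamples by roughly 4x, which is the usual reason audio encoders need such a helper to size their output buffers.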
The CI error is unrelated to this PR. Can we merge it? @mickqian
Motivation
Modifications
Accuracy Tests
Speed Tests and Profiling
Checklist
Review and Merge Process