Conversation
Add tokenizer auto-fix param in TVPProcessor, add docs, clean up comments, enable different torch dtypes, add image processor and model tests, and fix code style
cc @rafaelpadilla for first review :)
docs/source/en/model_doc/tvp.md (Outdated)

> ## Overview
>
> The TVP model was proposed in [Text-Visual Prompting for Efficient 2D Temporal Video Grounding](https://arxiv.org/abs/2209.14156) by Yimeng Zhang, Xin Chen, Jinghan Jia, Sijia Liu, Ke Ding. The goal of
> this model is to incorporate trainable prompts into both visual inputs and textual features to temporal video grounding(TVG) problems, so we need no more compute-heavy 3D visual encoders and still get SOTA

Suggested change:

```diff
-this model is to incorporate trainable prompts into both visual inputs and textual features to temporal video grounding(TVG) problems, so we need no more compute-heavy 3D visual encoders and still get SOTA
+this model is to incorporate trainable prompts into both visual inputs and textual features to temporal video grounding (TVG) problems, so we need no more compute-heavy 3D visual encoders and still get SOTA
```
docs/source/en/model_doc/tvp.md (Outdated)

> Returns:
>     frames (tensor): decoded frames from the video.
> """
> assert clip_idx >= -2, "Not valied clip_idx {}".format(clip_idx)

Suggested change:

```diff
-assert clip_idx >= -2, "Not valied clip_idx {}".format(clip_idx)
+assert clip_idx >= -2, "Not a valid clip_idx {}".format(clip_idx)
```
Hi @rafaelpadilla . Thanks for your review! I have made the Intel/demo dataset public, so now you can try again. I also fixed the two grammar errors. However, I think it would be better to keep test_image_processing_tvp.py the same, because bridgetower, clip, and videomae do the same thing, and I could not ensure that
Hi @rafaelpadilla . I added some examples from Charades to my demo datasets; the corresponding text can be seen here. You can run the script in this PR and replace the video and text to check the performance of this model.
amyeroberts
left a comment
Thanks for adding this model!
I've done an initial pass, highlighting the main structural parts that stood out and should be revised so that the model is in line with the rest of the transformers library. In particular:

- All custom layers should have the model prefix for their name
- The model prefix should be in camel case, i.e. `Tvp`
- All custom layers should take the config as their first argument in their init and use that to instantiate themselves as much as possible
- All parameters should have clear, explicit names, e.g. `batch_size` over `bsz`, `height` instead of `h`
- All models importable from the top level of transformers, i.e. `from transformers import TvpForXxx`, should have docstrings for their forward method
- Freezing parameters is not something that should be configurable from the config - rather we let users control that after loading the model if they desire.

Overall the code looks nice and clean - we'll probably just have a few rounds of review to make sure the model is library-compatible.
> @@ -0,0 +1,136 @@
> # coding=utf-8
> # Copyright 2022 The HuggingFace Inc. team.

Suggested change:

```diff
-# Copyright 2022 The HuggingFace Inc. team.
+# Copyright 2023 The HuggingFace Inc. team.
```
> ]
>
> def mish(x):

Is there a difference between this implementation and the one in the ACT2FN implemented in activations.py?
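For context, Mish is defined as `x * tanh(softplus(x))`. A small torch-free sketch of that formula (illustrative only — the PR's actual implementation is not reproduced here, so treat the names as assumptions):

```python
import math

def softplus(x):
    # softplus(x) = ln(1 + e^x), computed stably via log1p
    return math.log1p(math.exp(x))

def mish(x):
    # Mish activation: x * tanh(softplus(x))
    return x * math.tanh(softplus(x))

# Mish passes through 0, approaches the identity for large positive
# inputs, and decays to ~0 for large negative inputs.
print(round(mish(0.0), 6))
print(mish(10.0) > 9.99)
print(abs(mish(-10.0)) < 1e-3)
```

If the `ACT2FN["mish"]` entry in activations.py implements the same formula, the two versions should be numerically interchangeable.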
> pipeline_model_mapping = (
>     {
>         "temporal-video-grounding": TVPModel,
>     }
>     if is_torch_available()
>     else {}
> )

nit: can be put on one line

```diff
-pipeline_model_mapping = (
-    {
-        "temporal-video-grounding": TVPModel,
-    }
-    if is_torch_available()
-    else {}
-)
+pipeline_model_mapping = {"temporal-video-grounding": TVPModel} if is_torch_available() else {}
```
> return out
>
> class TVPVisionBackbone(nn.Module):

Is this backbone a standard resnet? In this case it should be loaded using our AutoBackbone API. There's a ResNetBackbone already implemented.
Hi @amyeroberts .
I am willing to use ResNetBackbone because it is more convenient. However, the backbone of the original pre-trained TVP checkpoint is slightly different from ResNetBackbone. See the following image of the TVP backbone:

[screenshot: TVP backbone architecture]

We can see that the stride is different. Moreover, ResNetBackbone does not support controlling the stride in ResNetEncoder.
There is a possible way: we can change the ResNet model to support manipulating the stride in the ResNet backbone. Otherwise, we can only keep the TVP backbone in our source code.
Would like to hear your opinion. Thx!
You can probably use:

```python
class ResNetEncoder(nn.Module):
    def __init__(self, config: ResNetConfig):
        super().__init__()
        self.stages = nn.ModuleList([])
        # based on `downsample_in_first_stage` the first layer of the first stage may or may not downsample the input
        self.stages.append(
            ResNetStage(
                config,
                config.embedding_size,
                config.hidden_sizes[0],
                stride=2 if config.downsample_in_first_stage else 1,
                depth=config.depths[0],
            )
        )
        in_out_channels = zip(config.hidden_sizes, config.hidden_sizes[1:])
        for (in_channels, out_channels), depth in zip(in_out_channels, config.depths[1:]):
            self.stages.append(ResNetStage(config, in_channels, out_channels, depth=depth))
```

so setting config.downsample_in_first_stage to the appropriate value, no?
Hi @ArthurZucker . Thanks for your advice, but the differences still remain even if I set downsample_in_first_stage=True in config.json. Besides, the differences start from the second stage, and downsample_in_first_stage only controls the first stage's stride.
> self.stages_and_names = []
> for i, blocks in enumerate(stages):
>     for block in blocks:
>         assert isinstance(block, TVPVisionBlockBase), block

We don't have asserts in our modeling code - this should be removed or made an exception

> assert len(self._out_features)
> children = [x[0] for x in self.named_children()]
> for out_feature in self._out_features:
>     assert out_feature in children, "Available children: {}".format(", ".join(children))
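Such asserts could be rewritten as explicit exceptions, e.g. in this illustrative sketch (the helper name and error wording are assumptions, not the PR's final code):

```python
def check_out_features(out_features, children):
    # Raise instead of assert so the check survives `python -O`
    # and surfaces a proper, catchable error type to users.
    if not out_features:
        raise ValueError("`out_features` must not be empty.")
    for out_feature in out_features:
        if out_feature not in children:
            raise ValueError(
                f"Invalid out_feature {out_feature!r}. Available children: {', '.join(children)}"
            )

check_out_features(["stage2"], ["stage1", "stage2", "stage3"])  # passes silently
try:
    check_out_features(["stage9"], ["stage1", "stage2"])
except ValueError as e:
    print("ValueError:", e)
```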
> if config.backbone_freeze_at >= stage_idx:
>     for block in blocks:
>         block.freeze()

Suggested change (delete these lines):

```diff
-if config.backbone_freeze_at >= stage_idx:
-    for block in blocks:
-        block.freeze()
```
> resnets_depth (`int`, *optional*, defaults to 50):
>     The depth of resnets model which is the feature extractor.
> resnets_num_groups (`int`, *optional*, defaults to 1):
>     The groups of torch.nn.Conv2d.
> resnets_width_per_group (`int`, *optional*, defaults to 64):
>     The immediate conv layer channels per group.
> resnets_stem_input_channels (`int`, *optional*, defaults to 3):
>     The input channels of stem module in resnet.
> resnets_stem_out_channels (`int`, *optional*, defaults to 64):
>     The output channels of stem module and also the input channels of resnet stages.
> resnets_res_out_channels (`int`, *optional*, defaults to 256):
>     The output channels of resnet.
> resnets_res_dilation (`int`, *optional*, defaults to 1):

The feature extractor / backbone should be loaded using the Backbone API with a backbone config, e.g. like this for DETR.
Hi @amyeroberts . I have fixed most issues; only #25856 (comment) remains to be discussed. Once this issue is done, we can train a small TVP model and start to change the tests. Thanks for your support!
Hey! @amyeroberts will be OOO for a while 😉 feel free to ping me or @rafaelpadilla if you need help
Hi, @ArthurZucker . I have 3 plans for this issue since
Which plan do you prefer? Would like to hear your opinion. Thx! BTW, I prefer option 3 because it is more convenient; we just need to add the following code:

```python
self.backbone = AutoBackbone.from_config(backbone_config)
if backbone_config.model_type == "resnet" and backbone_config.layer_type == "bottleneck":
    for i in range(1, len(backbone_config.depths)):
        self.backbone.encoder.stages[i].layers[0].layer[0].convolution.stride = (2, 2)
        self.backbone.encoder.stages[i].layers[0].layer[1].convolution.stride = (1, 1)
```

And I also see detr has the same operation that manipulates some layers in the backbone, so I think it is a good way.
Hey! The list of models that have a backbone is the following:

```python
[
    # Backbone mapping
    ("bit", "BitBackbone"),
    ("convnext", "ConvNextBackbone"),
    ("convnextv2", "ConvNextV2Backbone"),
    ("dinat", "DinatBackbone"),
    ("dinov2", "Dinov2Backbone"),
    ("focalnet", "FocalNetBackbone"),
    ("maskformer-swin", "MaskFormerSwinBackbone"),
    ("nat", "NatBackbone"),
    ("resnet", "ResNetBackbone"),
    ("swin", "SwinBackbone"),
    ("timm_backbone", "TimmBackbone"),
    ("vitdet", "VitDetBackbone"),
]
```

Either there is one that fits your need, or if the changes involved are too big you can go with 3, as there is a precedent for it. Not a super fan of 3, and I believe 2 would be better if possible!
Hi @ArthurZucker . Thanks for your advice. Option 2 is now ready to be reviewed. Could you please have a look at 26374? Thx!

Sure, reviewing now!
Hi @ArthurZucker @amyeroberts , would you please review the new changes? Thx |
|
Hi @ArthurZucker @amyeroberts . Thanks for your review. Most of the comments have been resolved. There are still 2 main issues:
Other issues are related to model doc and arg comments, please see the unresolved conversations. Thx! |
|
Hi @ArthurZucker @amyeroberts . The CI seems OK, would you please have a look at these unresolved comments? Thx |
amyeroberts
left a comment
Thanks for iterating!
Mostly small comments - almost there!
> """
> return self.tokenizer.decode(*args, **kwargs)
>
> def post_process_video_grounding(self, outputs, video_durations):
docs/source/en/model_doc/tvp.md (Outdated)

> - This implementation of TVP uses [`BertTokenizer`] to generate text embeddings and Resnet-50 model to compute visual embeddings.
> - Checkpoints for pre-trained [tvp-base](https://huggingface.co/Intel/tvp-base) is released.
> - Please refer to [Table 2](https://arxiv.org/pdf/2303.04995.pdf) for TVP's performance on Temporal Video Grounding task.
> - The PyTorch version of this model is only available in torch 1.10 and higher.

We don't need this as only >= 1.10 is supported in this library

```diff
-- The PyTorch version of this model is only available in torch 1.10 and higher.
```
> losses = ["iou", "distance", "duration"]
> criterion = TvpLoss(losses)

Suggested change:

```diff
-losses = ["iou", "distance", "duration"]
-criterion = TvpLoss(losses)
+criterion = TvpLoss(["iou", "distance", "duration"])
```
> return (
>     TvpImageProcessor.from_pretrained(
>         "Jiqing/tiny-random-tvp",
>     )
>     if is_vision_available()
>     else None
> )

Suggested change:

```diff
-return (
-    TvpImageProcessor.from_pretrained(
-        "Jiqing/tiny-random-tvp",
-    )
-    if is_vision_available()
-    else None
-)
+return TvpImageProcessor.from_pretrained("Jiqing/tiny-random-tvp") if is_vision_available() else None
```
docs/source/en/model_doc/tvp.md (Outdated)

> *In this paper, we study the problem of temporal video grounding (TVG), which aims to predict the starting/ending time points of moments described by a text sentence within a long untrimmed video. Benefiting from fine-grained 3D visual features, the TVG techniques have achieved remarkable progress in recent years. However, the high complexity of 3D convolutional neural networks (CNNs) makes extracting dense 3D visual features time-consuming, which calls for intensive memory and computing resources. Towards efficient TVG, we propose a novel text-visual prompting (TVP) framework, which incorporates optimized perturbation patterns (that we call ‘prompts’) into both visual inputs and textual features of a TVG model. In sharp contrast to 3D CNNs, we show that TVP allows us to effectively co-train vision encoder and language encoder in a 2D TVG model and improves the performance of cross-modal feature fusion using only low-complexity sparse 2D visual features. Further, we propose a Temporal-Distance IoU (TDIoU) loss for efficient learning of TVG. Experiments on two benchmark datasets, Charades-STA and ActivityNet Captions datasets, empirically show that the proposed TVP significantly boosts the performance of 2D TVG (e.g., 9.79% improvement on Charades-STA and 30.77% improvement on ActivityNet Captions) and achieves 5× inference acceleration over TVG using 3D visual features.*
>
> TVP framework is an effective and efficient framework to train time-efficient 2D TVG models, in which we leverage TVP (text-visual prompting) to improve the utility of sparse 2D visual features without resorting to costly 3D features. To the best of our knowledge, it is the first work to expand the application of prompt learning for resolving TVG problems.

Yes - that's great, thank you :) Could you add this to the model page? Searching, I can only find this text in this paragraph.
> self.all_head_size = self.attention_head_size * self.num_attention_heads
> self.pruned_heads = self.pruned_heads.union(heads)
>
> def _shape(self, tensor: torch.Tensor, sequence_length: int, batch_size: int):

I realise this is copied from other parts of the library - but let's take the opportunity to use a better name here.

```diff
-def _shape(self, tensor: torch.Tensor, sequence_length: int, batch_size: int):
+def _reshape(self, tensor: torch.Tensor, sequence_length: int, batch_size: int):
```

As tensor.shape returns the shape of the tensor, tensor.reshape reshapes the tensor.
> attention_output = self.dense(outputs[0])
> attention_output = self.dropout(attention_output)
> attention_output = self.layer_norm(attention_output + hidden_states)
> outputs = (attention_output,) + outputs[1:]  # add attentions if we output them

I don't understand this line. On L363 you create a tuple, and then in this line you always throw away the first element of the tuple. I think you're wanting something like this:

```diff
-outputs = (attention_output,) + outputs[1:]  # add attentions if we output them
+attn_output = self.dense(attn_output)
+attn_output = self.dropout(attn_output)
+attn_output = self.layer_norm(attn_output + hidden_states)
+# add attentions if we output them
+outputs = (attn_output, attention_probs) if output_attentions else (attn_output,)
```
> if config.hidden_size % config.num_attention_heads != 0 and not hasattr(config, "embedding_size"):
>     raise ValueError(
>         "The hidden size (%d) is not a multiple of the number of attention "
>         "heads (%d)" % (config.hidden_size, config.num_attention_heads)

We use f-strings rather than old-style % formatting.
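For illustration, the same check written with an f-string might look like the following sketch (the example values are assumptions, not from the PR):

```python
hidden_size, num_attention_heads = 768, 7  # example values that fail the check

if hidden_size % num_attention_heads != 0:
    # f-string interpolates the values directly, no % operator needed
    message = (
        f"The hidden size ({hidden_size}) is not a multiple of the number "
        f"of attention heads ({num_attention_heads})"
    )
    print(message)
```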
There was a recent update to the model page docs to standardise the structure - #26876. This page should be updated to reflect those changes.
docs/source/en/model_doc/tvp.md (Outdated)

```python
def get_resize_size(image, max_size):
    '''
    Args:
        image: np.ndarray
        max_size: The max size of height and width
    Returns:
        (height, width)
        Note the height/width order difference
        >>> pil_img = Image.open("raw_img_tensor.jpg")
        >>> pil_img.size
        (640, 480)  # (width, height)
        >>> np_img = np.array(pil_img)
        >>> np_img.shape
        (480, 640, 3)  # (height, width, 3)
    '''
    height, width = image.shape[-2:]
    if height >= width:
        ratio = width * 1.0 / height
        new_height = max_size
        new_width = new_height * ratio
    else:
        ratio = height * 1.0 / width
        new_width = max_size
        new_height = new_width * ratio
    size = {"height": int(new_height), "width": int(new_width)}
    return size
```

This should be part of the image processor e.g. like here for Vilt
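For reference, the aspect-ratio-preserving resize logic can be exercised standalone; this sketch takes explicit `height`/`width` instead of an image array (an assumed simplification of the quoted helper):

```python
def get_resize_size(height, width, max_size):
    # Scale the longer side to `max_size`, preserving aspect ratio,
    # and truncate the other side to an integer.
    if height >= width:
        new_height = max_size
        new_width = max_size * width / height
    else:
        new_width = max_size
        new_height = max_size * height / width
    return {"height": int(new_height), "width": int(new_width)}

print(get_resize_size(480, 640, 448))  # landscape frame: width becomes 448
print(get_resize_size(640, 480, 448))  # portrait frame: height becomes 448
```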
Hi @ArthurZucker @amyeroberts . Thanks for your review; I am now clear about all the comments and have fixed all of them. Would you please review these changes? Thx!
amyeroberts
left a comment
Beautiful ❤️ Thanks for all the work iterating and implementing this!
There's some tiny nits but otherwise all looks good!
> @property
> def model_input_names(self):
>     return ["input_ids", "attention_mask", "pixel_values"]

This should be taken dynamically from the tokenizer and image processor e.g. like here
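The dynamic pattern used by similar processors in the library looks roughly like this; the sketch below uses stand-in tokenizer/image-processor objects, and the exact attribute layout is an assumption based on comparable processors:

```python
class FakeTokenizer:
    model_input_names = ["input_ids", "attention_mask"]

class FakeImageProcessor:
    model_input_names = ["pixel_values"]

class Processor:
    def __init__(self, tokenizer, image_processor):
        self.tokenizer = tokenizer
        self.image_processor = image_processor

    @property
    def model_input_names(self):
        # Merge the components' input names, dropping duplicates
        # while preserving insertion order.
        names = self.tokenizer.model_input_names + self.image_processor.model_input_names
        return list(dict.fromkeys(names))

proc = Processor(FakeTokenizer(), FakeImageProcessor())
print(proc.model_input_names)
```

This way the processor stays correct even if the tokenizer or image processor changes its own input names.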
> max_grid_col_position_embeddings (`int`, *optional*, defaults to 100):
>     The maximum column sequence length of visual position embeddings.
> max_grid_row_position_embeddings (`int`, *optional*, defaults to 100):
>     The maximum row sequence length of visual position embeddings.

It would be good to explain what this variable means wrt the images. For example, max_grid_row_positions I would infer as being the largest number of horizontal patches from a video frame.
> if self.visual_prompter_apply != "add":
>     visual_prompt_mask = torch.ones([self.max_img_size, self.max_img_size], dtype=pixel_values.dtype)
>     pixel_values *= visual_prompt_mask
> if self.visual_prompter_apply != "remove":

- Reasoning about negative conditions is harder than positive ones.
- Raising an exception makes sure invalid values don't cause execution as if they were "replace", and provides documentation of the expected values.

The exception could also be moved to the __init__ instead.

```diff
-if self.visual_prompter_apply != "add":
-    visual_prompt_mask = torch.ones([self.max_img_size, self.max_img_size], dtype=pixel_values.dtype)
-    pixel_values *= visual_prompt_mask
-if self.visual_prompter_apply != "remove":
+if self.visual_prompter_apply not in ("add", "remove", "replace"):
+    raise ValueError(f"Invalid visual_prompter_apply value {self.visual_prompter_apply}")
+if self.visual_prompter_apply in ("replace", "remove"):
+    visual_prompt_mask = torch.ones([self.max_img_size, self.max_img_size], dtype=pixel_values.dtype)
+    pixel_values *= visual_prompt_mask
+if self.visual_prompter_apply in ("replace", "add"):
```
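A minimal torch-free sketch of the positive-condition pattern (the function and the scalar "mask" stand-in are assumptions for illustration, not the model's actual tensor logic):

```python
VALID_MODES = ("add", "remove", "replace")

def apply_prompt(mode, pixel, prompt):
    # Validate up front so an unexpected mode fails loudly
    # instead of silently behaving like "replace".
    if mode not in VALID_MODES:
        raise ValueError(f"Invalid mode {mode!r}; expected one of {VALID_MODES}")
    value = pixel
    if mode in ("replace", "remove"):
        value = 0  # zero out the pixel (stand-in for multiplying by a mask)
    if mode in ("replace", "add"):
        value = value + prompt
    return value

print(apply_prompt("add", 5, 2))      # keep pixel, add prompt
print(apply_prompt("remove", 5, 2))   # mask pixel, no prompt
print(apply_prompt("replace", 5, 2))  # mask pixel, then add prompt
```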
@jiqing-feng For the quality checks, there was a recent update and we now use
Hi @ArthurZucker @amyeroberts . I think all comments have been fixed, would you please review these changes? Thx! |
|
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. |
|
Thanks for this hard work! Congrats on the merge! |

The TVP model was proposed in Text-Visual Prompting for Efficient 2D Temporal Video Grounding by Yimeng Zhang, Xin Chen, Jinghan Jia, Sijia Liu, Ke Ding. The goal of this model is to incorporate trainable prompts into both visual inputs and textual features for temporal video grounding (TVG) problems. It was introduced in this paper.

The current model hub has Intel/tvp-base and Intel/tvp-base-ANet.

BTW, the failed checks are weird because I can successfully run `pytest` under `tests/models/tvp` locally.