
TVP model #25856

Merged
amyeroberts merged 64 commits into huggingface:main from jiqing-feng:tvp
Nov 21, 2023

Conversation


@jiqing-feng jiqing-feng commented Aug 30, 2023

The TVP model was proposed in Text-Visual Prompting for Efficient 2D Temporal Video Grounding by Yimeng Zhang, Xin Chen, Jinghan Jia, Sijia Liu, and Ke Ding. The goal of this model is to incorporate trainable prompts into both visual inputs and textual features for temporal video grounding (TVG) problems.

The current model hub has Intel/tvp-base and Intel/tvp-base-ANet.

BTW, the failed checks are weird because I can successfully run pytest under tests/models/tvp locally.

add tokenizer auto

fix param in TVPProcessor

add docs

clear comments and enable different torch dtype

add image processor test and model test and fix code style
@amyeroberts

cc @rafaelpadilla for first review :)


@rafaelpadilla rafaelpadilla left a comment


Code looks good to me. Just small nits.
I couldn't run the example as Intel/tvp_demo seems to be unavailable.
I also couldn't make a detailed comparison with the original repo, as it doesn't seem to be referenced/available yet (noted as [here](xxx)).

## Overview

The TVP model was proposed in [Text-Visual Prompting for Efficient 2D Temporal Video Grounding](https://arxiv.org/abs/2209.14156) by Yimeng Zhang, Xin Chen, Jinghan Jia, Sijia Liu, Ke Ding. The goal of
this model is to incorporate trainable prompts into both visual inputs and textual features to temporal video grounding(TVG) problems, so we need no more compute-heavy 3D visual encoders and still get SOTA

Suggested change
this model is to incorporate trainable prompts into both visual inputs and textual features to temporal video grounding(TVG) problems, so we need no more compute-heavy 3D visual encoders and still get SOTA
this model is to incorporate trainable prompts into both visual inputs and textual features to temporal video grounding (TVG) problems, so we need no more compute-heavy 3D visual encoders and still get SOTA

Returns:
frames (tensor): decoded frames from the video.
"""
assert clip_idx >= -2, "Not valied clip_idx {}".format(clip_idx)

Suggested change
assert clip_idx >= -2, "Not valied clip_idx {}".format(clip_idx)
assert clip_idx >= -2, "Not a valid clip_idx {}".format(clip_idx)

@jiqing-feng

Hi @rafaelpadilla . Thanks for your review!

I have made the Intel/demo dataset public, so now you can try again. I also fixed the two grammar errors.

However, I think it would be better to keep test_image_processing_tvp.py the same, because bridgetower, clip, and videomae do the same thing, and I cannot guarantee that image_utils.OPENAI_CLIP_MEAN will not change.


jiqing-feng commented Sep 5, 2023

Hi @rafaelpadilla . I added some examples from Charades to my demo dataset; the corresponding text can be seen here. You can run the script in this PR, replacing the video and text, to check the performance of this model.


@amyeroberts amyeroberts left a comment


Thanks for adding this model!

I've done an initial pass, highlighting the main structural parts that stood out and should be revised so that the model is in line with the rest of the transformers library. In particular:

  • All custom layers should have the model prefix in their name
  • The model prefix should be in camel case, i.e. Tvp
  • All custom layers should take the config as their first argument in their init and use it to instantiate themselves as much as possible
  • All parameters should have clear, explicit names, e.g. batch_size over bsz, height instead of h
  • All models importable from the top level of transformers, i.e. from transformers import TvpForXxx, should have docstrings for their forward method
  • Freezing parameters is not something that should be configurable from the config - rather, we let users control that after loading the model if they desire.

Overall the code looks nice and clean - we'll probably just have a few rounds of review to make sure the model is library-compatible.
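To make the conventions above concrete, here is a hypothetical toy layer that follows them; `TvpToyConfig` and `TvpToyLayer` are illustrative names for this sketch, not the actual TVP code:

```python
import torch
from torch import nn


class TvpToyConfig:
    # Hypothetical minimal config; the real model would use a TvpConfig.
    def __init__(self, hidden_size=8):
        self.hidden_size = hidden_size


class TvpToyLayer(nn.Module):
    # Camel-case "Tvp" prefix; config is the first init argument and drives all sizes.
    def __init__(self, config):
        super().__init__()
        self.dense = nn.Linear(config.hidden_size, config.hidden_size)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # Explicit names: batch_size / sequence_length rather than bsz / seq_len.
        batch_size, sequence_length, _ = hidden_states.shape
        return self.dense(hidden_states)
```

A layer written this way can be instantiated from the config alone, which is what makes `from_pretrained`-style loading work uniformly across the library.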

@@ -0,0 +1,136 @@
# coding=utf-8
# Copyright 2022 The HuggingFace Inc. team.

Suggested change
# Copyright 2022 The HuggingFace Inc. team.
# Copyright 2023 The HuggingFace Inc. team.

]


def mish(x):

Is there a difference between this implementation and the one in the ACT2FN implemented in activations.py?
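For reference, Mish is defined as x · tanh(softplus(x)); a pure-Python sketch of that formula (the ACT2FN mapping in activations.py should provide an equivalent torch implementation):

```python
import math


def softplus(x: float) -> float:
    # softplus(x) = ln(1 + e^x); log1p keeps precision for small e^x
    return math.log1p(math.exp(x))


def mish(x: float) -> float:
    # Mish activation: x * tanh(softplus(x))
    return x * math.tanh(softplus(x))
```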

Comment on lines +224 to +230
pipeline_model_mapping = (
    {
        "temporal-video-grounding": TVPModel,
    }
    if is_torch_available()
    else {}
)

nit: can be put on one line

Suggested change
pipeline_model_mapping = (
    {
        "temporal-video-grounding": TVPModel,
    }
    if is_torch_available()
    else {}
)
pipeline_model_mapping = {"temporal-video-grounding": TVPModel} if is_torch_available() else {}

return out


class TVPVisionBackbone(nn.Module):

Is this backbone a standard resnet? If so, it should be loaded using our AutoBackbone API. There's a ResNetBackbone already implemented.

@jiqing-feng

Hi @amyeroberts .

I am willing to use ResNetBackbone because it is more convenient. However, the backbone of the original pre-trained TVP checkpoint is slightly different from ResNetBackbone.
See the following images:
TVP backbone: [screenshot]

ResNet backbone: [screenshot]

We can see that the stride is different. Moreover, ResNetBackbone does not support controlling the stride in ResNetEncoder.

One possible way is to change the ResNet model to support manipulating the stride in the ResNet backbone. Otherwise, we can only keep the TVP backbone in our source code.

I'd like to hear your opinion. Thx!

@ArthurZucker

You can probably use:

class ResNetEncoder(nn.Module):
    def __init__(self, config: ResNetConfig):
        super().__init__()
        self.stages = nn.ModuleList([])
        # based on `downsample_in_first_stage` the first layer of the first stage may or may not downsample the input
        self.stages.append(
            ResNetStage(
                config,
                config.embedding_size,
                config.hidden_sizes[0],
                stride=2 if config.downsample_in_first_stage else 1,
                depth=config.depths[0],
            )
        )
        in_out_channels = zip(config.hidden_sizes, config.hidden_sizes[1:])
        for (in_channels, out_channels), depth in zip(in_out_channels, config.depths[1:]):
            self.stages.append(ResNetStage(config, in_channels, out_channels, depth=depth))

So setting config.downsample_in_first_stage to the appropriate value should work, no?

@jiqing-feng jiqing-feng Sep 19, 2023

Hi @ArthurZucker . Thanks for your advice, but the differences remain even if I set downsample_in_first_stage=True in config.json. Besides, the differences start from the second stage, and downsample_in_first_stage only controls the first stage's stride.

self.stages_and_names = []
for i, blocks in enumerate(stages):
    for block in blocks:
        assert isinstance(block, TVPVisionBlockBase), block

We don't have asserts in our modeling code - this should be removed or made an exception
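A minimal sketch of the suggested pattern, replacing a bare assert with an explicit exception; `require_type` is an illustrative helper for this sketch, not library code:

```python
def require_type(value, expected_type):
    # Explicit exception instead of `assert`: it survives `python -O`
    # (which strips asserts) and gives callers a defined error type to catch.
    if not isinstance(value, expected_type):
        raise TypeError(
            f"Expected {expected_type.__name__}, got {type(value).__name__}"
        )
```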

Comment on lines +383 to +386
assert len(self._out_features)
children = [x[0] for x in self.named_children()]
for out_feature in self._out_features:
    assert out_feature in children, "Available children: {}".format(", ".join(children))
Same here for asserts

Comment on lines +472 to +474
if config.backbone_freeze_at >= stage_idx:
    for block in blocks:
        block.freeze()
Suggested change
if config.backbone_freeze_at >= stage_idx:
    for block in blocks:
        block.freeze()

Comment on lines +32 to +44
resnets_depth (`int`, *optional*, defaults to 50):
    The depth of the resnet model used as the feature extractor.
resnets_num_groups (`int`, *optional*, defaults to 1):
    The number of groups for torch.nn.Conv2d.
resnets_width_per_group (`int`, *optional*, defaults to 64):
    The intermediate conv layer channels per group.
resnets_stem_input_channels (`int`, *optional*, defaults to 3):
    The input channels of the stem module in resnet.
resnets_stem_out_channels (`int`, *optional*, defaults to 64):
    The output channels of the stem module, which are also the input channels of the resnet stages.
resnets_res_out_channels (`int`, *optional*, defaults to 256):
    The output channels of resnet.
resnets_res_dilation (`int`, *optional*, defaults to 1):

The feature extractor / backbone should be loaded using the Backbone API using a backbone config e.g. like this for DETR.

@jiqing-feng

Hi @amyeroberts . I have fixed most issues, only #25856 (comment) remains to be discussed. Once this issue is done, we can train a small TVP model and start to change the tests.

Thanks for your support!


ArthurZucker commented Sep 18, 2023

Hey! @amyeroberts will be OOO for a while 😉 feel free to ping me or @rafaelpadilla if you need help


jiqing-feng commented Sep 19, 2023

Hi, @ArthurZucker . I have 3 plans for this issue since downsample_in_first_stage does not work.

  1. Keep the TVP backbone as it has differences from the original ResNet backbone.
  2. Change the ResNet backbone API to support manipulating stride in some layers.
  3. Read the ResNet backbone first, and then change some conv layers' stride in the TVP initialization stage.

Which plan do you prefer? I'd like to hear your opinion. Thx!

BTW, I prefer option 3 because it is more convenient; we just need to add the following code:

self.backbone = AutoBackbone.from_config(backbone_config)
if backbone_config.model_type == "resnet" and backbone_config.layer_type == "bottleneck":
    for i in range(1, len(backbone_config.depths)):
        self.backbone.encoder.stages[i].layers[0].layer[0].convolution.stride = (2, 2)
        self.backbone.encoder.stages[i].layers[0].layer[1].convolution.stride = (1, 1)

I also see that detr performs the same operation, manipulating some layers in the backbone, so I think it is a reasonable approach.

@ArthurZucker

Hey! The list of models that have a backbone is the following:

    [
        # Backbone mapping
        ("bit", "BitBackbone"),
        ("convnext", "ConvNextBackbone"),
        ("convnextv2", "ConvNextV2Backbone"),
        ("dinat", "DinatBackbone"),
        ("dinov2", "Dinov2Backbone"),
        ("focalnet", "FocalNetBackbone"),
        ("maskformer-swin", "MaskFormerSwinBackbone"),
        ("nat", "NatBackbone"),
        ("resnet", "ResNetBackbone"),
        ("swin", "SwinBackbone"),
        ("timm_backbone", "TimmBackbone"),
        ("vitdet", "VitDetBackbone"),
    ]

Either there is one that fits your need, or if the changes involved are too big you can go with 3, as there is precedent for it. I'm not a super fan of 3, and I believe 2 would be better if possible!

@jiqing-feng


Hi @ArthurZucker . Thanks for your advice. Option 2 is now ready to be reviewed. Could you please have a look at #26374? Thx!

@ArthurZucker

Sure reviewing now!

@jiqing-feng

Hi @ArthurZucker @amyeroberts , would you please review the new changes? Thx


jiqing-feng commented Nov 9, 2023

Hi @ArthurZucker @amyeroberts . Thanks for your review. Most of the comments have been resolved. There are still 2 main issues:

  1. get_resize_size was a member function of TvpProcessor in the first version, but it was rejected by @amyeroberts , so I am not sure where this function should live.
  2. The sequence length in TVP consists of the text input ids length, the text prompt length, and the visual input length, so I need to change test_hidden_states and test_attention_outputs.

Other issues are related to the model doc and arg comments; please see the unresolved conversations. Thx!
BTW, the failed CI checks are not related to my changes.


jiqing-feng commented Nov 14, 2023

Hi @ArthurZucker @amyeroberts . The CI seems OK, would you please have a look at these unresolved comments? Thx


@amyeroberts amyeroberts left a comment


Thanks for iterating!

Mostly small comments - almost there!

"""
return self.tokenizer.decode(*args, **kwargs)

def post_process_video_grounding(self, outputs, video_durations):

Missing docstring

- This implementation of TVP uses [`BertTokenizer`] to generate text embeddings and Resnet-50 model to compute visual embeddings.
- Checkpoints for pre-trained [tvp-base](https://huggingface.co/Intel/tvp-base) is released.
- Please refer to [Table 2](https://arxiv.org/pdf/2303.04995.pdf) for TVP's performance on Temporal Video Grounding task.
- The PyTorch version of this model is only available in torch 1.10 and higher.

We don't need this, as only torch >= 1.10 is supported in this library

Suggested change
- The PyTorch version of this model is only available in torch 1.10 and higher.

Comment on lines +858 to +859
losses = ["iou", "distance", "duration"]
criterion = TvpLoss(losses)

Suggested change
losses = ["iou", "distance", "duration"]
criterion = TvpLoss(losses)
criterion = TvpLoss(["iou", "distance", "duration"])

Comment on lines +380 to +386
return (
    TvpImageProcessor.from_pretrained(
        "Jiqing/tiny-random-tvp",
    )
    if is_vision_available()
    else None
)

Suggested change
return (
    TvpImageProcessor.from_pretrained(
        "Jiqing/tiny-random-tvp",
    )
    if is_vision_available()
    else None
)
return TvpImageProcessor.from_pretrained("Jiqing/tiny-random-tvp") if is_vision_available() else None


*In this paper, we study the problem of temporal video grounding (TVG), which aims to predict the starting/ending time points of moments described by a text sentence within a long untrimmed video. Benefiting from fine-grained 3D visual features, the TVG techniques have achieved remarkable progress in recent years. However, the high complexity of 3D convolutional neural networks (CNNs) makes extracting dense 3D visual features time-consuming, which calls for intensive memory and computing resources. Towards efficient TVG, we propose a novel text-visual prompting (TVP) framework, which incorporates optimized perturbation patterns (that we call ‘prompts’) into both visual inputs and textual features of a TVG model. In sharp contrast to 3D CNNs, we show that TVP allows us to effectively co-train vision encoder and language encoder in a 2D TVG model and improves the performance of cross-modal feature fusion using only low-complexity sparse 2D visual features. Further, we propose a Temporal-Distance IoU (TDIoU) loss for efficient learning of TVG. Experiments on two benchmark datasets, Charades-STA and ActivityNet Captions datasets, empirically show that the proposed TVP significantly boosts the performance of 2D TVG (e.g., 9.79% improvement on Charades-STA and 30.77% improvement on ActivityNet Captions) and achieves 5× inference acceleration over TVG using 3D visual features.*

TVP framework is an effective and efficient framework to train time-efficient 2D TVG models, in which we leverage TVP (text-visual prompting) to improve the utility of sparse 2D visual features without resorting to costly 3D features. To the best of our knowledge, it is the first work to expand the application of prompt learning for resolving TVG problems.

Yes - that's great, thank you :) Could you add this to the model page? Searching I can only find this text in this paragraph

self.all_head_size = self.attention_head_size * self.num_attention_heads
self.pruned_heads = self.pruned_heads.union(heads)

def _shape(self, tensor: torch.Tensor, sequence_length: int, batch_size: int):

I realise this is copied from other parts of the library - but let's take the opportunity to use a better name here.

Suggested change
def _shape(self, tensor: torch.Tensor, sequence_length: int, batch_size: int):
def _reshape(self, tensor: torch.Tensor, sequence_length: int, batch_size: int):

As tensor.shape returns the shape of the tensor, tensor.reshape reshapes the tensor.

attention_output = self.dense(outputs[0])
attention_output = self.dropout(attention_output)
attention_output = self.layer_norm(attention_output + hidden_states)
outputs = (attention_output,) + outputs[1:] # add attentions if we output them

I don't understand this line. On L363 you create a tuple, and then in this line you always throw away the first element of the tuple. I think you're wanting something like this:

Suggested change
outputs = (attention_output,) + outputs[1:]  # add attentions if we output them
attn_output = self.dense(attn_output)
attn_output = self.dropout(attn_output)
attn_output = self.layer_norm(attn_output + hidden_states)
# add attentions if we output them
outputs = (attn_output, attention_probs) if output_attentions else (attn_output,)

if config.hidden_size % config.num_attention_heads != 0 and not hasattr(config, "embedding_size"):
    raise ValueError(
        "The hidden size (%d) is not a multiple of the number of attention "
        "heads (%d)" % (config.hidden_size, config.num_attention_heads)

We use f-strings rather than old-style % formatting
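A sketch of the same check with an f-string; `check_head_divisibility` is an illustrative wrapper for this example, whereas the real code raises inline in the module's __init__:

```python
from types import SimpleNamespace


def check_head_divisibility(config):
    # Same validation as above, using an f-string instead of %-interpolation.
    if config.hidden_size % config.num_attention_heads != 0 and not hasattr(config, "embedding_size"):
        raise ValueError(
            f"The hidden size ({config.hidden_size}) is not a multiple of the number "
            f"of attention heads ({config.num_attention_heads})"
        )


# A minimal stand-in config for demonstration.
config = SimpleNamespace(hidden_size=12, num_attention_heads=3)
check_head_divisibility(config)  # passes silently
```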


There was a recent update to the model page docs to standardise the structure - #26876

This page should be updated to reflect those changes.

Comment on lines +114 to +134
def get_resize_size(image, max_size):
    """
    Args:
        image: np.ndarray
        max_size: The max size of height and width
    Returns:
        (height, width)

    Note the height/width order difference:
        >>> pil_img = Image.open("raw_img_tensor.jpg")
        >>> pil_img.size
        (640, 480)  # (width, height)
        >>> np_img = np.array(pil_img)
        >>> np_img.shape
        (480, 640, 3)  # (height, width, 3)
    """
    height, width = image.shape[-2:]
    if height >= width:
        ratio = width * 1.0 / height
        new_height = max_size
        new_width = new_height * ratio
    else:
        ratio = height * 1.0 / width
        new_width = max_size
        new_height = new_width * ratio
    size = {"height": int(new_height), "width": int(new_width)}
    return size

This should be part of the image processor e.g. like here for Vilt
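For clarity, the resize logic above boils down to scaling the longer side to max_size while keeping the aspect ratio; a dependency-free sketch (taking height/width directly instead of an array, purely for illustration):

```python
def get_resize_size(height: int, width: int, max_size: int) -> dict:
    # Scale the longer side to `max_size`, keeping the aspect ratio.
    if height >= width:
        new_height, new_width = max_size, int(max_size * width / height)
    else:
        new_width, new_height = max_size, int(max_size * height / width)
    return {"height": new_height, "width": new_width}
```

For example, a 480x640 (height x width) frame with max_size=448 resizes to 336x448.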

@jiqing-feng

Hi @ArthurZucker @amyeroberts . Thanks for your review. All the comments are clear to me now, and I have fixed all of them. Would you please review these changes? Thx!


@amyeroberts amyeroberts left a comment


Beautiful ❤️ Thanks for all the work iterating and implementing this!

There's some tiny nits but otherwise all looks good!


@property
def model_input_names(self):
    return ["input_ids", "attention_mask", "pixel_values"]

This should be taken dynamically from the tokenizer and image processor e.g. like here
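A hedged sketch of the dynamic pattern the reviewer links to; `TvpProcessorSketch` and the stand-in components are illustrative, not the merged implementation:

```python
from types import SimpleNamespace


class TvpProcessorSketch:
    """Hypothetical sketch; the real processor wraps a tokenizer and an image processor."""

    def __init__(self, tokenizer, image_processor):
        self.tokenizer = tokenizer
        self.image_processor = image_processor

    @property
    def model_input_names(self):
        # Merge the component names dynamically, preserving order without duplicates,
        # instead of hard-coding the list in the processor.
        names = self.tokenizer.model_input_names + self.image_processor.model_input_names
        return list(dict.fromkeys(names))


# Stand-in components exposing the same attribute the real classes have.
tokenizer = SimpleNamespace(model_input_names=["input_ids", "attention_mask"])
image_processor = SimpleNamespace(model_input_names=["pixel_values"])
```

This way, if either component's inputs change, the processor stays consistent automatically.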

Comment on lines +75 to +78
max_grid_col_position_embeddings (`int`, *optional*, defaults to 100):
    The maximum column sequence length of visual position embeddings.
max_grid_row_position_embeddings (`int`, *optional*, defaults to 100):
    The maximum row sequence length of visual position embeddings.

It would be good to explain what this variable means with respect to the images. For example, max_grid_row_positions I would infer as being the largest number of horizontal patches from a video frame.

Comment on lines +667 to +670
if self.visual_prompter_apply != "add":
    visual_prompt_mask = torch.ones([self.max_img_size, self.max_img_size], dtype=pixel_values.dtype)
    pixel_values *= visual_prompt_mask
if self.visual_prompter_apply != "remove":

  • Reasoning about negative conditions is harder than positive ones.
  • Raising an exception makes sure an invalid value doesn't cause execution as if it were "replace", and it documents the expected values.

The exception could also be moved to the __init__ instead.

Suggested change
if self.visual_prompter_apply != "add":
    visual_prompt_mask = torch.ones([self.max_img_size, self.max_img_size], dtype=pixel_values.dtype)
    pixel_values *= visual_prompt_mask
if self.visual_prompter_apply != "remove":
if self.visual_prompter_apply not in ("add", "remove", "replace"):
    raise ValueError(f"Invalid visual_prompter_apply value {self.visual_prompter_apply}")
if self.visual_prompter_apply in ("replace", "remove"):
    visual_prompt_mask = torch.ones([self.max_img_size, self.max_img_size], dtype=pixel_values.dtype)
    pixel_values *= visual_prompt_mask
if self.visual_prompter_apply in ("replace", "add"):

@amyeroberts

@jiqing-feng For the quality checks, there was a recent update and we now use ruff for formatting our code. Could you update the formatting libraries and relint? This should make the quality checks pass:

  • pip uninstall black
  • pip install -e .[quality]
  • make fixup


jiqing-feng commented Nov 21, 2023

Hi @ArthurZucker @amyeroberts . I think all comments have been fixed, would you please review these changes? Thx!

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.


@amyeroberts amyeroberts left a comment


Thanks again!

@amyeroberts amyeroberts merged commit c770600 into huggingface:main Nov 21, 2023
@ArthurZucker

Thanks for this hard work! Congrats on the merge!

@jiqing-feng jiqing-feng deleted the tvp branch November 22, 2023 01:48