
TVP model #25856

Merged
amyeroberts merged 64 commits into huggingface:main from jiqing-feng:tvp
Nov 21, 2023

Conversation


@jiqing-feng jiqing-feng commented Aug 30, 2023

The TVP model was proposed in Text-Visual Prompting for Efficient 2D Temporal Video Grounding by Yimeng Zhang, Xin Chen, Jinghan Jia, Sijia Liu, and Ke Ding. The goal of this model is to incorporate trainable prompts into both visual inputs and textual features for temporal video grounding (TVG) problems.

The current model hub has Intel/tvp-base and Intel/tvp-base-ANet.

BTW, the failed checks are weird because I can successfully run pytest under tests/models/tvp locally.

add tokenizer auto

fix param in TVPProcessor

add docs

clear comments and enable different torch dtype

add image processor test and model test and fix code style
@amyeroberts

cc @rafaelpadilla for first review :)


@rafaelpadilla rafaelpadilla left a comment


Code looks good to me. Just small nits.
I couldn't run the example as Intel/tvp_demo seems to be unavailable.
I also couldn't make a detailed comparison with the original repo, as it doesn't seem to be referenced/available yet (noted as [here](xxx)).

## Overview

The TVP model was proposed in [Text-Visual Prompting for Efficient 2D Temporal Video Grounding](https://arxiv.org/abs/2209.14156) by Yimeng Zhang, Xin Chen, Jinghan Jia, Sijia Liu, Ke Ding. The goal of
this model is to incorporate trainable prompts into both visual inputs and textual features to temporal video grounding(TVG) problems, so we need no more compute-heavy 3D visual encoders and still get SOTA

Suggested change
this model is to incorporate trainable prompts into both visual inputs and textual features to temporal video grounding(TVG) problems, so we need no more compute-heavy 3D visual encoders and still get SOTA
this model is to incorporate trainable prompts into both visual inputs and textual features to temporal video grounding (TVG) problems, so we need no more compute-heavy 3D visual encoders and still get SOTA

Returns:
frames (tensor): decoded frames from the video.
"""
assert clip_idx >= -2, "Not valied clip_idx {}".format(clip_idx)

Suggested change
assert clip_idx >= -2, "Not valied clip_idx {}".format(clip_idx)
assert clip_idx >= -2, "Not a valid clip_idx {}".format(clip_idx)

@jiqing-feng

Hi @rafaelpadilla . Thanks for your review!

I have made the Intel/demo dataset public, so now you can try again. I also fixed the two grammar errors.

However, I think it would be better to keep test_image_processing_tvp.py the same, because bridgetower, clip, and videomae do the same thing, and I cannot guarantee that image_utils.OPENAI_CLIP_MEAN will not change.


jiqing-feng commented Sep 5, 2023

Hi @rafaelpadilla . I added some examples from Charades to my demo dataset; the corresponding text can be seen here. You can run the script in this PR, replacing the video and text, to check the performance of this model.


@amyeroberts amyeroberts left a comment


Thanks for adding this model!

I've done an initial pass, highlighting the main structural parts that stood out and should be revised so that the model is in line with the rest of the transformers library. In particular:

  • All custom layers should have the model prefix in their name
  • The model prefix should be in camel case, i.e. Tvp
  • All custom layers should take the config as their first argument in their init and use it to instantiate themselves as much as possible
  • All parameters should have clear, explicit names, e.g. batch_size over bsz, height instead of h
  • All models importable from the top level of transformers, i.e. from transformers import TvpForXxx, should have docstrings for their forward method
  • Freezing parameters is not something that should be configurable from the config - rather, we let users control that after loading the model if they desire.

Overall the code looks nice and clean - we'll probably just have a few rounds of review to make sure the model is library-compatible.
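To make the conventions above concrete, here is a hypothetical toy layer that follows them; `TvpToyConfig` and `TvpToyLayer` are illustrative names for this sketch, not the actual TVP code:

```python
import torch
from torch import nn


class TvpToyConfig:
    # Hypothetical minimal config; the real model would use a TvpConfig.
    def __init__(self, hidden_size=8):
        self.hidden_size = hidden_size


class TvpToyLayer(nn.Module):
    # Camel-case "Tvp" prefix; config is the first init argument and drives all sizes.
    def __init__(self, config):
        super().__init__()
        self.dense = nn.Linear(config.hidden_size, config.hidden_size)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # Explicit names: batch_size / sequence_length rather than bsz / seq_len.
        batch_size, sequence_length, _ = hidden_states.shape
        return self.dense(hidden_states)
```

A layer written this way can be instantiated from the config alone, which is what makes `from_pretrained`-style loading work uniformly across the library.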

@@ -0,0 +1,136 @@
# coding=utf-8
# Copyright 2022 The HuggingFace Inc. team.

Suggested change
# Copyright 2022 The HuggingFace Inc. team.
# Copyright 2023 The HuggingFace Inc. team.

]


def mish(x):

Is there a difference between this implementation and the one in the ACT2FN implemented in activations.py?
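For reference, Mish is defined as x · tanh(softplus(x)); a pure-Python sketch of that formula (the ACT2FN mapping in activations.py should provide an equivalent torch implementation):

```python
import math


def softplus(x: float) -> float:
    # softplus(x) = ln(1 + e^x); log1p keeps precision for small e^x
    return math.log1p(math.exp(x))


def mish(x: float) -> float:
    # Mish activation: x * tanh(softplus(x))
    return x * math.tanh(softplus(x))
```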

Comment on lines +224 to +230
pipeline_model_mapping = (
    {
        "temporal-video-grounding": TVPModel,
    }
    if is_torch_available()
    else {}
)

nit: can be put on one line

Suggested change
pipeline_model_mapping = (
    {
        "temporal-video-grounding": TVPModel,
    }
    if is_torch_available()
    else {}
)
pipeline_model_mapping = {"temporal-video-grounding": TVPModel} if is_torch_available() else {}

return out


class TVPVisionBackbone(nn.Module):

Is this backbone a standard resnet? If so, it should be loaded using our AutoBackbone API. There's a ResNetBackbone already implemented.

@jiqing-feng

Hi @amyeroberts .

I am willing to use ResNetBackbone because it is more convenient. However, the backbone of the original pre-trained TVP checkpoint is slightly different from ResNetBackbone.
See the following images:
TVP backbone: [screenshot]

ResNet backbone: [screenshot]

We can see that the stride is different. Moreover, ResNetBackbone does not support controlling the stride in ResNetEncoder.

One possible way is to change the ResNet model to support manipulating the stride in the ResNet backbone. Otherwise, we can only keep the TVP backbone in our source code.

I'd like to hear your opinion. Thx!

@ArthurZucker

You can probably use:

class ResNetEncoder(nn.Module):
    def __init__(self, config: ResNetConfig):
        super().__init__()
        self.stages = nn.ModuleList([])
        # based on `downsample_in_first_stage` the first layer of the first stage may or may not downsample the input
        self.stages.append(
            ResNetStage(
                config,
                config.embedding_size,
                config.hidden_sizes[0],
                stride=2 if config.downsample_in_first_stage else 1,
                depth=config.depths[0],
            )
        )
        in_out_channels = zip(config.hidden_sizes, config.hidden_sizes[1:])
        for (in_channels, out_channels), depth in zip(in_out_channels, config.depths[1:]):
            self.stages.append(ResNetStage(config, in_channels, out_channels, depth=depth))

So setting config.downsample_in_first_stage to the appropriate value should work, no?

@jiqing-feng jiqing-feng Sep 19, 2023

Hi @ArthurZucker . Thanks for your advice, but the differences remain even if I set downsample_in_first_stage=True in config.json. Besides, the differences start from the second stage, and downsample_in_first_stage only controls the first stage's stride.

self.stages_and_names = []
for i, blocks in enumerate(stages):
    for block in blocks:
        assert isinstance(block, TVPVisionBlockBase), block

We don't have asserts in our modeling code - this should be removed or made an exception
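A minimal sketch of the suggested pattern, replacing a bare assert with an explicit exception; `require_type` is an illustrative helper for this sketch, not library code:

```python
def require_type(value, expected_type):
    # Explicit exception instead of `assert`: it survives `python -O`
    # (which strips asserts) and gives callers a defined error type to catch.
    if not isinstance(value, expected_type):
        raise TypeError(
            f"Expected {expected_type.__name__}, got {type(value).__name__}"
        )
```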

Comment on lines +383 to +386
assert len(self._out_features)
children = [x[0] for x in self.named_children()]
for out_feature in self._out_features:
    assert out_feature in children, "Available children: {}".format(", ".join(children))
Same here for asserts

Comment on lines +472 to +474
if config.backbone_freeze_at >= stage_idx:
    for block in blocks:
        block.freeze()
Suggested change
if config.backbone_freeze_at >= stage_idx:
    for block in blocks:
        block.freeze()

Comment on lines +32 to +44
resnets_depth (`int`, *optional*, defaults to 50):
    The depth of the resnet model used as the feature extractor.
resnets_num_groups (`int`, *optional*, defaults to 1):
    The number of groups for torch.nn.Conv2d.
resnets_width_per_group (`int`, *optional*, defaults to 64):
    The intermediate conv layer channels per group.
resnets_stem_input_channels (`int`, *optional*, defaults to 3):
    The input channels of the stem module in resnet.
resnets_stem_out_channels (`int`, *optional*, defaults to 64):
    The output channels of the stem module, which are also the input channels of the resnet stages.
resnets_res_out_channels (`int`, *optional*, defaults to 256):
    The output channels of resnet.
resnets_res_dilation (`int`, *optional*, defaults to 1):

The feature extractor / backbone should be loaded using the Backbone API using a backbone config e.g. like this for DETR.

@jiqing-feng

Hi @amyeroberts . I have fixed most issues, only #25856 (comment) remains to be discussed. Once this issue is done, we can train a small TVP model and start to change the tests.

Thanks for your support!


ArthurZucker commented Sep 18, 2023

Hey! @amyeroberts will be OOO for a while 😉 feel free to ping me or @rafaelpadilla if you need help


jiqing-feng commented Sep 19, 2023

Hi, @ArthurZucker . I have 3 plans for this issue since downsample_in_first_stage does not work.

  1. Keep the TVP backbone as it has differences from the original ResNet backbone.
  2. Change the ResNet backbone API to support manipulating stride in some layers.
  3. Read the ResNet backbone first, and then change some conv layers' stride in the TVP initialization stage.

Which plan do you prefer? I'd like to hear your opinion. Thx!

BTW, I prefer option 3 because it is more convenient; we just need to add the following code:

self.backbone = AutoBackbone.from_config(backbone_config)
if backbone_config.model_type == "resnet" and backbone_config.layer_type == "bottleneck":
    for i in range(1, len(backbone_config.depths)):
        self.backbone.encoder.stages[i].layers[0].layer[0].convolution.stride = (2, 2)
        self.backbone.encoder.stages[i].layers[0].layer[1].convolution.stride = (1, 1)

I also see that detr performs the same operation, manipulating some layers in the backbone, so I think it is a reasonable approach.

@ArthurZucker

Hey! The list of models that have a backbone is the following:

    [
        # Backbone mapping
        ("bit", "BitBackbone"),
        ("convnext", "ConvNextBackbone"),
        ("convnextv2", "ConvNextV2Backbone"),
        ("dinat", "DinatBackbone"),
        ("dinov2", "Dinov2Backbone"),
        ("focalnet", "FocalNetBackbone"),
        ("maskformer-swin", "MaskFormerSwinBackbone"),
        ("nat", "NatBackbone"),
        ("resnet", "ResNetBackbone"),
        ("swin", "SwinBackbone"),
        ("timm_backbone", "TimmBackbone"),
        ("vitdet", "VitDetBackbone"),
    ]

Either there is one that fits your need, or if the changes involved are too big you can go with 3, as there is precedent for it. I'm not a super fan of 3, and I believe 2 would be better if possible!

@jiqing-feng


Hi @ArthurZucker . Thanks for your advice. Option 2 is now ready to be reviewed. Could you please have a look at #26374? Thx!

@ArthurZucker

Sure reviewing now!

@jiqing-feng

Hi @ArthurZucker @amyeroberts , would you please review the new changes? Thx


jiqing-feng commented Nov 9, 2023

Hi @ArthurZucker @amyeroberts . Thanks for your review. Most of the comments have been resolved. There are still 2 main issues:

  1. get_resize_size was a member function of TvpProcessor in the first version, but it was rejected by @amyeroberts , so I am not sure where this function should live.
  2. The sequence length in TVP consists of the text input ids length, the text prompt length, and the visual input length, so I need to change test_hidden_states and test_attention_outputs.

Other issues are related to the model doc and arg comments; please see the unresolved conversations. Thx!
BTW, the failed CI checks are not related to my changes.


jiqing-feng commented Nov 14, 2023

Hi @ArthurZucker @amyeroberts . The CI seems OK, would you please have a look at these unresolved comments? Thx


@amyeroberts amyeroberts left a comment


Thanks for iterating!

Mostly small comments - almost there!

"""
return self.tokenizer.decode(*args, **kwargs)

def post_process_video_grounding(self, outputs, video_durations):

Missing docstring

- This implementation of TVP uses [`BertTokenizer`] to generate text embeddings and Resnet-50 model to compute visual embeddings.
- Checkpoints for pre-trained [tvp-base](https://huggingface.co/Intel/tvp-base) is released.
- Please refer to [Table 2](https://arxiv.org/pdf/2303.04995.pdf) for TVP's performance on Temporal Video Grounding task.
- The PyTorch version of this model is only available in torch 1.10 and higher.

We don't need this, as only torch >= 1.10 is supported in this library

Suggested change
- The PyTorch version of this model is only available in torch 1.10 and higher.

Comment on lines +858 to +859
losses = ["iou", "distance", "duration"]
criterion = TvpLoss(losses)

Suggested change
losses = ["iou", "distance", "duration"]
criterion = TvpLoss(losses)
criterion = TvpLoss(["iou", "distance", "duration"])

Comment on lines +380 to +386
return (
    TvpImageProcessor.from_pretrained(
        "Jiqing/tiny-random-tvp",
    )
    if is_vision_available()
    else None
)

Suggested change
return (
    TvpImageProcessor.from_pretrained(
        "Jiqing/tiny-random-tvp",
    )
    if is_vision_available()
    else None
)
return TvpImageProcessor.from_pretrained("Jiqing/tiny-random-tvp") if is_vision_available() else None


*In this paper, we study the problem of temporal video grounding (TVG), which aims to predict the starting/ending time points of moments described by a text sentence within a long untrimmed video. Benefiting from fine-grained 3D visual features, the TVG techniques have achieved remarkable progress in recent years. However, the high complexity of 3D convolutional neural networks (CNNs) makes extracting dense 3D visual features time-consuming, which calls for intensive memory and computing resources. Towards efficient TVG, we propose a novel text-visual prompting (TVP) framework, which incorporates optimized perturbation patterns (that we call ‘prompts’) into both visual inputs and textual features of a TVG model. In sharp contrast to 3D CNNs, we show that TVP allows us to effectively co-train vision encoder and language encoder in a 2D TVG model and improves the performance of cross-modal feature fusion using only low-complexity sparse 2D visual features. Further, we propose a Temporal-Distance IoU (TDIoU) loss for efficient learning of TVG. Experiments on two benchmark datasets, Charades-STA and ActivityNet Captions datasets, empirically show that the proposed TVP significantly boosts the performance of 2D TVG (e.g., 9.79% improvement on Charades-STA and 30.77% improvement on ActivityNet Captions) and achieves 5× inference acceleration over TVG using 3D visual features.*

TVP framework is an effective and efficient framework to train time-efficient 2D TVG models, in which we leverage TVP (text-visual prompting) to improve the utility of sparse 2D visual features without resorting to costly 3D features. To the best of our knowledge, it is the first work to expand the application of prompt learning for resolving TVG problems.

Yes - that's great, thank you :) Could you add this to the model page? Searching I can only find this text in this paragraph

self.all_head_size = self.attention_head_size * self.num_attention_heads
self.pruned_heads = self.pruned_heads.union(heads)

def _shape(self, tensor: torch.Tensor, sequence_length: int, batch_size: int):

I realise this is copied from other parts of the library - but let's take the opportunity to use a better name here.

Suggested change
def _shape(self, tensor: torch.Tensor, sequence_length: int, batch_size: int):
def _reshape(self, tensor: torch.Tensor, sequence_length: int, batch_size: int):

As tensor.shape returns the shape of the tensor, tensor.reshape reshapes the tensor.

attention_output = self.dense(outputs[0])
attention_output = self.dropout(attention_output)
attention_output = self.layer_norm(attention_output + hidden_states)
outputs = (attention_output,) + outputs[1:] # add attentions if we output them

I don't understand this line. On L363 you create a tuple, and then in this line you always throw away the first element of the tuple. I think you're wanting something like this:

Suggested change
outputs = (attention_output,) + outputs[1:]  # add attentions if we output them
attn_output = self.dense(attn_output)
attn_output = self.dropout(attn_output)
attn_output = self.layer_norm(attn_output + hidden_states)
# add attentions if we output them
outputs = (attn_output, attention_probs) if output_attentions else (attn_output,)

if config.hidden_size % config.num_attention_heads != 0 and not hasattr(config, "embedding_size"):
    raise ValueError(
        "The hidden size (%d) is not a multiple of the number of attention "
        "heads (%d)" % (config.hidden_size, config.num_attention_heads)

We use f-strings rather than old-style % formatting
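A sketch of the same check with an f-string; `check_head_divisibility` is an illustrative wrapper for this example, whereas the real code raises inline in the module's __init__:

```python
from types import SimpleNamespace


def check_head_divisibility(config):
    # Same validation as above, using an f-string instead of %-interpolation.
    if config.hidden_size % config.num_attention_heads != 0 and not hasattr(config, "embedding_size"):
        raise ValueError(
            f"The hidden size ({config.hidden_size}) is not a multiple of the number "
            f"of attention heads ({config.num_attention_heads})"
        )


# A minimal stand-in config for demonstration.
config = SimpleNamespace(hidden_size=12, num_attention_heads=3)
check_head_divisibility(config)  # passes silently
```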


There was a recent update to the model page docs to standardise the structure - #26876

This page should be updated to reflect those changes.

Comment on lines +114 to +134
def get_resize_size(image, max_size):
    """
    Args:
        image: np.ndarray
        max_size: The max size of height and width
    Returns:
        (height, width)

    Note the height/width order difference:
        >>> pil_img = Image.open("raw_img_tensor.jpg")
        >>> pil_img.size
        (640, 480)  # (width, height)
        >>> np_img = np.array(pil_img)
        >>> np_img.shape
        (480, 640, 3)  # (height, width, 3)
    """
    height, width = image.shape[-2:]
    if height >= width:
        ratio = width * 1.0 / height
        new_height = max_size
        new_width = new_height * ratio
    else:
        ratio = height * 1.0 / width
        new_width = max_size
        new_height = new_width * ratio
    size = {"height": int(new_height), "width": int(new_width)}
    return size

This should be part of the image processor e.g. like here for Vilt
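For clarity, the resize logic above boils down to scaling the longer side to max_size while keeping the aspect ratio; a dependency-free sketch (taking height/width directly instead of an array, purely for illustration):

```python
def get_resize_size(height: int, width: int, max_size: int) -> dict:
    # Scale the longer side to `max_size`, keeping the aspect ratio.
    if height >= width:
        new_height, new_width = max_size, int(max_size * width / height)
    else:
        new_width, new_height = max_size, int(max_size * height / width)
    return {"height": new_height, "width": new_width}
```

For example, a 480x640 (height x width) frame with max_size=448 resizes to 336x448.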

@jiqing-feng

Hi @ArthurZucker @amyeroberts . Thanks for your review. All the comments are clear to me now, and I have fixed all of them. Would you please review these changes? Thx!


@amyeroberts amyeroberts left a comment


Beautiful ❤️ Thanks for all the work iterating and implementing this!

There's some tiny nits but otherwise all looks good!


@property
def model_input_names(self):
    return ["input_ids", "attention_mask", "pixel_values"]

This should be taken dynamically from the tokenizer and image processor e.g. like here
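A hedged sketch of the dynamic pattern the reviewer links to; `TvpProcessorSketch` and the stand-in components are illustrative, not the merged implementation:

```python
from types import SimpleNamespace


class TvpProcessorSketch:
    """Hypothetical sketch; the real processor wraps a tokenizer and an image processor."""

    def __init__(self, tokenizer, image_processor):
        self.tokenizer = tokenizer
        self.image_processor = image_processor

    @property
    def model_input_names(self):
        # Merge the component names dynamically, preserving order without duplicates,
        # instead of hard-coding the list in the processor.
        names = self.tokenizer.model_input_names + self.image_processor.model_input_names
        return list(dict.fromkeys(names))


# Stand-in components exposing the same attribute the real classes have.
tokenizer = SimpleNamespace(model_input_names=["input_ids", "attention_mask"])
image_processor = SimpleNamespace(model_input_names=["pixel_values"])
```

This way, if either component's inputs change, the processor stays consistent automatically.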

Comment on lines +75 to +78
max_grid_col_position_embeddings (`int`, *optional*, defaults to 100):
    The maximum column sequence length of visual position embeddings.
max_grid_row_position_embeddings (`int`, *optional*, defaults to 100):
    The maximum row sequence length of visual position embeddings.

It would be good to explain what this variable means with respect to the images. For example, max_grid_row_positions I would infer as being the largest number of horizontal patches from a video frame.

Comment on lines +667 to +670
if self.visual_prompter_apply != "add":
    visual_prompt_mask = torch.ones([self.max_img_size, self.max_img_size], dtype=pixel_values.dtype)
    pixel_values *= visual_prompt_mask
if self.visual_prompter_apply != "remove":

  • Reasoning about negative conditions is harder than positive ones.
  • Raising an exception makes sure an invalid value doesn't cause execution as if it were "replace", and it documents the expected values.

The exception could also be moved to the __init__ instead.

Suggested change
if self.visual_prompter_apply != "add":
    visual_prompt_mask = torch.ones([self.max_img_size, self.max_img_size], dtype=pixel_values.dtype)
    pixel_values *= visual_prompt_mask
if self.visual_prompter_apply != "remove":
if self.visual_prompter_apply not in ("add", "remove", "replace"):
    raise ValueError(f"Invalid visual_prompter_apply value {self.visual_prompter_apply}")
if self.visual_prompter_apply in ("replace", "remove"):
    visual_prompt_mask = torch.ones([self.max_img_size, self.max_img_size], dtype=pixel_values.dtype)
    pixel_values *= visual_prompt_mask
if self.visual_prompter_apply in ("replace", "add"):

@amyeroberts

@jiqing-feng For the quality checks, there was a recent update and we now use ruff for formatting our code. Could you update the formatting libraries and relint? This should make the quality checks pass:

  • pip uninstall black
  • pip install -e .[quality]
  • make fixup


jiqing-feng commented Nov 21, 2023

Hi @ArthurZucker @amyeroberts . I think all comments have been fixed, would you please review these changes? Thx!

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.


@amyeroberts amyeroberts left a comment


Thanks again!

@amyeroberts amyeroberts merged commit c770600 into huggingface:main Nov 21, 2023
@ArthurZucker

Thanks for this hard work! Congrats on the merge!

@jiqing-feng jiqing-feng deleted the tvp branch November 22, 2023 01:48