
Add DPT #15991

Merged

NielsRogge merged 55 commits into huggingface:main from NielsRogge:add_dpt_redesign on Mar 28, 2022

Conversation

@NielsRogge (Contributor) commented Mar 8, 2022

What does this PR do?

This PR adds DPT, Dense Prediction Transformers, to the library. It's some very nice work from Intel Labs that applies Transformers to dense prediction tasks such as semantic segmentation and depth estimation.

Feel free to play around with the notebook here.

I've defined 3 models:

  • DPTModel
  • DPTForDepthEstimation
  • DPTForSemanticSegmentation

DPTModel is the backbone only (ViT in this case). The head models use a neck (DPTNeck) combined with a task-specific head (either DPTDepthEstimationHead or DPTSemanticSegmentationHead).

Important here:

  • a neck is an nn.Module that takes a list of tensors and produces another list of tensors.
  • a head takes a list of tensors and returns logits.
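The neck/head contract above can be sketched with toy modules. ToyNeck, ToyHead, and all channel sizes below are made up for illustration; they are not the PR's actual classes:

```python
import torch
from torch import nn


class ToyNeck(nn.Module):
    """Hypothetical neck: list of feature maps in, list of feature maps out."""

    def __init__(self, channels):
        super().__init__()
        # one projection per incoming feature map
        self.convs = nn.ModuleList(nn.Conv2d(c, 64, kernel_size=1) for c in channels)

    def forward(self, hidden_states):
        return [conv(x) for conv, x in zip(self.convs, hidden_states)]


class ToyHead(nn.Module):
    """Hypothetical head: list of feature maps in, logits out."""

    def __init__(self, num_labels):
        super().__init__()
        self.classifier = nn.Conv2d(64, num_labels, kernel_size=1)

    def forward(self, hidden_states):
        # consume the last feature map produced by the neck
        return self.classifier(hidden_states[-1])


# two backbone feature maps with different channel counts
features = [torch.randn(1, c, 16, 16) for c in (96, 192)]
logits = ToyHead(num_labels=10)(ToyNeck(channels=(96, 192))(features))
print(logits.shape)  # torch.Size([1, 10, 16, 16])
```

The point of the split is that any backbone producing a list of feature maps can be combined with any task head through the same list-in/list-out neck interface.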

To do:

  • make sure heads take a list of tensors as input
  • add tests for DPTFeatureExtractor
  • discuss out_indices and in_index names
  • transfer weights to Intel organization

HuggingFaceDocBuilderDev commented Mar 8, 2022

The documentation is not available anymore as the PR was closed or merged.

@sgugger (Collaborator) left a comment

Thanks for your PR!

There are still a lot of things to clean up in the modeling file. Make sure to follow the usual style guidelines we have in place.

Contributor

Thanks for working on it, it looks like a very cool model. There are some naming conventions that must be applied:

  • ***Block/s -> ***Layer/s
  • use Stage when we stack multiple ***Layers together
  • when possible use hidden_size/s instead of channels

Some code could use some refactoring to increase readability.

Comment on lines +499 to +522
        self.use_batch_norm = config.use_batch_norm
        self.act1 = nn.ReLU()
        self.conv1 = nn.Conv2d(
            config.channels,
            config.channels,
            kernel_size=3,
            stride=1,
            padding=1,
            bias=not self.use_batch_norm,
        )

        self.act2 = nn.ReLU()
        self.conv2 = nn.Conv2d(
            config.channels,
            config.channels,
            kernel_size=3,
            stride=1,
            padding=1,
            bias=not self.use_batch_norm,
        )

        if self.use_batch_norm:
            self.batch_norm1 = nn.BatchNorm2d(config.channels)
            self.batch_norm2 = nn.BatchNorm2d(config.channels)


That could work. Keep in mind I am not sure whether we always prefer named layers over unnamed ones (such as stacking them in a sequential layer), so take my design comments with a grain of salt.
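To illustrate the alternative hinted at here, a small sketch (arbitrary channel count, not the PR's actual code) showing that the named-layer pattern and an nn.Sequential stack of the very same modules compute the same thing:

```python
import torch
from torch import nn

channels, use_batch_norm = 8, True

# named sub-modules, mirroring the pattern in the snippet under review
act1 = nn.ReLU()
conv1 = nn.Conv2d(
    channels, channels, kernel_size=3, stride=1, padding=1, bias=not use_batch_norm
)
batch_norm1 = nn.BatchNorm2d(channels)
batch_norm1.eval()  # use running stats so the comparison is deterministic

# the stacked alternative: same modules, composed in the same order
unit = nn.Sequential(act1, conv1, batch_norm1)

x = torch.randn(2, channels, 4, 4)
named_out = batch_norm1(conv1(act1(x)))
stacked_out = unit(x)
print(torch.allclose(named_out, stacked_out))  # True
```

The trade-off is access vs. brevity: named attributes make it easy to reference individual layers (e.g. for the conditional batch-norm branch), while nn.Sequential keeps the forward pass shorter.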

Comment on lines +660 to +688
        self.embeddings = DPTViTEmbeddings(config)
        self.encoder = DPTViTEncoder(config)


Okay, why is it not called DPTViTModel then?

Contributor

Thanks! A couple of comments about individual variables and naming.

@NielsRogge (Contributor, Author) commented Mar 15, 2022

Thanks for your reviews, I've addressed most comments. The main things to update are:

  • rename out_indices (which features to use from the backbone)
  • rename in_index (which features to use in the head)

Ideally, we pick names that will be used by all vision models.
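For context on these two parameters, a minimal sketch (layer names and index values below are made up for illustration) of how out_indices and in_index select features at different stages of the pipeline:

```python
# stand-ins for the feature maps produced by a 12-layer encoder
hidden_states = [f"layer_{i}" for i in range(12)]

out_indices = (2, 5, 8, 11)  # which encoder layers the backbone exposes to the neck
in_index = -1                # which of those features the head consumes

backbone_features = [hidden_states[i] for i in out_indices]
head_input = backbone_features[in_index]
print(backbone_features)  # ['layer_2', 'layer_5', 'layer_8', 'layer_11']
print(head_input)         # 'layer_11'
```

So out_indices is a backbone-level choice and in_index a head-level one, which is why a shared naming scheme across vision models matters.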

Contributor

Thanks! Approved; there are still some minor comments about variable names.

@sgugger (Collaborator) left a comment

There are still a few comments pending from the first review; I have highlighted them.

            The number of output channels for each of the four feature maps of the backbone.
        channels (`int`, *optional*, defaults to 256):
            The number of channels before fusion.
        in_index (`int`, *optional*, defaults to -1):
Collaborator

I think you agreed to use head_in_indices here?

Collaborator

This has not been addressed.

Collaborator

Same for this one.

NielsRogge merged commit 979b039 into huggingface:main on Mar 28, 2022
debtriche commented Oct 11, 2022 via email

6 participants