
Support S3D-backed video encoder #104

Closed
sophiazhi wants to merge 13 commits into facebookresearch:main from sophiazhi:szhi-s3d

Conversation

@sophiazhi
Contributor

@sophiazhi sophiazhi commented Jun 20, 2022

Summary:
For the S3D-backed video encoder used by MUGEN video-text retrieval, added:

  • examples/mugen/retrieval/s3d.py: S3D model
  • examples/mugen/test/retrieval/test_s3d.py: S3D model unit tests
  • examples/mugen/retrieval/video_clip.py: video encoder
  • examples/mugen/test/retrieval/test_video_clip.py: video encoder tests

Test plan:
For a coverage report, temporarily remove the "examples/" lines from `.coveragerc`, then run `python -m pytest --cov=examples/mugen/retrieval examples/mugen/test/retrieval/test_s3d.py -vv`. Otherwise, run `python -m pytest examples/mugen/test/retrieval/test_s3d.py::TestS3D -rP`.
Similarly, for the video encoder tests:
`python -m pytest examples/mugen/test/retrieval/test_video_clip.py::TestVideoCLIP -rP`
[Two screenshots of test output attached (Jun 27, 2022)]

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jun 20, 2022
self.preprocess_std = preprocess_std
self.model = S3D(400)  # S3D constructed with a 400-class classification head
self.embedding_dim = list(self.model.fc.children())[0].in_channels
self.model.fc = nn.Identity()  # strip the head so forward() returns features


Consider breaking the S3D class into a trunk and a classification head for better composability, to avoid model surgery like this. We can discuss this design with TorchVision to see if they could incorporate it.
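A rough sketch of what that trunk/head split could look like. The class names (`S3DTrunk`, `S3DClassifier`) are illustrative, and the trunk body here is a stand-in for the real S3D conv stack, not the PR's actual code:

```python
import torch
from torch import nn


class S3DTrunk(nn.Module):
    """Feature extractor: video in, pooled embedding out."""

    def __init__(self, embedding_dim: int = 1024):
        super().__init__()
        self.embedding_dim = embedding_dim
        # Stand-in for the real conv stack; real S3D features are (B, C, T, H, W)
        self.features = nn.Conv3d(3, embedding_dim, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.features(x)
        # Global average pool over time and space -> (B, embedding_dim)
        return x.mean(dim=(2, 3, 4))


class S3DClassifier(nn.Module):
    """Optional classification head composed on top of the trunk."""

    def __init__(self, trunk: S3DTrunk, num_classes: int = 400):
        super().__init__()
        self.trunk = trunk
        self.head = nn.Linear(trunk.embedding_dim, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(self.trunk(x))
```

With this split, a video encoder could take `S3DTrunk` directly and the `self.model.fc = nn.Identity()` surgery would no longer be needed.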

@codecov-commenter

codecov-commenter commented Jun 22, 2022

Codecov Report

Merging #104 (3716aaf) into main (b2e654e) will decrease coverage by 0.64%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##             main     #104      +/-   ##
==========================================
- Coverage   88.39%   87.75%   -0.65%     
==========================================
  Files          35       36       +1     
  Lines        1853     1911      +58     
==========================================
+ Hits         1638     1677      +39     
- Misses        215      234      +19     
Impacted Files Coverage Δ
...multimodal/modules/encoders/mdetr_image_encoder.py 67.24% <0.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b2e654e...3716aaf. Read the comment docs.

return x


class Mixed3b(nn.Module):
Contributor


Do you know the reason for the "Mixed" names? Although I can't think of a clearer name for these...

Contributor Author


Because they use a mix of 3D and 2D convolutions! (See page 3 of the paper.)

Contributor


Maybe you could rename to `MixedConvs3b`? Your call though.



torchvision has similar classes called "Inception{A}{B}...", but over there they assume the conv sub-unit is uniform, i.e., 3D or 2D or 1D. Here we use a mixture of 2D and 1D conv sub-units in the same module. Eventually, this script should be upstreamed to TorchVision after generalizing their current implementation:

https://github.com/pytorch/vision/blob/main/torchvision/models/inception.py#L176
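For context, the mixed sub-unit described above can be sketched as a factorized convolution: a 1xKxK spatial conv followed by a Kx1x1 temporal conv, both expressed as `Conv3d`. This is a generic illustration of the idea, not the PR's exact implementation:

```python
import torch
from torch import nn


class SeparableConv3d(nn.Module):
    """Factorized 3D conv: 2D spatial pass, then 1D temporal pass."""

    def __init__(self, in_ch: int, out_ch: int, k: int = 3):
        super().__init__()
        # 1xKxK conv mixes spatial information only
        self.spatial = nn.Conv3d(
            in_ch, out_ch, (1, k, k), padding=(0, k // 2, k // 2)
        )
        # Kx1x1 conv mixes temporal information only
        self.temporal = nn.Conv3d(
            out_ch, out_ch, (k, 1, 1), padding=(k // 2, 0, 0)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.temporal(self.spatial(x))
```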

Contributor


Possibly a noob suggestion here, but all these Mixed* classes have very similar structure. Why not create a class per branch (except maybe branch0, since it's just a single op) that you can pass the primitive params (in/out dims, kernel_size, stride) to, then write a single class to handle the torch.cat portion? Then you don't have to write the same four Sequentials a ton of times.
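The suggested refactor could look roughly like this. It is a sketch under assumptions: `ConvBranch` and `MixedBlock` are made-up names, the channel counts are illustrative, and the real Mixed* blocks also include a max-pool branch that this omits:

```python
import torch
from torch import nn


class ConvBranch(nn.Sequential):
    """One branch built from a list of (in_ch, out_ch, kernel_size) specs."""

    def __init__(self, specs):
        layers = []
        for in_ch, out_ch, k in specs:
            layers += [
                nn.Conv3d(in_ch, out_ch, k, padding=k // 2),
                nn.BatchNorm3d(out_ch),
                nn.ReLU(inplace=True),
            ]
        super().__init__(*layers)


class MixedBlock(nn.Module):
    """Runs each branch on the input and concatenates along channels."""

    def __init__(self, branches):
        super().__init__()
        self.branches = nn.ModuleList(branches)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.cat([b(x) for b in self.branches], dim=1)


# A Mixed3b-like block then becomes a parameter declaration instead of
# four hand-written Sequentials (channel numbers here are illustrative):
mixed3b_like = MixedBlock([
    ConvBranch([(192, 64, 1)]),
    ConvBranch([(192, 96, 1), (96, 128, 3)]),
    ConvBranch([(192, 16, 1), (16, 32, 3)]),
])
```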

Contributor Author


Lan and I discussed not generalizing the S3D architecture yet, since that requires more time to understand the existing torchvision VideoResNet code and generalize S3D in a consistent way (reusing torchvision classes, generalizing them, following a similar submodule structure). I like your suggestions, but for now we didn't want to generalize this S3D code if it has to be changed later anyway to a possibly different structure.

Contributor


I am not necessarily thinking about how to generalize the S3D architecture as a whole, more about how to reduce the copypasta in all the Mixed* classes. Seems they could all come from a common class with a different set of params passed, no?

Contributor Author


Sure, I'll try that out in a new commit.

@sophiazhi sophiazhi marked this pull request as ready for review June 27, 2022 15:03
@facebook-github-bot
Contributor

@sophiazhi has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@sophiazhi sophiazhi mentioned this pull request Jul 6, 2022
