Support S3D-backed video encoder#104
Support S3D-backed video encoder#104sophiazhi wants to merge 13 commits intofacebookresearch:mainfrom
Conversation
examples/s3d/s3d.py
Outdated
| self.preprocess_std = preprocess_std | ||
| self.model = S3D(400) | ||
| self.embedding_dim = list(self.model.fc.children())[0].in_channels | ||
| self.model.fc = nn.Identity() |
There was a problem hiding this comment.
Consider breaking S3D class into trunk and classification head for better composability to avoid model surgery like this. We can discuss this design with TorchVision to see if they could incorporate.
Codecov Report
@@ Coverage Diff @@
## main #104 +/- ##
==========================================
- Coverage 88.39% 87.75% -0.65%
==========================================
Files 35 36 +1
Lines 1853 1911 +58
==========================================
+ Hits 1638 1677 +39
- Misses 215 234 +19
Continue to review full report at Codecov.
|
| return x | ||
|
|
||
|
|
||
| class Mixed3b(nn.Module): |
There was a problem hiding this comment.
do you know the reason for the Mixed names? although I can't think of a clearer name for these...
There was a problem hiding this comment.
because they use a mix of 3D and 2D convolutions! paper page 3
There was a problem hiding this comment.
maybe you could rename to MixedConvs3b? your call though
There was a problem hiding this comment.
torchvision has similar classes called "Inception{A}{B}..." but over there they assume the conv sub unit is uniform, i.e., 3D or 2D or 1D. Here we use a mixture of 2D and 1D conv sub units in the same module. Eventually, this script should be upstreamed to TorchVision after generalizing their current implementation:
https://github.com/pytorch/vision/blob/main/torchvision/models/inception.py#L176
There was a problem hiding this comment.
Possibly a noob suggestion here, but all these Mixed* classes have very similar structure. Why not create a class per branch (except maybe branch0 since it's just a single op) that you can pass the primitive params (in/out dims, kernel_size, stride) to, then write a single class to handle the torch.cat portion of things? Then you don't have to write the same 4 sequentials a ton of times
There was a problem hiding this comment.
Lan and I discussed not generalizing the S3D architecture yet, since that process requires more time to understand the existing torchvision VideoResnet code and generalize S3D in a consistent way (reusing torchvision classes, generalizing torchvision classes, following similar submodule structure). I like your suggestions but for now we didn’t want to generalize this S3D code if it has to be changed later anyway to a possibly different structure.
There was a problem hiding this comment.
I am not necessarily thinking about how to generalize the S3D architecture as a whole, more about how to reduce the copypasta in all the Mixed* classes. Seems they could all come from a common class with a different set of params passed, no?
There was a problem hiding this comment.
Sure I'll try that out in a new commit
| return x | ||
|
|
||
|
|
||
| class Mixed3b(nn.Module): |
There was a problem hiding this comment.
Possibly a noob suggestion here, but all these Mixed* classes have very similar structure. Why not create a class per branch (except maybe branch0 since it's just a single op) that you can pass the primitive params (in/out dims, kernel_size, stride) to, then write a single class to handle the torch.cat portion of things? Then you don't have to write the same 4 sequentials a ton of times
|
@sophiazhi has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
|
@sophiazhi has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
Summary:
For the S3D-backed video encoder used by MUGEN video-text retrieval, added:
examples/mugen/retrieval/s3d.py: S3D modelexamples/mugen/test/retrieval/test_s3d.py: S3D model unit testsexamples/mugen/retrieval/video_clip.py: video encoderexamples/mugen/test/retrieval/test_video_clip.py: video encoder testsTest plan:


For a coverage report, temporarily remove the "examples/" lines from
.coveragerc, then runpython -m pytest --cov=examples/mugen/retrieval examples/mugen/test/retrieval/test_s3d.py -vv. Otherwise, runpython -m pytest examples/mugen/test/retrieval/test_s3d.py::TestS3D -rP.Similarly,
python -m pytest examples/mugen/test/retrieval/test_video_clip.py::TestVideoCLIP -rP