🔥 Our SyncVIS has been accepted to NeurIPS 2024 as a poster! (2024.10)
SyncVIS explicitly introduces video-level query embeddings and designs two key modules to synchronize them with frame-level query embeddings: a synchronized video-frame modeling paradigm and a synchronized embedding optimization strategy. The former promotes mutual learning between frame-level and video-level embeddings, while the latter divides long video sequences into short clips for easier optimization. On this page, we provide further experiments with our approach and additional visualizations, covering both representative scenarios and failure cases, together with our analysis.
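The two ideas above can be illustrated with a minimal, self-contained sketch. This is not the paper's implementation (which uses transformer attention inside the decoder); `split_into_clips`, `sync_update`, and the mean-pooling update rule are hypothetical simplifications, with `alpha` standing in for how strongly the two embedding levels mix.

```python
def split_into_clips(frames, clip_len):
    """Synchronized embedding optimization (toy view): divide a long
    frame sequence into consecutive sub-clips of at most clip_len frames,
    so each short clip can be optimized more easily than the full video."""
    return [frames[i:i + clip_len] for i in range(0, len(frames), clip_len)]


def sync_update(video_emb, frame_embs, alpha=0.5):
    """Synchronized video-frame modeling (toy view): one round of mutual
    refinement between a video-level embedding and per-frame embeddings,
    using mean pooling instead of the attention used in practice."""
    dim = len(video_emb)
    # Pool frame-level embeddings into a single video-level summary.
    pooled = [sum(f[d] for f in frame_embs) / len(frame_embs) for d in range(dim)]
    # The video-level query absorbs frame-level information...
    new_video = [(1 - alpha) * v + alpha * p for v, p in zip(video_emb, pooled)]
    # ...and each frame-level query is pulled toward the updated video-level query.
    new_frames = [[(1 - alpha) * f[d] + alpha * new_video[d] for d in range(dim)]
                  for f in frame_embs]
    return new_video, new_frames
```

For example, a 10-frame video with `clip_len=4` yields sub-clips of 4, 4, and 2 frames, and repeated `sync_update` calls drive the video-level and frame-level embeddings toward agreement.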
We list the results of building our method upon other popular VIS methods besides IDOL and VITA. It is worth mentioning that TMT-VIS is mainly designed for training on multiple datasets; in our experiments, we evaluate the effectiveness of our model when training on the YouTube-VIS 2019 dataset alone.
**Table 1.** Experiments on integrating our designs into current VIS methods (ResNet-50 backbone).
| Method | AP | Method | AP |
|---|---|---|---|
| Mask2Former | 45.1 | VITA | 49.5 |
| + Synchronized Video-Frame Modeling | 50.3 | + Synchronized Video-Frame Modeling | 53.0 |
| + Synchronized Embedding Optimization | 46.7 | + Synchronized Embedding Optimization | 51.2 |
| + Both (SyncVIS) | 51.5 | + Both (SyncVIS) | 54.2 |
| TMT-VIS | 47.3 | DVIS | 52.6 |
| + Synchronized Video-Frame Modeling | 51.1 | + Synchronized Video-Frame Modeling | 54.9 |
| + Synchronized Embedding Optimization | 48.7 | + Synchronized Embedding Optimization | 54.0 |
| + Both (SyncVIS) | 51.9 | + Both (SyncVIS) | 55.8 |
| GenVIS | 51.3 | IDOL | 49.5 |
| + Synchronized Video-Frame Modeling | 54.4 | + Synchronized Video-Frame Modeling | 55.1 |
| + Synchronized Embedding Optimization | 52.7 | + Synchronized Embedding Optimization | 51.3 |
| + Both (SyncVIS) | 55.4 | + Both (SyncVIS) | 56.5 |
In this part, we present several cases showing that our model is capable of tracking and segmenting fast-moving instances. These results demonstrate that, with our video-frame synchronization, SyncVIS can faithfully capture the trajectories and appearances of fast-moving objects.
We demonstrate that SyncVIS segments and tracks fast-moving racing cars with precision and consistency.
We also demonstrate that SyncVIS segments and tracks a fast-moving skateboarder, capturing the man's pose and movement with precision and consistency.
As for limitations, our model has difficulty segmenting very crowded or heavily occluded scenes. As shown in the frames above, it fails to segment the person behind the front horseman (although it segments most of the horseman himself), showing that heavy occlusion remains a key challenge. Nevertheless, our model still outperforms previous approaches when segmenting complex scenes with multiple instances and occlusions.