Authors: Jing Zhang, Zhikai Li✉, Xuewen Liu, Qingyi Gu✉
(✉ denotes corresponding author.)
This repository contains the official implementation for the ICLR 2026 paper "Efficient-SAM2: Accelerating SAM2 with Object-Aware Visual Encoding and Memory Retrieval".
SAM2's perception pattern exhibits computational redundancy: i) the mask decoder's focused attention versus the image encoder's broad attention span reveals unnecessary background computation; ii) in the memory bank, only a small subset of tokens contributes significantly to memory attention, and these salient regions exhibit temporal consistency.

For the image encoder, we introduce object-aware Sparse Window Routing (SWR), which assigns object-irrelevant background windows to a lightweight shortcut branch based on the spatial-temporal consistency and perceptual saliency of the object, thus reducing encoding redundancy. For memory attention, we propose object-aware Sparse Memory Retrieval (SMR), which builds a FIFO mask queue to retrieve the most salient memory tokens, reusing the saliency patterns from their first recollection and thereby reducing computational cost.
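As a rough illustration of the window-routing idea above, here is a minimal sketch, not the paper's implementation: windows whose object saliency exceeds a threshold go through the full encoder branch, while background windows take a lightweight shortcut. The function name, `threshold` parameter, and branch callables are all hypothetical.

```python
import torch

def sparse_window_route(windows, saliency, heavy_branch, light_branch, threshold=0.5):
    # Hypothetical sketch of Sparse Window Routing (SWR):
    # route object-relevant windows through the full encoder branch and
    # object-irrelevant background windows through a cheap shortcut.
    keep = saliency > threshold  # (num_windows,) boolean routing mask
    out = torch.empty_like(windows)
    if keep.any():
        out[keep] = heavy_branch(windows[keep])    # full attention encoding
    if (~keep).any():
        out[~keep] = light_branch(windows[~keep])  # lightweight shortcut branch
    return out
```

With a high background ratio, most windows skip the heavy branch, which is where the encoding savings come from.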

Efficient-SAM2 achieves a well-balanced accuracy–speed trade-off.
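The sparse memory retrieval described above can likewise be sketched in a few lines. This is a hedged toy version, assuming a per-frame saliency mask kept in a FIFO queue and reused after its first recollection; the class name, `queue_len`, and `drop_ratio` default are illustrative, not the repository's API.

```python
import torch
from collections import deque

class SparseMemoryRetrieval:
    """Toy sketch of Sparse Memory Retrieval (SMR): keep a FIFO queue of
    per-frame saliency masks and prune memory tokens to the top-k most
    attended ones, reusing each mask after its first recollection."""

    def __init__(self, queue_len=7, drop_ratio=0.95):
        self.mask_queue = deque(maxlen=queue_len)  # FIFO eviction of old masks
        self.drop_ratio = drop_ratio

    def compute_mask(self, attn_scores):
        # attn_scores: (num_tokens,) aggregated memory-attention weights.
        # Keep only the (1 - drop_ratio) fraction with the highest scores.
        n = attn_scores.numel()
        k = max(1, int(n * (1.0 - self.drop_ratio)))
        mask = torch.zeros(n, dtype=torch.bool)
        mask[torch.topk(attn_scores, k).indices] = True
        return mask

    def retrieve(self, memory_tokens, attn_scores=None, reuse_idx=None):
        # Reuse the saliency mask recorded at the tokens' first recollection
        # when available; otherwise compute a fresh mask and enqueue it.
        if reuse_idx is not None and reuse_idx < len(self.mask_queue):
            mask = self.mask_queue[reuse_idx]
        else:
            mask = self.compute_mask(attn_scores)
            self.mask_queue.append(mask)
        return memory_tokens[mask]
```

A `drop_ratio` of 0.95 (as in the inference command below) would keep only 5% of memory tokens, so memory attention operates on a far smaller key/value set.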

The code requires python>=3.10, as well as torch>=2.5.1 and torchvision>=0.20.1. Please follow the instructions here to install the PyTorch and TorchVision dependencies. You can install Efficient-SAM2 on a GPU machine using:
git clone https://github.com/jingjing0419/Efficient-SAM2.git
cd Efficient-SAM2
pip install -e .

To use the SAM 2 predictor and run the example notebooks, jupyter and matplotlib are required and can be installed by:
pip install -e ".[notebooks]"

All the model checkpoints can be downloaded by running:
cd checkpoints && \
./download_ckpts.sh && \
cd ..

The checkpoints can also be downloaded individually from the links in the repository.

To train the bypass branch with tools/train_bypass_all.py, run:
python tools/train_bypass_all.py \
--apply_bypass \
--apply_WB \
--use_wandb \
--train_epoch=5 \
--train_step=32 \
--lr=1e-4 \
--base_video_dir=<PATH-TO-TRAINING-IMAGES> \
--input_mask_dir=<PATH-TO-TRAINING-ANNOTATION> \
--video_list_file=./train_sel_v1.txt \
--output_mask_dir=./outputs/SAV_train/sav_train_pred_pngs \
--dataset='sav_train' \
--sam2_model='base+' \
--bypass_type='bottleneck'

The vos_inference_main.py script can be used to generate predictions for semi-supervised video object segmentation (VOS) evaluation on datasets such as DAVIS, MOSE, or SA-V.
After installing Efficient-SAM2 and its dependencies, the script can be run as follows (the example below uses the SA-V test set). It saves the prediction PNG files to the --output_mask_dir.
Run Efficient-SAM2 inference:
python tools/vos_inference_main.py \
--sam2_model='base+' --Mem_stride=1 --dataset='SAV_test' \
--apply_bypass --apply_WB --dilate_mask --WB_theta=0.7 \
--bypass_ckpt_base='./bypass/ckpt/bypass_bottleneck_base.pth' \
--prune_memory --topk_mask --set_drop_ratio=0.95 \
--output_mask_dir='./outputs2/'

Run SA-V evaluation:
python sav_evaluator.py \
--gt_root <PATH-TO-SAV-TEST/VAL-DATASET-GROUNDTRUTH> \
--pred_root <PATH-TO-MODEL-OUTPUT>

Star this repository if you find it helpful!