Chenchen Zhu, Saksham Suri, Cijo Jose, Maxime Oquab, Marc Szafraniec, Wei Wen, Yunyang Xiong, Patrick Labatut, Piotr Bojanowski, Raghuraman Krishnamoorthi, Vikas Chandra
[ 📜 Paper] [ 🤗 HF] [ 📖 BibTeX]
Reference PyTorch implementation and models for EUPE. For details, see the EUPE paper.
Applying our distillation recipe (EUPE) to ViT-B gives a well-balanced universal encoder that excels at
diverse task domains compared to both ViT-B domain experts and existing agglomerative ViT-Bs.
EUPE is an extended family of versatile, efficient vision encoders that produce high-quality features and achieve outstanding performance on various vision tasks, including image understanding, dense prediction and vision-language modeling.
ℹ️ Please follow the links provided below to get access to all the model weights. The URLs can then be used to download the model weights to a local filesystem, and torch.hub.load() can be pointed to these local weights via the weights parameter.
See the example code snippets below.
ViT models pretrained on a web dataset (LVD-1689M):
| Model | Parameters | Download |
|---|---|---|
| ViT-T/16 | 6M | [link] |
| ViT-S/16 | 21M | [link] |
| ViT-B/16 | 86M | [link] |
ConvNeXt models pretrained on a web dataset (LVD-1689M):
| Model | Parameters | Download |
|---|---|---|
| ConvNeXt Tiny | 29M | [link] |
| ConvNeXt Small | 50M | [link] |
| ConvNeXt Base | 89M | [link] |
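Once a download URL from the tables above is in hand, fetching the checkpoint to a local directory can be sketched as follows. This is a minimal illustration using only the Python standard library; the helper name `download_checkpoint` and the `checkpoints` directory are hypothetical and not part of the EUPE repository.

```python
import shutil
import urllib.parse
import urllib.request
from pathlib import Path


def download_checkpoint(url: str, dest_dir: str) -> Path:
    """Download `url` into `dest_dir` and return the local file path.

    The file name is taken from the URL path, so any query string
    is ignored when naming the local file.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    file_name = Path(urllib.parse.unquote(urllib.parse.urlparse(url).path)).name
    target = dest / file_name
    if not target.exists():  # skip the download if the file is already there
        partial = target.with_name(target.name + ".part")
        with urllib.request.urlopen(url) as response, open(partial, "wb") as out:
            shutil.copyfileobj(response, out)
        partial.replace(target)  # rename only once the download has completed
    return target


# Hypothetical usage; replace the placeholder with a real download URL.
# local_path = download_checkpoint("<CHECKPOINT/URL>", "checkpoints")
```

The returned path can then be passed as the weights argument of torch.hub.load().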
Pretrained backbones (via PyTorch Hub)
Please follow the instructions here to install PyTorch (the only required dependency for loading the model). Installing PyTorch with CUDA support is strongly recommended.
import torch
REPO_DIR = <PATH/TO/A/LOCAL/DIRECTORY/WHERE/THE/EUPE/REPO/WAS/CLONED>
# EUPE ViT models pretrained on web images
eupe_vitt16 = torch.hub.load(REPO_DIR, 'eupe_vitt16', source='local', weights=<CHECKPOINT/URL/OR/PATH>)
eupe_vits16 = torch.hub.load(REPO_DIR, 'eupe_vits16', source='local', weights=<CHECKPOINT/URL/OR/PATH>)
eupe_vitb16 = torch.hub.load(REPO_DIR, 'eupe_vitb16', source='local', weights=<CHECKPOINT/URL/OR/PATH>)
# EUPE ConvNeXt models pretrained on web images
eupe_convnext_tiny = torch.hub.load(REPO_DIR, 'eupe_convnext_tiny', source='local', weights=<CHECKPOINT/URL/OR/PATH>)
eupe_convnext_small = torch.hub.load(REPO_DIR, 'eupe_convnext_small', source='local', weights=<CHECKPOINT/URL/OR/PATH>)
eupe_convnext_base = torch.hub.load(REPO_DIR, 'eupe_convnext_base', source='local', weights=<CHECKPOINT/URL/OR/PATH>)

Please use the following transform (standard ImageNet evaluation transform):
import torch
import torchvision
from torchvision.transforms import v2

def make_transform(resize_size: int = 256):
    # Convert the input (e.g. a PIL image) to a tensor image
    to_tensor = v2.ToImage()
    # Resize to a square of side resize_size
    resize = v2.Resize((resize_size, resize_size), antialias=True)
    # Convert to float32 and rescale values to [0, 1]
    to_float = v2.ToDtype(torch.float32, scale=True)
    # Normalize with the standard ImageNet mean and standard deviation
    normalize = v2.Normalize(
        mean=(0.485, 0.456, 0.406),
        std=(0.229, 0.224, 0.225),
    )
    return v2.Compose([to_tensor, resize, to_float, normalize])

The training and evaluation code requires PyTorch version >= 2.7.1 as well as a few other third-party packages. Note that the code has only been tested with the specified versions and also expects a Linux environment. To set up all the required dependencies for training and evaluation, please follow the instructions below:
micromamba (Recommended) - Clone the repository and then create and activate a eupe conda environment using the provided environment definition:
micromamba env create -f conda.yaml
micromamba activate eupe

Create a folder to host the ADE20K dataset, for example:
export DATASETS_ROOT=${HOME}/datasets
mkdir -p ${DATASETS_ROOT}/ADE20K
with-proxy wget http://data.csail.mit.edu/places/ADEchallenge/ADEChallengeData2016.zip
unzip ADEChallengeData2016.zip -d ${DATASETS_ROOT}/ADE20K
After unzipping the data file, the directory structure should be similar to the following,
the training images:
images/training/ADE_train_00000001.jpg
images/training/ADE_train_00000002.jpg
...
images/training/ADE_train_00020210.jpg
the corresponding annotation masks for the training images:
annotations/training/ADE_train_00000001.png
annotations/training/ADE_train_00000002.png
...
annotations/training/ADE_train_00020210.png
the validation images:
images/validation/ADE_val_00000001.jpg
images/validation/ADE_val_00000002.jpg
...
images/validation/ADE_val_00002000.jpg
the corresponding annotation masks for the validation images:
annotations/validation/ADE_val_00000001.png
annotations/validation/ADE_val_00000002.png
...
annotations/validation/ADE_val_00002000.png
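The layout above can be checked with a short script before launching an evaluation. The following is a minimal sketch, assuming that `root` is the directory that directly contains `images/` and `annotations/` (after unzipping, this may be a subfolder of `${DATASETS_ROOT}/ADE20K`). The function name `check_ade20k_layout` is hypothetical, and the expected counts are read off the last file index in each listing.

```python
from pathlib import Path

# Expected number of files per folder, taken from the last index in each
# listing above: (kind, split, file suffix) -> count.
ADE20K_EXPECTED = {
    ("images", "training", ".jpg"): 20210,
    ("annotations", "training", ".png"): 20210,
    ("images", "validation", ".jpg"): 2000,
    ("annotations", "validation", ".png"): 2000,
}


def check_ade20k_layout(root, expected=ADE20K_EXPECTED):
    """Return a list of problems; an empty list means the layout looks right."""
    root = Path(root)
    problems = []
    for (kind, split, suffix), count in expected.items():
        folder = root / kind / split
        found = len(list(folder.glob(f"*{suffix}"))) if folder.is_dir() else 0
        if found != count:
            problems.append(
                f"{kind}/{split}: expected {count} '{suffix}' files, found {found}"
            )
    return problems


# Hypothetical usage:
# for problem in check_ade20k_layout("<PATH/TO/DATASET>"):
#     print(problem)
```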
Note: annotation masks contain labels ranging from 0 to 150, where 0 refers to "other objects". We do not consider those pixels in our evaluation.
objectInfo150.txt contains the information about the labels of the 150 semantic categories, including indices, pixel ratios and names.
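One common way to exclude label 0 from evaluation is to map it to an ignore index and shift the remaining labels down by one, so the 150 categories occupy indices 0 to 149. The sketch below illustrates that convention in plain Python on nested lists; it is not necessarily how the EUPE evaluation code handles it, and `IGNORE_INDEX = 255` as well as both function names are assumptions made for this example.

```python
IGNORE_INDEX = 255  # assumed sentinel; any value outside 0..149 would do


def remap_ade20k_labels(mask):
    """Send label 0 to IGNORE_INDEX and shift labels 1..150 to 0..149."""
    return [[IGNORE_INDEX if v == 0 else v - 1 for v in row] for row in mask]


def pixel_accuracy(pred, target):
    """Fraction of non-ignored pixels where the prediction matches the target."""
    correct = total = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if t == IGNORE_INDEX:
                continue  # "other objects" pixels do not count
            total += 1
            correct += int(p == t)
    return correct / total if total else 0.0


raw_mask = [[0, 1], [150, 2]]
target = remap_ade20k_labels(raw_mask)  # [[255, 0], [149, 1]]
prediction = [[7, 0], [149, 5]]
print(pixel_accuracy(prediction, target))  # 2 of the 3 counted pixels match
```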
Create a folder to host the NYU dataset for example:
export DEPTH_DATASETS_ROOT=${HOME}/datasets
mkdir -p ${DEPTH_DATASETS_ROOT}/NYU
Option 1: We use the NYU subset extracted by BTS from the 120k samples of the original NYU raw dataset.
Please follow BTS instructions to create the dataset:
Make sure you also download the train and test splits:
wget https://raw.githubusercontent.com/cleinc/bts/master/train_test_inputs/nyudepthv2_train_files_with_gt.txt -O ${DEPTH_DATASETS_ROOT}/NYU/nyu_train.txt
wget https://raw.githubusercontent.com/cleinc/bts/master/train_test_inputs/nyudepthv2_test_files_with_gt.txt -O ${DEPTH_DATASETS_ROOT}/NYU/nyu_test.txt
Option 2: Alternatively, one can download the dataset from the following Google Drive link. If the Google Drive link is not available anymore, try Option 1.
Expected contents:
$DEPTH_DATASETS_ROOT/NYU/basement/[...]
$DEPTH_DATASETS_ROOT/NYU/basement_0001a/[...]
$DEPTH_DATASETS_ROOT/NYU/basement_0001b/[...]
$DEPTH_DATASETS_ROOT/NYU/bathroom/[...]
$DEPTH_DATASETS_ROOT/NYU/[...]
$DEPTH_DATASETS_ROOT/NYU/study_room_0004/[...]
$DEPTH_DATASETS_ROOT/NYU/study_room_0005a/[...]
$DEPTH_DATASETS_ROOT/NYU/study_room_0005b/[...]
$DEPTH_DATASETS_ROOT/NYU/nyu_test.txt
$DEPTH_DATASETS_ROOT/NYU/nyu_train.txt
Note: if data is downloaded with Option 2 make sure to rename nyu into NYU.
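To confirm that the split files and the scene folders agree, the entries of a split file can be checked against the files on disk. The sketch below assumes each line of `nyu_train.txt` or `nyu_test.txt` holds three whitespace-separated fields (RGB image path, depth map path, focal length) with paths relative to the NYU folder; that format is an assumption about the BTS split files and is worth verifying on the downloaded files. Both function names are hypothetical.

```python
from pathlib import Path


def parse_split_line(line):
    """Split one line into (rgb_path, depth_path, focal_length)."""
    fields = line.split()
    if len(fields) != 3:
        raise ValueError(f"expected 3 fields, got {len(fields)}: {line!r}")
    rgb, depth, focal = fields
    # Strip any leading slash so the paths can be joined to the dataset root.
    return rgb.lstrip("/"), depth.lstrip("/"), float(focal)


def find_missing_files(split_file, data_root):
    """Return every image or depth path listed in the split file but absent on disk."""
    data_root = Path(data_root)
    missing = []
    for line in Path(split_file).read_text().splitlines():
        if not line.strip():
            continue  # tolerate blank lines
        rgb, depth, _ = parse_split_line(line)
        missing += [p for p in (rgb, depth) if not (data_root / p).is_file()]
    return missing


# Hypothetical usage:
# nyu = Path("<PATH/TO/DATASET>") / "NYU"
# print(find_missing_files(nyu / "nyu_test.txt", nyu))
```

A `ValueError` on the very first line usually means the file is not a plain-text split list, for instance an HTML page saved by mistake.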
The root directory of the ImageNet dataset should hold the following contents:
<ROOT>/test/ILSVRC2012_test_00000001.JPEG
<ROOT>/test/[..]
<ROOT>/test/ILSVRC2012_test_00100000.JPEG
<ROOT>/train/n01440764/n01440764_10026.JPEG
<ROOT>/train/[...]
<ROOT>/train/n15075141/n15075141_9993.JPEG
<ROOT>/val/n01440764/ILSVRC2012_val_00000293.JPEG
<ROOT>/val/[...]
<ROOT>/val/n15075141/ILSVRC2012_val_00049174.JPEG
<ROOT>/labels.txt
The provided dataset implementation expects a few additional metadata files to be present under the extra directory:
<EXTRA>/class-ids-TRAIN.npy
<EXTRA>/class-ids-VAL.npy
<EXTRA>/class-names-TRAIN.npy
<EXTRA>/class-names-VAL.npy
<EXTRA>/entries-TEST.npy
<EXTRA>/entries-TRAIN.npy
<EXTRA>/entries-VAL.npy
These metadata files can be generated (once) with the following lines of Python code:
from eupe.data.datasets import ImageNet

for split in ImageNet.Split:
    dataset = ImageNet(split=split, root="<ROOT>", extra="<EXTRA>")
    dataset.dump_extra()

Note that the root and extra directories do not have to be distinct directories.
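Before launching an evaluation, it can be useful to confirm that all seven metadata files listed above were generated. A minimal sketch using only the standard library follows; the helper name `missing_extra_files` is hypothetical and not part of the EUPE code.

```python
from pathlib import Path

# File names copied from the list of expected metadata files above.
EXPECTED_EXTRA_FILES = [
    "class-ids-TRAIN.npy",
    "class-ids-VAL.npy",
    "class-names-TRAIN.npy",
    "class-names-VAL.npy",
    "entries-TEST.npy",
    "entries-TRAIN.npy",
    "entries-VAL.npy",
]


def missing_extra_files(extra_dir):
    """Return the expected metadata files that are absent from `extra_dir`."""
    extra = Path(extra_dir)
    return [name for name in EXPECTED_EXTRA_FILES if not (extra / name).is_file()]


# Hypothetical usage:
# print(missing_extra_files("<EXTRA>"))  # an empty list means nothing is missing
```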
In order to evaluate the model, run the following evaluation on a single node:
PYTHONPATH=. python eupe/eval/segmentation/run.py \
model.eupe_hub=eupe_vitb16 \
model.pretrained_weights=<PATH/TO/CHECKPOINT.pt> \
config=eupe/eval/segmentation/configs/config-ade20k-linear-training.yaml \
datasets.root=<PATH/TO/DATASET> \
output_dir=<PATH/TO/OUTPUT/DIR>

After the job completes, the output directory you specified will contain:
segmentation_config.yaml, which contains the config you trained the model with;
model_final.pth, the final linear head checkpoint at the end of training; and
results-semantic-segmentation.csv, with the final metrics.
PYTHONPATH=. python eupe/eval/depth/run.py \
model.eupe_hub=eupe_vitb16 \
model.pretrained_weights=<PATH/TO/CHECKPOINT.pt> \
config=eupe/eval/depth/configs/config-nyu.yaml \
datasets.root=<PATH/TO/DATASET> \
output_dir=<PATH/TO/OUTPUT/DIR>

After the job completes, the output directory you specified will contain:
depth_config.yaml, which contains the config you trained the model with;
model_final.pth, the final linear head checkpoint at the end of training; and
results-depth.csv, with the final metrics.
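For either the segmentation or the depth run, the output directory can be inspected programmatically once the job has finished. The sketch below checks that the three files named in the text exist and then loads the results CSV. It assumes the CSV starts with a header row; the column names are not documented here, so the rows are returned as generic dictionaries. The function name `read_results` is hypothetical.

```python
import csv
from pathlib import Path

# Output file names as described for each evaluation; the results CSV is last.
EXPECTED_OUTPUTS = {
    "segmentation": [
        "segmentation_config.yaml",
        "model_final.pth",
        "results-semantic-segmentation.csv",
    ],
    "depth": ["depth_config.yaml", "model_final.pth", "results-depth.csv"],
}


def read_results(output_dir, task):
    """Check the expected outputs of `task` and return the CSV rows as dicts."""
    out = Path(output_dir)
    names = EXPECTED_OUTPUTS[task]
    missing = [name for name in names if not (out / name).is_file()]
    if missing:
        raise FileNotFoundError(f"missing in {out}: {', '.join(missing)}")
    with open(out / names[-1], newline="") as f:
        return list(csv.DictReader(f))  # assumes a header row


# Hypothetical usage:
# for row in read_results("<PATH/TO/OUTPUT/DIR>", "depth"):
#     print(row)
```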
PYTHONPATH=. python -m eupe.run.submit eupe/eval/knn.py \
model.eupe_hub=eupe_vitb16 \
model.pretrained_weights=<PATH/TO/CHECKPOINT.pt> \
output_dir=<PATH/TO/OUTPUT/DIR> \
train.dataset=ImageNet:split=TRAIN:root=<PATH/TO/DATASET>:extra=<PATH/TO/DATASET> \
eval.test_dataset=ImageNet:split=VAL:root=<PATH/TO/DATASET>:extra=<PATH/TO/DATASET>

EUPE code and model weights are released under the FAIR Research License. See LICENSE.md for additional details.
See contributing and the code of conduct.
If you find this repository useful, please consider giving it a star ⭐ and citing it:
@misc{zhu2026eupe,
title={Efficient Universal Perception Encoder},
author={Zhu, Chenchen and Suri, Saksham and Jose, Cijo and Oquab, Maxime and Szafraniec, Marc and Wen, Wei and Xiong, Yunyang and Labatut, Patrick and Bojanowski, Piotr and Krishnamoorthi, Raghuraman and Chandra, Vikas},
year={2026},
eprint={2603.22387},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2603.22387},
}
This project makes use of the excellent DINOv3 library. We are very grateful for their work.