Mainak Singha, Sarthak Mehrotra, Paolo Casari, Subhasis Chaudhuri, Elisa Ricci, Biplab Banerjee
Recent vision-language models (VLMs) such as CLIP demonstrate impressive cross-modal reasoning, extending beyond images to 3D perception. Yet, these models remain fragile under domain shifts, especially when adapting from synthetic to real-world point clouds. Conventional 3D domain adaptation approaches rely on heavy trainable encoders, yielding strong accuracy but at the cost of efficiency. We introduce CLIPoint3D, the first framework for few-shot unsupervised 3D point cloud domain adaptation built upon CLIP. Our approach projects 3D samples into multiple depth maps and exploits the frozen CLIP backbone, refined through a knowledge-driven prompt tuning scheme that integrates high-level language priors with geometric cues from a lightweight 3D encoder. To adapt task-specific features effectively, we apply parameter-efficient fine-tuning to CLIP's encoders and design an entropy-guided view sampling strategy for selecting confident projections. Furthermore, an optimal transport-based alignment loss and an uncertainty-aware prototype alignment loss collaboratively bridge source-target distribution gaps while maintaining class separability. Extensive experiments on PointDA-10 and GraspNetPC-10 benchmarks show that CLIPoint3D achieves consistent 3-16% accuracy gains over both CLIP-based and conventional encoder-based baselines.
conda env create -f environment.yml
conda activate clipoint3d

The environment pins Python 3.9.20 and all dependencies, including PyTorch 2.5.1 with CUDA 12.
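After activation, a quick sanity check confirms the interpreter and (optionally) the PyTorch install; the expected versions come from `environment.yml`:

```shell
# Optional sanity check, run inside the activated clipoint3d env.
python --version                                      # expect: Python 3.9.20
python -c "import torch; print(torch.__version__)" 2>/dev/null \
  || echo "torch not importable in this shell"
```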
Follow the installation instructions in Dassl.pytorch/README.md. The relevant steps from their guide are:
cd Dassl.pytorch/
# Install dependencies
pip install -r requirements.txt
# Install this library (no need to re-build if the source code is modified)
python setup.py develop
cd ..

CLIP weights are downloaded automatically on first use via the `clip` library. Ensure you have internet access on the first run, or pre-download the ViT-B/16 weights.
Download the PointDA-10 dataset and place it under PointDA_data/:
PointDA_data/
├── shapenet/
├── modelnet/
└── scannet/
Download the GraspNet point cloud data and place it under GraspNetPointClouds/:
GraspNetPointClouds/
├── synthetic/
├── kinect/
└── realsense/
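Before launching training, it can help to confirm that the layouts above are in place. A small pre-flight check (hypothetical helper, not part of the repo):

```shell
# Verify the expected dataset folders exist relative to the repo root.
for d in PointDA_data/shapenet PointDA_data/modelnet PointDA_data/scannet \
         GraspNetPointClouds/synthetic GraspNetPointClouds/kinect GraspNetPointClouds/realsense; do
  if [ -d "$d" ]; then echo "found:   $d"; else echo "missing: $d"; fi
done
```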
python train.py \
--config-file configs/trainers/trainer_200.yaml \
--dataset-config-file configs/datasets/pointda_shapenet_modelnet.yaml \
--output-dir experiments/run1 \
--seed 42 \
--use_sinkhorn_loss \
--use_entropy_loss \
--use_confidence_sampling

| Argument | Default | Description |
|---|---|---|
| `--config-file` | `configs/trainers/trainer.yaml` | Trainer configuration |
| `--dataset-config-file` | `configs/datasets/pointda_shapenet_modelnet.yaml` | Dataset configuration |
| `--output-dir` | `test_runs_with_sinkhorn` | Output directory for checkpoints and logs |
| `--root` | `PointDA_data` | Path to dataset root |
| `--seed` | `42` | Random seed (positive = fixed) |
| `--source-domains` | — | Override source domains |
| `--target-domains` | — | Override target domains |
| `--use_sinkhorn_loss` | off | Optimal transport loss between source/target |
| `--use_entropy_loss` | off | Entropy minimization on target predictions |
| `--use_align_loss` | off | Direct feature alignment loss |
| `--use_prototype_loss` | off | Prototype-based domain alignment |
| `--use_kl_loss` | off | KL divergence loss |
| `--use_w1_loss` | off | Wasserstein-1 distance loss |
| `--use_confidence_sampling` | off | Sample target points by prediction confidence |
Output is saved to `<output-dir>/<model>/<source>/<target>/`.
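`train_single.sh` runs all PointDA domain pairs; a dry-run sketch of what such a loop looks like (the pair list and output layout below are assumptions, not the script's exact contents — drop the `echo` to actually launch the runs):

```shell
# Dry run: print one training command per PointDA source->target pair.
# The pair list is an assumption; see train_single.sh for the real one.
for pair in shapenet_modelnet shapenet_scannet modelnet_shapenet \
            modelnet_scannet scannet_modelnet scannet_shapenet; do
  echo python train.py \
    --config-file configs/trainers/trainer_200.yaml \
    --dataset-config-file "configs/datasets/pointda_${pair}.yaml" \
    --output-dir "experiments/${pair}" \
    --seed 42 --use_sinkhorn_loss --use_entropy_loss
done
```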
Configs use YACS and are split into two files:
- Trainer config (`configs/trainers/`): model architecture, optimizer, batch size, learning rate, number of context tokens. The recommended config is `trainer_200.yaml`.
- Dataset config (`configs/datasets/`): dataset name and source/target domain names. Named `pointda_<source>_<target>.yaml` or `graspnet_<source>_<target>.yaml`.
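As an illustration, a dataset config might look like the following sketch. The field names follow Dassl.pytorch's config schema and are assumptions here; check the shipped YAMLs for the exact keys:

```yaml
# Assumed sketch of configs/datasets/pointda_shapenet_modelnet.yaml
DATASET:
  NAME: "PointDA"              # assumed registry name
  SOURCE_DOMAINS: ["shapenet"]
  TARGET_DOMAINS: ["modelnet"]
```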
You can also override any config value directly from the command line using YACS syntax at the end of the command:
python train.py ... OPTIM.LR 0.001 DATALOADER.TRAIN_X.BATCH_SIZE 32

Key trainer config options:
MODEL:
  NAME: CLIPoint3D
  BACKBONE:
    NAME: "ViT-B/16"   # CLIP backbone

OPTIM:
  NAME: "sgd"
  LR: 0.002
  MAX_EPOCH: 200
  LR_SCHEDULER: "cosine"

TRAINER:
  MODEL:
    N_CTX: 4           # Number of learnable context tokens in prompts
    PREC: "fp32"       # Precision: fp32, fp16, or amp

clipoint3d/
├── train.py # Entry point
├── trainer.py # Trainer class with loss implementations
├── environment.yml # Conda environment spec
├── train_single.sh # Run all PointDA domain pairs
├── train_graspnet.sh # Run all GraspNet domain pairs
├── ablations.sh # Ablation study runs
├── models/
│ ├── model.py # Main model (PointNet + CLIP + cross-attention)
│ ├── pointnet.py # PointNet 3D encoder
│ ├── prompt_learner.py # Learnable text prompt module
│ ├── text_encoder.py # CLIP text encoder wrapper
│ ├── image_encoder.py # CLIP image encoder wrapper
│ ├── cross_attention.py # Cross-modal attention module
│ └── lora.py # LoRA parameter-efficient fine-tuning
├── clip/ # CLIP model integration
├── utils/
│ ├── config_defaults.py # YACS config defaults
│ ├── dataloader.py # Data loading utilities
│ ├── loss.py # Domain adaptation loss functions
│ ├── render.py # Point cloud -> multi-view image renderer
│ └── peft_utils.py # Parameter-efficient fine-tuning helpers
├── configs/
│ ├── datasets/ # Dataset YAML configs
│ └── trainers/ # Trainer YAML configs
├── Dassl.pytorch/ # Domain adaptation framework
├── PointDA_data/ # PointDA dataset (ShapeNet/ModelNet/ScanNet)
└── GraspNetPointClouds/ # GraspNet dataset
@article{singha2026clipoint3d,
  title={CLIPoint3D: Language-Grounded Few-Shot Unsupervised 3D Point Cloud Domain Adaptation},
  author={Singha, Mainak and Mehrotra, Sarthak and Casari, Paolo and Chaudhuri, Subhasis and Ricci, Elisa and Banerjee, Biplab},
  journal={arXiv preprint arXiv:2602.20409},
  year={2026}
}