Mainak Singha, Sarthak Mehrotra, Paolo Casari, Subhasis Chaudhuri, Elisa Ricci, Biplab Banerjee
Recent vision-language models (VLMs) such as CLIP demonstrate impressive cross-modal reasoning, extending beyond images to 3D perception. Yet, these models remain fragile under domain shifts, especially when adapting from synthetic to real-world point clouds. Conventional 3D domain adaptation approaches rely on heavy trainable encoders, yielding strong accuracy but at the cost of efficiency. We introduce CLIPoint3D, the first framework for few-shot unsupervised 3D point cloud domain adaptation built upon CLIP. Our approach projects 3D samples into multiple depth maps and exploits the frozen CLIP backbone, refined through a knowledge-driven prompt tuning scheme that integrates high-level language priors with geometric cues from a lightweight 3D encoder. To adapt task-specific features effectively, we apply parameter-efficient fine-tuning to CLIP's encoders and design an entropy-guided view sampling strategy for selecting confident projections. Furthermore, an optimal transport-based alignment loss and an uncertainty-aware prototype alignment loss collaboratively bridge source-target distribution gaps while maintaining class separability. Extensive experiments on PointDA-10 and GraspNetPC-10 benchmarks show that CLIPoint3D achieves consistent 3-16% accuracy gains over both CLIP-based and conventional encoder-based baselines.
conda env create -f environment.yml
conda activate clipoint3d

The environment pins Python 3.9.20 and all dependencies, including PyTorch 2.5.1 with CUDA 12.
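After activation, a quick sanity check confirms the interpreter and (optionally) the PyTorch install; the expected versions come from `environment.yml`:

```shell
# Optional sanity check, run inside the activated clipoint3d env.
python --version                                      # expect: Python 3.9.20
python -c "import torch; print(torch.__version__)" 2>/dev/null \
  || echo "torch not importable in this shell"
```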
Follow the installation instructions in Dassl.pytorch/README.md. The relevant steps from their guide are:
cd Dassl.pytorch/
# Install dependencies
pip install -r requirements.txt
# Install this library (no need to re-build if the source code is modified)
python setup.py develop
cd ..

CLIP weights are downloaded automatically on first use via the `clip` library. Ensure you have internet access on the first run, or pre-download the ViT-B/16 weights.
Download the PointDA-10 dataset and place it under PointDA_data/:
PointDA_data/
├── shapenet/
├── modelnet/
└── scannet/
Download the GraspNet point cloud data and place it under GraspNetPointClouds/:
GraspNetPointClouds/
├── synthetic/
├── kinect/
└── realsense/
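Before launching training, it can help to confirm that the layouts above are in place. A small pre-flight check (hypothetical helper, not part of the repo):

```shell
# Verify the expected dataset folders exist relative to the repo root.
for d in PointDA_data/shapenet PointDA_data/modelnet PointDA_data/scannet \
         GraspNetPointClouds/synthetic GraspNetPointClouds/kinect GraspNetPointClouds/realsense; do
  if [ -d "$d" ]; then echo "found:   $d"; else echo "missing: $d"; fi
done
```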
python train.py \
--config-file configs/trainers/trainer_200.yaml \
--dataset-config-file configs/datasets/pointda_shapenet_modelnet.yaml \
--output-dir experiments/run1 \
--seed 42 \
--use_sinkhorn_loss \
--use_entropy_loss \
--use_confidence_sampling

| Argument | Default | Description |
|---|---|---|
| `--config-file` | `configs/trainers/trainer.yaml` | Trainer configuration |
| `--dataset-config-file` | `configs/datasets/pointda_shapenet_modelnet.yaml` | Dataset configuration |
| `--output-dir` | `test_runs_with_sinkhorn` | Output directory for checkpoints and logs |
| `--root` | `PointDA_data` | Path to dataset root |
| `--seed` | `42` | Random seed (positive = fixed) |
| `--source-domains` | — | Override source domains |
| `--target-domains` | — | Override target domains |
| `--use_sinkhorn_loss` | off | Optimal transport loss between source/target |
| `--use_entropy_loss` | off | Entropy minimization on target predictions |
| `--use_align_loss` | off | Direct feature alignment loss |
| `--use_prototype_loss` | off | Prototype-based domain alignment |
| `--use_kl_loss` | off | KL divergence loss |
| `--use_w1_loss` | off | Wasserstein-1 distance loss |
| `--use_confidence_sampling` | off | Sample target points by prediction confidence |
Output is saved to `<output-dir>/<model>/<source>/<target>/`.
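`train_single.sh` runs all PointDA domain pairs; a dry-run sketch of what such a loop looks like (the pair list and output layout below are assumptions, not the script's exact contents — drop the `echo` to actually launch the runs):

```shell
# Dry run: print one training command per PointDA source->target pair.
# The pair list is an assumption; see train_single.sh for the real one.
for pair in shapenet_modelnet shapenet_scannet modelnet_shapenet \
            modelnet_scannet scannet_modelnet scannet_shapenet; do
  echo python train.py \
    --config-file configs/trainers/trainer_200.yaml \
    --dataset-config-file "configs/datasets/pointda_${pair}.yaml" \
    --output-dir "experiments/${pair}" \
    --seed 42 --use_sinkhorn_loss --use_entropy_loss
done
```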
Configs use YACS and are split into two files:
- Trainer config (`configs/trainers/`): model architecture, optimizer, batch size, learning rate, number of context tokens. The recommended config is `trainer_200.yaml`.
- Dataset config (`configs/datasets/`): dataset name and source/target domain names. Named `pointda_<source>_<target>.yaml` or `graspnet_<source>_<target>.yaml`.
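As an illustration, a dataset config might look like the following sketch. The field names follow Dassl.pytorch's config schema and are assumptions here; check the shipped YAMLs for the exact keys:

```yaml
# Assumed sketch of configs/datasets/pointda_shapenet_modelnet.yaml
DATASET:
  NAME: "PointDA"              # assumed registry name
  SOURCE_DOMAINS: ["shapenet"]
  TARGET_DOMAINS: ["modelnet"]
```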
You can also override any config value directly from the command line using YACS syntax at the end of the command:
python train.py ... OPTIM.LR 0.001 DATALOADER.TRAIN_X.BATCH_SIZE 32

Key trainer config options:
MODEL:
  NAME: CLIPoint3D
  BACKBONE:
    NAME: "ViT-B/16"   # CLIP backbone

OPTIM:
  NAME: "sgd"
  LR: 0.002
  MAX_EPOCH: 200
  LR_SCHEDULER: "cosine"

TRAINER:
  MODEL:
    N_CTX: 4           # Number of learnable context tokens in prompts
    PREC: "fp32"       # Precision: fp32, fp16, or amp

clipoint3d/
├── train.py # Entry point
├── trainer.py # Trainer class with loss implementations
├── environment.yml # Conda environment spec
├── train_single.sh # Run all PointDA domain pairs
├── train_graspnet.sh # Run all GraspNet domain pairs
├── ablations.sh # Ablation study runs
├── models/
│ ├── model.py # Main model (PointNet + CLIP + cross-attention)
│ ├── pointnet.py # PointNet 3D encoder
│ ├── prompt_learner.py # Learnable text prompt module
│ ├── text_encoder.py # CLIP text encoder wrapper
│ ├── image_encoder.py # CLIP image encoder wrapper
│ ├── cross_attention.py # Cross-modal attention module
│ └── lora.py # LoRA parameter-efficient fine-tuning
├── clip/ # CLIP model integration
├── utils/
│ ├── config_defaults.py # YACS config defaults
│ ├── dataloader.py # Data loading utilities
│ ├── loss.py # Domain adaptation loss functions
│ ├── render.py # Point cloud -> multi-view image renderer
│ └── peft_utils.py # Parameter-efficient fine-tuning helpers
├── configs/
│ ├── datasets/ # Dataset YAML configs
│ └── trainers/ # Trainer YAML configs
├── Dassl.pytorch/ # Domain adaptation framework
├── PointDA_data/ # PointDA dataset (ShapeNet/ModelNet/ScanNet)
└── GraspNetPointClouds/ # GraspNet dataset
@article{singha2026clipoint3d,
  title={CLIPoint3D: Language-Grounded Few-Shot Unsupervised 3D Point Cloud Domain Adaptation},
  author={Singha, Mainak and Mehrotra, Sarthak and Casari, Paolo and Chaudhuri, Subhasis and Ricci, Elisa and Banerjee, Biplab},
  journal={arXiv preprint arXiv:2602.20409},
  year={2026}
}