- Clone this repo
git clone --recursive git@github.com:3dlg-hcvc/EgoFun3D.git
- Prepare a conda environment
This project combines many different modules, so pip may report several dependency conflicts during installation. These warnings are usually harmless.
cd EgoFun3D
bash install.sh
After installation, make sure the following key packages match these versions:
torch==2.9.1
transformers==4.57.6
vllm==0.15.1
numpy==1.26.4
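The version pins above can be checked after installation with a short script. This is a minimal sketch that relies only on the standard library; `PINNED` and `check_pins` are illustrative names and not part of the repository.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the list above.
PINNED = {
    "torch": "2.9.1",
    "transformers": "4.57.6",
    "vllm": "0.15.1",
    "numpy": "1.26.4",
}


def check_pins(pins, get_version=version):
    """Return {package: (expected, found)} for every pin that is not satisfied."""
    mismatches = {}
    for name, expected in pins.items():
        try:
            found = get_version(name)
        except PackageNotFoundError:
            found = None
        # A local build tag such as "2.9.1+cu128" still counts as a match.
        if found is None or found.split("+")[0] != expected:
            mismatches[name] = (expected, found)
    return mismatches


if __name__ == "__main__":
    for name, (expected, found) in check_pins(PINNED).items():
        print(f"{name}: expected {expected}, found {found or 'not installed'}")
```

An empty result means all four packages match their pins.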
- Download the EgoFun3D dataset.
hf download 3dlg-hcvc/EgoFun3D --repo-type dataset --local-dir full_dataset
- Set up environment variables.
export GEMINI_API_KEY=$YOUR_GEMINI_API_KEY
export OPENAI_API_KEY=$YOUR_OPENAI_API_KEY
export VLLM_WORKER_MULTIPROC_METHOD=spawn
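Before launching a long run, it can help to confirm that these variables are visible to Python. The sketch below is a hypothetical helper, not part of the repository. It assumes all three variables are required, although in practice you may only need the API key for the VLM you actually use.

```python
import os

# None means "any non-empty value"; a string means "exactly this value".
EXPECTED = {
    "GEMINI_API_KEY": None,
    "OPENAI_API_KEY": None,
    "VLLM_WORKER_MULTIPROC_METHOD": "spawn",
}


def env_problems(expected, environ=None):
    """Return a list of human-readable problems; an empty list means all good."""
    environ = os.environ if environ is None else environ
    problems = []
    for name, required in expected.items():
        value = environ.get(name, "")
        if not value:
            problems.append(f"{name} is not set")
        elif required is not None and value != required:
            problems.append(f"{name} is {value!r}, expected {required!r}")
    return problems


if __name__ == "__main__":
    for problem in env_problems(EXPECTED):
        print(problem)
```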
The segmentation flow is:
- Run the selected segmentation model on 20 seed frames.
- Propagate those masks to all frames with SAM3.
- Save one mask archive per role in segmentation_masks.h5.
- Use those saved results in the downstream evaluation scripts.
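To make the first step concrete, the sketch below picks 20 seed frames from a video. It assumes the seeds are spread evenly over the clip; the repository may select them differently, and `seed_frame_indices` is a hypothetical name used only for illustration.

```python
def seed_frame_indices(num_frames, num_seeds=20):
    """Pick up to num_seeds evenly spaced frame indices in [0, num_frames)."""
    if num_frames <= 0 or num_seeds <= 0:
        return []
    if num_frames <= num_seeds:
        # Short clip: every frame is a seed frame.
        return list(range(num_frames))
    if num_seeds == 1:
        return [0]
    # Integer arithmetic keeps the first and last frame and avoids duplicates.
    return [i * (num_frames - 1) // (num_seeds - 1) for i in range(num_seeds)]
```

Under this assumption, the segmentation model runs only on the returned indices and SAM3 propagates the resulting masks to the remaining frames.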
Use eval_segmentation.py for the staged release workflow. By default the release segmentation configs use segmentation.frame_subsample=20 and segmentation.propagate_with_sam3=true.
Example:
python eval_segmentation.py segmentation=VisionReasoner vlm_segmentation=gemini_segmentation
If you want to run segmentation with ground-truth part labels instead of VLM-predicted labels:
python eval_segmentation.py gt_labels=true segmentation=VisionReasoner
If you want Gemini part labels precomputed instead of queried during segmentation, first cache the VLM output:
python eval_segmentation.py \
vlm_only=true \
save_shared_vlm=true \
segmentation=VisionReasoner \
vlm_segmentation=gemini_segmentation
Then run segmentation from the cached labels:
python eval_segmentation.py \
from_shared_vlm=true \
disable_vlm_calls=true \
segmentation=VisionReasoner \
vlm_segmentation=gemini_segmentation
To run SAM3 Agent segmentation, start a vLLM server first, then run the segmentation script. We recommend running the VLM on one GPU and segmentation on another to avoid out-of-memory (OOM) errors.
CUDA_VISIBLE_DEVICES=1 vllm serve Qwen/Qwen3-VL-8B-Thinking --max-model-len 65536 --port 8001 &
VLLM_PID=$!
# Wait for the server to be ready
until curl -s http://localhost:8001/health > /dev/null 2>&1; do
echo "Waiting for vLLM server to start..."
sleep 10
done
echo "vLLM server is ready!"
python eval_segmentation.py save_shared_vlm=true segmentation=SAM3Agent
kill $VLLM_PID
wait $VLLM_PID 2>/dev/null
Results are saved under outputs/{exp_name}/{time}/{video_name}/segmentation/.
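The shell loop above waits indefinitely if the vLLM server never comes up. The following sketch shows the same readiness check in Python with an upper bound on the wait. `server_is_up` and `wait_until` are hypothetical helpers; the only detail taken from the script above is the `/health` endpoint on port 8001.

```python
import http.client
import time
import urllib.request


def server_is_up(url, timeout=5.0):
    """Return True if a GET on `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # Connection refused, timeout, HTTP error status, or a malformed reply.
        return False


def wait_until(check, max_wait=600.0, interval=10.0,
               sleep=time.sleep, clock=time.monotonic):
    """Poll check() until it returns True or max_wait seconds have passed."""
    deadline = clock() + max_wait
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


# Example (not executed here):
# ready = wait_until(lambda: server_is_up("http://localhost:8001/health"))
```

If `wait_until` returns False, the server process can be terminated and the error reported instead of blocking the job.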
The reconstruction flow is:
- Load 2D segmentation results from the previous step or from the dataset.
- Run reconstruction on the input videos.
- Align moving parts to the initial state using RoMa.
- Build meshes.
To run reconstruction on the ground-truth 2D segmentation:
python eval_reconstruction.py reconstruction=da3
To run reconstruction on the predicted 2D segmentation:
python eval_reconstruction.py reconstruction=da3 pred_mask=True segmentation_results_dir={YOUR SEGMENTATION RESULTS PATH}
You can switch the reconstruction method to mapanything or vipe.
Results are saved under outputs/{exp_name}/{time}/{video_name}/reconstruction/.
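Since later stages take the results directory of an earlier stage as input, it can be useful to list which videos in a run already have output for a given stage. The sketch below assumes only the outputs/{exp_name}/{time}/{video_name}/{stage}/ layout described above; `stage_dirs` is a hypothetical helper, not a function from the repository.

```python
from pathlib import Path


def stage_dirs(run_dir, stage):
    """Map video name -> stage directory for one outputs/{exp_name}/{time} run.

    Only videos whose `stage` sub-directory exists are included.
    """
    run_dir = Path(run_dir)
    return {
        video.name: video / stage
        for video in sorted(run_dir.iterdir())
        if (video / stage).is_dir()
    }


# Example (not executed here), with exp_name and time filled in by you:
# stage_dirs("outputs/<exp_name>/<time>", "reconstruction")
```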
Articulation estimation takes the reconstruction results as input and estimates articulation parameters, so run reconstruction before running articulation estimation.
To run articulation estimation on the ground-truth 2D segmentation:
python eval_articulation.py articulation=iTACO reconstruction_results_dir={YOUR RECONSTRUCTION RESULTS PATH}
Similarly, to run articulation estimation on the predicted 2D segmentation:
python eval_articulation.py articulation=iTACO reconstruction_results_dir={YOUR RECONSTRUCTION RESULTS PATH} pred_mask=True segmentation_results_dir={YOUR SEGMENTATION RESULTS PATH}
You can switch the articulation method to Artipoint.
Results are saved under outputs/{exp_name}/{time}/{video_name}/articulation/.
Function template prediction first marks the receptor and effector in different colors on the original video, then queries a VLM to predict the function template.
python eval_function.py vlm_function=gemini_function
You can switch the VLM function config to gpt_function, qwen_function, or molmo_function.
Results are saved under outputs/{exp_name}/{time}/{video_name}/function/.
pipeline.py runs the complete pipeline on an arbitrary video file with no UI.
Part labels are auto-detected by a Qwen VLM if not supplied; pass --gemini_key to use Gemini instead.
python pipeline.py \
--video /path/to/video.mp4 \
--output_dir /path/to/outputs
Supply part labels directly to skip VLM auto-detection:
python pipeline.py \
--video /path/to/video.mp4 \
--output_dir /path/to/outputs \
--receptor "faucet handle" \
--effector "faucet spout"
Override the segmentation config (default: VisionReasoner) or the VLM function config:
python pipeline.py \
--video /path/to/video.mp4 \
--output_dir /path/to/outputs \
--seg_config config/segmentation/MolmoSAM.yaml \
--vlm_function_config config/vlm_function/gemini_function.yaml
Outputs follow the evaluation suite layout (reconstruction/, articulation/, function/, compile/).
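To process a folder of videos, the pipeline.py invocations above can be generated programmatically. This is a sketch, not a script shipped with the repository. It uses only the `--video`, `--output_dir`, `--receptor`, and `--effector` flags shown above, while `pipeline_command`, `batch_commands`, and the one-sub-directory-per-video layout are assumptions made for illustration.

```python
import sys
from pathlib import Path


def pipeline_command(video, output_dir, receptor=None, effector=None):
    """Build the argument list for one pipeline.py run."""
    cmd = [sys.executable, "pipeline.py",
           "--video", str(video),
           "--output_dir", str(output_dir)]
    # Part labels are optional; without them the VLM auto-detects them.
    if receptor is not None:
        cmd += ["--receptor", receptor]
    if effector is not None:
        cmd += ["--effector", effector]
    return cmd


def batch_commands(video_dir, output_root, pattern="*.mp4"):
    """One command per video, each writing to its own sub-directory."""
    return [
        pipeline_command(video, Path(output_root) / video.stem)
        for video in sorted(Path(video_dir).glob(pattern))
    ]
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` from the repository root. Running the videos one at a time keeps GPU memory use predictable.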
We also provide a Gradio interface to run the full pipeline interactively.
python gradio.py
Once all the results from the previous steps are available, we can compile an executable function script to run in a physics simulator. We currently support compiling fluid and geometry functions. Fluid functions execute in Genesis and geometry functions execute in Isaac Sim.
For Genesis, simply run
pip install genesis-world==0.4.4
For Isaac Sim, we run it through Isaac Lab, so please follow the Isaac Lab installation guide to prepare an Isaac Lab environment. We suggest creating a separate conda environment for Isaac Lab.
To compile the executable function script, run
python compile/compile.py --reconstruction_dir {PATH TO RECONSTRUCTION DIR} --articulation_dir {PATH TO ARTICULATION DIR} --function_dir {PATH TO FUNCTION DIR} --output_dir {PATH TO EXECUTABLE FUNCTION SCRIPT OUTPUT DIR}
Please refer to compile/compile.py for more information on the input parameters.
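When the three stage folders sit side by side under one video directory, as in the pipeline.py output layout, the compile command can be assembled and checked before it runs. The sketch below makes two assumptions: that layout, and that each `--*_dir` flag expects the per-video stage folder. `compile_command` is a hypothetical helper.

```python
import sys
from pathlib import Path

STAGES = ("reconstruction", "articulation", "function")


def compile_command(video_dir, output_dir):
    """Build the compile/compile.py argument list for one video directory.

    Raises FileNotFoundError if any required stage folder is missing.
    """
    video_dir = Path(video_dir)
    missing = [s for s in STAGES if not (video_dir / s).is_dir()]
    if missing:
        raise FileNotFoundError(
            f"missing stage folders in {video_dir}: {', '.join(missing)}"
        )
    cmd = [sys.executable, "compile/compile.py"]
    for stage in STAGES:
        cmd += [f"--{stage}_dir", str(video_dir / stage)]
    cmd += ["--output_dir", str(output_dir)]
    return cmd
```

Failing early on a missing folder avoids starting a compile run that cannot finish.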
After generating the URDF and function script, you can run a fluid function script directly. For a geometry function, move the URDF and the script to the IsaacLab/scripts/tutorials/01_assets/ folder, then convert the URDF to USD by running
./isaaclab.sh -p scripts/tools/convert_urdf.py PATH_TO_URDF OUTPUT_PATH_TO_USD
and update the USD path in the function script. Finally, you can test the function script by running
./isaaclab.sh -p scripts/tutorials/01_assets/{YOUR_SCRIPT_NAME}
This work was funded in part by a Canada Research Chair and an NSERC Discovery Grant, and enabled by support from the Digital Research Alliance of Canada. The authors would like to thank Tianrun Hu from the National University of Singapore for collecting data, and Jiayi Liu, Xingguang Yan, Austin T. Wang, and Morteza Badali for valuable discussions and proofreading.
This codebase is built on top of VisionReasoner, Sa2VA, X-SAM, Molmo2, Qwen3-VL, SAM2, SAM3, Depth-Anything-3, Map-Anything, ViPE, Artipoint, and iTACO. We thank the authors for open-sourcing these invaluable projects.
If you find our project useful, please cite our paper:
@article{peng2026egofun3d,
title={{EgoFun3D: Modeling Interactive Objects from Egocentric Videos using Function Templates}},
author={Peng, Weikun and Iliash, Denys and Savva, Manolis},
journal={arXiv preprint arXiv:2604.11038},
year={2026}
}