COMODO: Cross-Modal Video-to-IMU Distillation for Efficient Egocentric
Human Activity Recognition

Baiyu Chen¹,², Wilson Wongso¹,², Zechen Li¹, Yonchanok Khaokaew¹,², Hao Xue¹,², and Flora Salim¹,²

¹ School of Computer Science and Engineering, University of New South Wales, Sydney, Australia
² ARC Centre of Excellence for Automated Decision Making + Society

📢 News

05/2026: Excited to release our latest work AnyMo, a comprehensive framework for wearable motion understanding, covering synthetic IMU generation, geometry-aware pre-training, motion-language alignment, data resources, and a new benchmark.
05/2026: Excited to share ZARA — Zero-training Activity Reasoning Agents, a training-free, evidence-grounded LLM agent framework for motion time-series reasoning — accepted as an ACL 2026 Oral paper!

🌟 Overview

COMODO is an open-source framework for Cross-Modal Video-to-IMU Distillation for Efficient Egocentric Human Activity Recognition.

🔑 The key features of COMODO:

Self-supervised Cross-modal Knowledge Transfer: We propose COMODO, a cross-modal self-supervised distillation framework that leverages pretrained video and time-series models, enabling label-free knowledge transfer from a stronger modality (video) with richer training data to a weaker modality (IMU) with limited data.
A Self-supervised and Effective Cross-modal Queuing Mechanism: We introduce a cross-modal FIFO queue that maintains video embeddings as a stable and diverse reference distribution for IMU feature distillation, extending the instance queue distribution learning approach from single-modality to cross-modality.
Teacher-Student Model Agnostic: COMODO supports diverse video and time-series pretrained models, enabling flexible teacher-student configurations and future integration with stronger foundation models.
Cross-dataset Generalization: We demonstrate that COMODO maintains superior performance when evaluated on unseen datasets, even outperforming fully supervised models, highlighting its robustness and generalizability for egocentric HAR tasks.
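The queuing mechanism described above can be illustrated with a small sketch. This is one plausible reading of the idea, not the repository's implementation: the teacher (video) and student (IMU) embeddings are each compared against the same FIFO queue of video embeddings, the similarities are turned into distributions with separate temperatures, and the student is trained to match the teacher via cross-entropy. The names `VideoQueue` and `distill_loss` and the default temperature values are hypothetical, and plain Python lists stand in for the tensors a real implementation would use.

```python
import math
from collections import deque


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores, temp):
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [s / temp for s in scores]
    top = max(scaled)
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


class VideoQueue:
    """FIFO queue of video embeddings; the oldest entry is evicted when full."""

    def __init__(self, size):
        self.items = deque(maxlen=size)

    def enqueue(self, video_emb):
        self.items.append(l2_normalize(video_emb))


def distill_loss(imu_emb, video_emb, queue, student_temp=0.1, teacher_temp=0.05):
    """Cross-entropy between teacher and student similarity distributions.

    The temperature defaults are placeholders, not values from the paper.
    """
    imu, vid = l2_normalize(imu_emb), l2_normalize(video_emb)
    anchors = list(queue.items)
    p_teacher = softmax([dot(vid, a) for a in anchors], teacher_temp)
    p_student = softmax([dot(imu, a) for a in anchors], student_temp)
    return -sum(t * math.log(max(s, 1e-12)) for t, s in zip(p_teacher, p_student))


# Toy usage: an IMU embedding aligned with its paired video embedding
# should incur a lower loss than a misaligned one.
queue = VideoQueue(size=3)
for emb in ([1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]):
    queue.enqueue(emb)
aligned = distill_loss([1.0, 0.0], [1.0, 0.0], queue)
misaligned = distill_loss([0.0, 1.0], [1.0, 0.0], queue)
print(aligned < misaligned)
```

Because the queue holds only teacher-side embeddings, no labels are involved at any point; the `--queue_size`, `--student_temp`, and `--teacher_temp` options of the training command presumably control the corresponding quantities.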

📂 Data & Results

All experimental results and ablation study findings can be found in the /results folder.

The /dataset folder contains the train, val, and test splits for each dataset, along with our preprocessing scripts. Specifically, ego4d_subset_ids.txt lists the subset of IMU-containing IDs that we obtained by applying the official filter on the Ego4D website. This is the complete set of data we were able to access.

🚀 Getting started

Cross-modal Self-supervised Distillation

Currently supported pretrained models:

Time-series models: MOMENT, Mantis

Video models: VideoMAE, TimeSformer

Other pretrained models can be used with minor modifications to the code.

To run self-supervised video-to-IMU distillation, use the following command, where [ ] denotes optional parameters:

python train.py \
    --video_ckpt "facebook/timesformer-base-finetuned-k400" \
    --imu_ckpt "paris-noah/Mantis-8M" \
    --dataset_path "DATASET_PATH" \
    --encoded_video_path "ENCODED_VIDEO_PATH" \
    --anchor_video_path "ANCHOR_VIDEO_PATH" \
    [--queue_size QUEUE_SIZE] \
    [--student_temp STUDENT_TEMP] \
    [--teacher_temp TEACHER_TEMP] \
    [--learning_rate LR] \
    [--num_epochs EPOCH] \
    [--batch_size BS] \
    [--num_clips 0] \
    [--seed SEED] \
    [--mlp_hidden_dim MLP_HIDDEN_DIM] \
    [--mlp_output_dim MLP_OUTPUT_DIM] \
    [--reduction "concat"] \
    [--is_raw true]
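When launching several runs (for example, different teacher-student pairs), it can help to assemble the command programmatically. The sketch below is a hypothetical helper, not part of the repository: `build_train_command` only uses the flag names shown in the command above, and the seed value is an arbitrary placeholder.

```python
import shlex

# Flags that appear without [ ] in the command above.
REQUIRED = (
    "video_ckpt",
    "imu_ckpt",
    "dataset_path",
    "encoded_video_path",
    "anchor_video_path",
)


def build_train_command(required, optional=None, script="train.py"):
    """Return the argument list for one distillation run."""
    missing = [key for key in REQUIRED if key not in required]
    if missing:
        raise ValueError(f"missing required arguments: {missing}")
    args = ["python", script]
    for key in REQUIRED:
        args += [f"--{key}", str(required[key])]
    for key, value in (optional or {}).items():
        args += [f"--{key}", str(value)]
    return args


cmd = build_train_command(
    {
        "video_ckpt": "facebook/timesformer-base-finetuned-k400",
        "imu_ckpt": "paris-noah/Mantis-8M",
        "dataset_path": "DATASET_PATH",
        "encoded_video_path": "ENCODED_VIDEO_PATH",
        "anchor_video_path": "ANCHOR_VIDEO_PATH",
    },
    optional={"seed": 42},  # placeholder value
)
print(shlex.join(cmd))  # shell-quoted string, suitable for logging
```

The returned list can be passed directly to `subprocess.run(cmd, check=True)`, which avoids shell-quoting issues with paths that contain spaces.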

Unsupervised Representation Learning Evaluation

We evaluate the learned IMU representations in an unsupervised manner (see Section 3.2 in our paper): we train a Support Vector Machine (SVM) on the extracted IMU features and evaluate classification accuracy on the test set. Run the following command to start the evaluation:

python unsupervised_rep_test.py \
    --imu_ckpt "AutonLab/MOMENT-1-small" \
    --model_path "MODEL_WEIGHT_PATH" \
    --dataset_path "DATASET_PATH"
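The evaluation protocol (frozen features, then an SVM, then test accuracy) can be sketched in a few lines. The code below is a stand-in for illustration only and does not reproduce `unsupervised_rep_test.py`: it fits a binary linear SVM by subgradient descent on the regularised hinge loss, whereas a real evaluation would more likely use a library SVM and multiple activity classes. The function names, hyperparameters, and the two-dimensional "features" are all made up.

```python
def train_linear_svm(features, labels, epochs=20, lr=0.1, lam=0.01):
    """Fit a binary linear SVM (labels in {+1, -1}) by subgradient descent
    on the L2-regularised hinge loss."""
    w = [0.0] * len(features[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            margin = y * (sum(wj * xj for wj, xj in zip(w, x)) + b)
            w = [(1.0 - lr * lam) * wj for wj in w]  # shrinkage from the L2 term
            if margin < 1.0:  # sample violates the margin: hinge-loss step
                w = [wj + lr * y * xj for wj, xj in zip(w, x)]
                b += lr * y
    return w, b


def predict(w, b, x):
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1


def accuracy(w, b, features, labels):
    hits = sum(predict(w, b, x) == y for x, y in zip(features, labels))
    return hits / len(labels)


# Toy stand-ins for IMU features extracted by a frozen student encoder.
train_x = [[2.0, 1.8], [1.6, 2.4], [2.2, 2.1], [1.9, 1.5],
           [-2.0, -1.7], [-1.5, -2.3], [-2.4, -2.0], [-1.8, -1.6]]
train_y = [1, 1, 1, 1, -1, -1, -1, -1]
test_x = [[1.7, 2.2], [2.3, 1.6], [-2.1, -1.9], [-1.6, -2.4]]
test_y = [1, 1, -1, -1]

w, b = train_linear_svm(train_x, train_y)
print(accuracy(w, b, test_x, test_y))
```

Only the classifier sees the labels; the encoder that produced the features is never updated, so the accuracy reflects the quality of the representation itself.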

🌍 Related Works & Baselines

There's a lot of outstanding work on time-series and human activity recognition! Here's an incomplete list. Check out Table 1 in our paper for IMU-based Human Activity Recognition comparisons with these studies:

MOMENT: A Family of Open Time-series Foundation Models [Paper, Code, Hugging Face]
Mantis: Lightweight Calibrated Foundation Model for User-Friendly Time Series Classification [Paper, Code, Hugging Face]
TimesNet: Temporal 2D-Variation Modeling for General Time Series Analysis [Paper, Code]
DLinear: Are Transformers Effective for Time Series Forecasting? [Paper, Code]
Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting [Paper, Code]
IMU2CLIP: Language-grounded Motion Sensor Translation with Multimodal Contrastive Learning [Paper, Code]
CrossHAR: Generalizing Cross-dataset Human Activity Recognition via Hierarchical Self-Supervised Pretraining [Paper, Code]
IMUGPT 2.0: Language-Based Cross Modality Transfer for Sensor-Based Human Activity Recognition [Paper, Code]
Attend and Discriminate: Beyond the State-of-the-Art for Human Activity Recognition Using Wearable Sensors [Paper, Code]
DeepConvLSTM: Deep Convolutional and LSTM Recurrent Neural Networks for Multimodal Wearable Activity Recognition [Paper]

Citation

If you find this repository useful for your research, please consider citing our paper:

@article{chen2025comodo,
  title={Comodo: Cross-modal video-to-imu distillation for efficient egocentric human activity recognition},
  author={Chen, Baiyu and Wongso, Wilson and Li, Zechen and Khaokaew, Yonchanok and Xue, Hao and Salim, Flora},
  journal={arXiv preprint arXiv:2503.07259},
  year={2025}
}

📩 Contact

If you have any questions or suggestions, feel free to contact Baiyu (Breeze) at breeze.chen(at)unsw(dot)edu(dot)au.
