cruiseresearchgroup/COMODO


COMODO: Cross-Modal Video-to-IMU Distillation for Efficient Egocentric
Human Activity Recognition

1 School of Computer Science and Engineering, University of New South Wales, Sydney, Australia
2 ARC Centre of Excellence for Automated Decision Making + Society


📢 News

  • 05/2026: Excited to release our latest work AnyMo, a comprehensive framework for wearable motion understanding, covering synthetic IMU generation, geometry-aware pre-training, motion-language alignment, data resources, and a new benchmark.

  • 05/2026: Excited to share ZARA: Zero-training Activity Reasoning Agents, a training-free, evidence-grounded LLM agent framework for motion time-series reasoning, accepted as an ACL 2026 Oral paper!

🌟 Overview

COMODO is an open-source framework for cross-modal video-to-IMU distillation, aimed at efficient egocentric human activity recognition (HAR).

🔑 The key features of COMODO:

  • Self-supervised Cross-modal Knowledge Transfer: We propose COMODO, a cross-modal self-supervised distillation framework that leverages pretrained video and time-series models to enable label-free knowledge transfer from a stronger modality (video) with richer training data to a weaker modality (IMU) with limited data.
  • A Self-supervised and Effective Cross-modal Queuing Mechanism: We introduce a cross-modal FIFO queue that maintains video embeddings as a stable and diverse reference distribution for IMU feature distillation, extending the instance queue distribution learning approach from single-modality to cross-modality.
  • Teacher-Student Model Agnostic: COMODO supports diverse video and time-series pretrained models, enabling flexible teacher-student configurations and future integration with stronger foundation models.
  • Cross-dataset Generalization: We demonstrate that COMODO maintains superior performance when evaluated on unseen datasets, even outperforming fully supervised models, which highlights its robustness and generalizability for egocentric HAR tasks.
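
The cross-modal queue described above can be illustrated with a minimal sketch. It is not the repository's implementation: it uses plain Python lists instead of PyTorch tensors, and the function names, temperature defaults, and the choice to enqueue after computing the loss are all assumptions made for illustration. The idea shown is that the video (teacher) and IMU (student) embeddings are each compared against the same queue of past video embeddings, and the student's similarity distribution is pulled toward the teacher's.

```python
import math
from collections import deque


def l2_normalize(vec):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def softmax(scores, temperature):
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def similarity_distribution(embedding, queue, temperature):
    """Distribution of one embedding's similarity over the queued video embeddings."""
    unit = l2_normalize(embedding)
    sims = [sum(a * b for a, b in zip(unit, ref)) for ref in queue]
    return softmax(sims, temperature)


def distillation_loss(video_emb, imu_emb, queue, teacher_temp=0.05, student_temp=0.1):
    """Cross-entropy between teacher (video) and student (IMU) distributions.

    The temperature defaults are hypothetical, not the repository's values.
    """
    teacher = similarity_distribution(video_emb, queue, teacher_temp)
    student = similarity_distribution(imu_emb, queue, student_temp)
    return -sum(t * math.log(s + 1e-12) for t, s in zip(teacher, student))


# FIFO queue of unit-length video embeddings; maxlen is analogous to a queue size.
queue = deque(maxlen=4)
for video_vec in ([1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]):
    queue.append(l2_normalize(video_vec))

video_emb = [1.0, 0.1]
aligned_loss = distillation_loss(video_emb, [0.9, 0.1], queue)
misaligned_loss = distillation_loss(video_emb, [-1.0, 0.0], queue)

# Assumed update order: enqueue the newest video embedding, evicting the oldest.
queue.append(l2_normalize(video_emb))
```

In this toy setup an IMU embedding pointing in roughly the same direction as its paired video embedding yields a much lower loss than one pointing the opposite way, and the queue never grows beyond its fixed length.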

📂 Data & Results

All experimental results and ablation study findings can be found in the /results folder.

The /dataset folder contains the train, val, and test splits for each dataset, along with our preprocessing scripts. Specifically, ego4d_subset_ids.txt is a subset of all available IMU-containing IDs, which we obtained by applying the official Ego4D filter from their website. This represents the complete subset of data that we can access.

🚀 Getting started

Cross-modal Self-supervised Distillation

To run self-supervised video-to-IMU distillation, use the train.py command shown at the end of this section.

Note: [ ] denotes optional parameters.

Currently supported pretrained models:

  • Time-series models: MOMENT, Mantis
  • Video models: VideoMAE, TimeSformer

Other pretrained models can be used with minor modifications to the code.

python train.py \
    --video_ckpt "facebook/timesformer-base-finetuned-k400" \
    --imu_ckpt "paris-noah/Mantis-8M" \
    --dataset_path "DATASET_PATH" \
    --encoded_video_path "ENCODED_VIDEO_PATH" \
    --anchor_video_path "ANCHOR_VIDEO_PATH" \
    [--queue_size QUEUE_SIZE] \
    [--student_temp STUDENT_TEMP] \
    [--teacher_temp TEACHER_TEMP] \
    [--learning_rate LR] \
    [--num_epochs EPOCH] \
    [--batch_size BS] \
    [--num_clips 0] \
    [--seed SEED] \
    [--mlp_hidden_dim MLP_HIDDEN_DIM] \
    [--mlp_output_dim MLP_OUTPUT_DIM] \
    [--reduction "concat"] \
    [--is_raw true]
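
When launching many runs, it may be convenient to assemble this command programmatically. The helper below is a hypothetical sketch, not part of the repository: `build_train_command` is an invented name, the optional values are illustrative placeholders, and it assumes that leaving out an optional flag makes train.py fall back to its own default. Only flag names that appear in the command above are used.

```python
import shlex


def build_train_command(required, optional=None):
    """Assemble the train.py invocation, skipping optional flags set to None."""
    cmd = ["python", "train.py"]
    for flag, value in required.items():
        cmd += [f"--{flag}", str(value)]
    for flag, value in (optional or {}).items():
        if value is not None:
            cmd += [f"--{flag}", str(value)]
    return cmd


required = {
    "video_ckpt": "facebook/timesformer-base-finetuned-k400",
    "imu_ckpt": "paris-noah/Mantis-8M",
    "dataset_path": "DATASET_PATH",
    "encoded_video_path": "ENCODED_VIDEO_PATH",
    "anchor_video_path": "ANCHOR_VIDEO_PATH",
}
# Illustrative values only; None means "omit the flag".
optional = {"seed": 42, "queue_size": None, "reduction": "concat"}

command = build_train_command(required, optional)
print(shlex.join(command))
```

The resulting list can be passed directly to `subprocess.run(command, check=True)` once the placeholder paths are replaced with real ones.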

Unsupervised Representation Learning Evaluation

We evaluate the learned IMU representations in an unsupervised manner (see Section 3.2 of our paper): we train a Support Vector Machine (SVM) on the extracted IMU features and report classification accuracy on the test set. Run the following command to start the evaluation:

python unsupervised_rep_test.py \
    --imu_ckpt "AutonLab/MOMENT-1-small" \
    --model_path "MODEL_WEIGHT_PATH" \
    --dataset_path "DATASET_PATH"

🌍 Related Works & Baselines

There's a lot of outstanding work on time series and human activity recognition! Here's an incomplete list. Check out Table 1 in our paper for IMU-based human activity recognition comparisons with these studies:

  • MOMENT: A Family of Open Time-series Foundation Models [Paper, Code, Hugging Face]
  • Mantis: Lightweight Calibrated Foundation Model for User-Friendly Time Series Classification [Paper, Code, Hugging Face]
  • TimesNet: Temporal 2D-Variation Modeling for General Time Series Analysis [Paper, Code]
  • DLinear: Are Transformers Effective for Time Series Forecasting? [Paper, Code]
  • Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting [Paper, Code]
  • IMU2CLIP: Language-grounded Motion Sensor Translation with Multimodal Contrastive Learning [Paper, Code]
  • CrossHAR: Generalizing Cross-dataset Human Activity Recognition via Hierarchical Self-Supervised Pretraining [Paper, Code]
  • IMUGPT 2.0: Language-Based Cross Modality Transfer for Sensor-Based Human Activity Recognition [Paper, Code]
  • Attend and Discriminate: Beyond the State-of-the-Art for Human Activity Recognition Using Wearable Sensors [Paper, Code]
  • DeepConvLSTM: Deep Convolutional and LSTM Recurrent Neural Networks for Multimodal Wearable Activity Recognition [Paper]

Citation

If you find this repository useful for your research, please consider citing our paper:

@article{chen2025comodo,
  title={Comodo: Cross-modal video-to-imu distillation for efficient egocentric human activity recognition},
  author={Chen, Baiyu and Wongso, Wilson and Li, Zechen and Khaokaew, Yonchanok and Xue, Hao and Salim, Flora},
  journal={arXiv preprint arXiv:2503.07259},
  year={2025}
}

📩 Contact

If you have any questions or suggestions, feel free to contact Baiyu (Breeze) at breeze.chen(at)unsw(dot)edu(dot)au.

About

[UbiComp/IMWUT '26] Official Repo for COMODO: Cross-Modal Video-to-IMU Distillation for Efficient Egocentric Human Activity Recognition
