
Explainable Action Form Assessment by Exploiting Multimodal Chain-of-Thoughts Reasoning


arXiv | GitHub | Kaggle Dataset: CoT-AFA
State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, China

In real-world scenarios such as fitness training and martial arts, evaluating whether human actions conform to standard forms is essential for safety and effectiveness. Traditional video understanding focuses on what actions occur and where, whereas our work introduces the Action Form Assessment (AFA) task to assess how well actions are performed against objective standards. We present the CoT-AFA dataset, featuring diverse workout videos with Chain-of-Thought explanations that provide step-by-step reasoning, error analysis, and corrective solutions, enabling explainable feedback for skill improvement.

Release

  • 2025-12-25 🚀 Released the CoT-AFA dataset and EFA source code on GitHub.
  • 2025-12-20 ♥️ Our paper is available on arXiv!

CoT-AFA

Overview: We introduce CoT-AFA, a diverse dataset for the Human Action Form Assessment (AFA) task. It includes 3,392 videos (364,812 frames) of fitness and martial arts actions, with annotations for action categories, standardization (standard/non-standard), multiple viewpoints, and Chain-of-Thought text explanations. The dataset supports tasks like action classification, quality assessment, and explainable feedback generation.

| Dataset | Workout modes | Workout types | Action categories | Standard Videos | Non-standard Videos | Total Videos | Total Frames | CoT Text Explanations |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CoT-AFA | 2 | 28 | 141 | 2,242 | 1,150 | 3,392 | 364,812 | 3,392 |

CoT-AFA features a three-level lexicon (workout mode, type, category) and multi-view annotations for comprehensive analysis.

Results

Our Explainable Fitness Assessor (EFA) framework achieves significant improvements:

  • Explanation generation: +16.0% in CIDEr
  • Action classification: +2.7% in accuracy
  • Quality assessment: +2.1% in accuracy

These results highlight the effectiveness of multimodal fusion and Chain-of-Thought reasoning in AFA.

Run EFA

Requirements

  • Python 3.8
  • PyTorch 2.4.1+cu124
  • Transformers 4.46.3
  • OpenCV 4.12.0.88
  • Pandas 2.0.3
  • NumPy 1.24.1
  • Pillow 10.2.0
  • scikit-learn 1.3.2
  • DeepSpeed 0.17.6
  • And other dependencies listed in requirements.txt
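
The pinned dependencies can be installed in one step (a minimal setup sketch; the cu124 build of PyTorch may need to be installed from the official PyTorch wheel index first):

python3.8 -m venv efa-env          # optional: isolated environment
source efa-env/bin/activate
pip install -r requirements.txt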

Dataset Preparation

Dataset Download

The model is trained and evaluated on the CoT-AFA dataset, which contains fitness and martial arts videos with quality annotations and Chain-of-Thought explanations.

  1. Download the CoT-AFA dataset: The dataset can be obtained from https://www.kaggle.com/datasets/dd34dc6f49a960a31e03af896f85be526a72f8c9a684defd715c75d62bedbdc2.

  2. Dataset Structure: After downloading, organize the dataset as follows:

    AQA_data/
    ├── workout_ori/          # Original video files (.mp4)
    │   ├── 00/               # Exercise class 0 videos
    │   ├── 01/               # Exercise class 1 videos
    │   └── ...              # Other exercise classes
    └── frames/               # Extracted video frames (.jpg)
        ├── 00_00/         # Frames for class 00, video 00
        ├── 00_01/         # Frames for class 00, video 01
        └── ...
    
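If you extract the frames yourself, the following Python sketch uses OpenCV to decode every video under workout_ori/ into the frames/ layout above (the frame naming, sampling rate, and resize are assumptions; check the official preprocessing used to build the released pickles):

# extract_frames.py -- sketch for populating AQA_data/frames/.
# Assumed layout: frames/<class>_<video>/<index>.jpg; the official naming may differ.
import os
import cv2

def extract_frames(video_path, out_dir, size=(398, 224)):
    """Decode one .mp4 and save every frame as a resized .jpg."""
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        # cv2.resize expects (width, height); 398x224 follows input_frame_size in AFA.yaml
        frame = cv2.resize(frame, size)
        cv2.imwrite(os.path.join(out_dir, f"{idx:05d}.jpg"), frame)
        idx += 1
    cap.release()
    return idx

if __name__ == "__main__":
    root = "AQA_data"
    for cls in sorted(os.listdir(os.path.join(root, "workout_ori"))):
        cls_dir = os.path.join(root, "workout_ori", cls)
        for name in sorted(os.listdir(cls_dir)):
            if name.endswith(".mp4"):
                vid = os.path.splitext(name)[0]
                out_dir = os.path.join(root, "frames", f"{cls}_{vid}")
                n = extract_frames(os.path.join(cls_dir, name), out_dir)
                print(f"{cls}/{name}: {n} frames -> {out_dir}")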

Configuration

Modifying Dataset Paths

Edit the configuration file _args/AFA.yaml to update the dataset paths:

dataset: {
    data_root: /path/to/AQA_data/workout_ori,  # Path to original videos
    video_dir: /path/to/AQA_data/frames,       # Path to extracted frames
    yaml_file: ./swinbert_val.yaml,
    train_datafile: train.pkl,                 # Path to train split pickle
    test_datafile: test.pkl,                   # Path to test split pickle
    max_seq_len: 256,
    input_frame_size: [398, 224],
    crop_frame_size: 224,
  }
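
Before launching training, you can sanity-check the edited paths with a short script (a sketch assuming the file parses as standard YAML; the EFA code may use its own loader):

# check_config.py -- verify that the dataset paths in _args/AFA.yaml exist.
import os
import yaml  # pip install pyyaml

with open("_args/AFA.yaml") as f:
    cfg = yaml.safe_load(f)

dataset_cfg = cfg["dataset"]
for key in ("data_root", "video_dir"):
    path = dataset_cfg[key]
    print(f"{key}: {path} -> {'OK' if os.path.isdir(path) else 'MISSING'}")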

Training

To train the model:

CUDA_VISIBLE_DEVICES=0 python main_caption.py --config _args/args_AFA.json --path_output output

For multi-GPU training:

CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch --nproc_per_node=2 --master_port=5567 main_caption.py --config _args/args_AFA.json --path_output output

Evaluation

The model is evaluated with COCO captioning metrics (BLEU, METEOR, ROUGE-L, CIDEr) for explanation generation, and with regression/classification metrics for quality assessment.

Evaluation results are saved in the output/ directory with detailed metrics.
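
For reference, the captioning metrics can also be reproduced offline with pycocoevalcap, the package typically used by LAVENDER-style evaluation code (a minimal sketch; the exact tokenization and metric set in EFA's own evaluation script are assumptions, and METEOR is omitted because it needs a Java runtime):

# caption_metrics_sketch.py -- standalone BLEU / ROUGE-L / CIDEr scoring.
# Assumes ground-truth and generated explanations are plain strings keyed by
# video id; EFA's evaluation pipeline may tokenize differently.
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.rouge.rouge import Rouge
from pycocoevalcap.cider.cider import Cider

def score_captions(gts, res):
    """gts/res: dicts mapping video id -> list of explanation strings."""
    scores = {}
    bleu, _ = Bleu(4).compute_score(gts, res)
    scores.update({f"BLEU-{i + 1}": s for i, s in enumerate(bleu)})
    scores["ROUGE-L"], _ = Rouge().compute_score(gts, res)
    scores["CIDEr"], _ = Cider().compute_score(gts, res)
    return scores

if __name__ == "__main__":
    gts = {"00_00": ["keep the back straight and lower the hips further"]}
    res = {"00_00": ["keep the back straight during the squat"]}
    for name, value in score_captions(gts, res).items():
        print(f"{name}: {value:.4f}")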

Acknowledgement

Our evaluation code is built upon LAVENDER. We thank the LAVENDER team for their valuable contributions to the open-source community and for providing an excellent reference for video captioning evaluation.

Citation

If you find our paper, dataset, or code useful, please cite:

@misc{qi2025explainableactionformassessment,
      title={Explainable Action Form Assessment by Exploiting Multimodal Chain-of-Thoughts Reasoning}, 
      author={Mengshi Qi and Yeteng Wu and Xianlin Zhang and Huadong Ma},
      year={2025},
      eprint={2512.15153},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2512.15153}, 
}
