TimeMaster: Training Time-Series Multimodal LLMs to Reason via Reinforcement Learning
Accepted at the NeurIPS 2025 Workshop BERT2S
TimeMaster is a reinforcement-learning-enhanced framework for training time-series multimodal large language models (MLLMs). It enables structured, interpretable reasoning over visualized time-series signals and has been evaluated on real-world tasks such as EMG, ECG, and Human Activity Recognition (HAR) using Qwen2.5-VL-3B-Instruct.
- [2025.09.23] TimeMaster accepted at the NeurIPS 2025 Workshop BERT2S.
- [2025.06.21] SFT model released. See link.
- [2025.06.21] Code released.
- [2025.06.16] Our paper on TimeMaster released. See link.
TimeMaster performs structured reasoning on time-series images using reinforcement learning with composite rewards. The framework integrates format, hard (accuracy), and soft rewards to improve classification, interpretability, and clinical insight generation.
conda create -n timemaster python=3.11 -y
conda activate timemaster
pip3 install torch==2.6.0 --index-url https://download.pytorch.org/whl/cu124
pip3 install flash-attn==2.7.4.post1 --no-build-isolation
pip3 install -e .
pip3 install vllm==0.8.2
pip3 install -r requirements_timemaster.txt
Currently, we provide the CTU dataset.
To preprocess the dataset, simply run the following script:
bash example/data_preprocess/ctu.sh
After successful execution, the following preprocessed data will be generated:
data/ctu_image/
├── images/
├── test/
├── train/
├── dataset_dict.json
├── test.parquet
└── train.parquet
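To verify the preprocessing output, you can load the generated parquet splits, for example with pandas. This is a minimal sketch; the exact column layout is whatever example/data_preprocess/ctu.sh emits:

```python
# Sanity-check the preprocessed CTU splits (assumes pandas with a parquet engine, e.g. pyarrow).
import pandas as pd

train = pd.read_parquet("data/ctu_image/train.parquet")
test = pd.read_parquet("data/ctu_image/test.parquet")

print(f"train samples: {len(train)}, test samples: {len(test)}")
print("columns:", list(train.columns))  # column names depend on the preprocessing script
print(train.head(1).T)                  # inspect one full example
```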
Download the SFT model from TimeMaster's Hugging Face repository using the command below:
huggingface-cli download langfeng01/TimeMaster-SFT-Qwen2.5-VL-3B-CTU --local-dir ./checkpoints/TimeMaster-SFT-Qwen2.5-VL-3B-CTU/
This will download all model files into the ./checkpoints/ directory.
We offer two types of training:
- TimeMaster (SFT + RL): RL training initialized from a supervised fine-tuned (SFT) checkpoint. To use this, set MODEL_PATH=./checkpoints/TimeMaster-SFT-Qwen2.5-VL-3B-CTU in the script ./example/grpo_trainer/run_ctu.sh
- TimeMaster (RL): RL training from scratch using the base model. To use this, set MODEL_PATH=Qwen/Qwen2.5-VL-3B-Instruct in the script ./example/grpo_trainer/run_ctu.sh
After setting the appropriate MODEL_PATH, start the RL training by running:
bash example/grpo_trainer/run_ctu.sh
After training, the model checkpoint will be saved in: ./checkpoints/
To start evaluation, set EVAL=True in the script: ./example/grpo_trainer/run_ctu.sh. Then, run the following command:
bash example/grpo_trainer/run_ctu.sh
TimeMaster supports additional datasets beyond CTU, including EMG, ECG, HAR, RCW, and TEE.
To process these datasets, follow the same data preparation pipeline demonstrated in example/data_preprocess/ctu.sh.
The core reward functions are located in ./verl/utils/reward_score/:
- ctu.py: Implements format and accuracy rewards for the CTU dataset.
- emg_soft.py: Demonstrates a composite reward setup with three components: format, accuracy, and extension (the latter using the OpenAI API for soft evaluation).
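For reference, a composite reward of this kind can be combined as in the sketch below. This is a hypothetical, minimal illustration rather than the repository's implementation: the <think>/<answer> response template, function names, and weights are all assumptions.

```python
import re

def format_reward(response: str) -> float:
    """Format reward: 1.0 if the response follows an assumed <think>...</think><answer>...</answer> template."""
    pattern = r"<think>.*?</think>\s*<answer>.*?</answer>"
    return 1.0 if re.fullmatch(pattern, response.strip(), flags=re.DOTALL) else 0.0

def accuracy_reward(response: str, label: str) -> float:
    """Hard reward: 1.0 when the class named inside <answer> matches the ground-truth label."""
    match = re.search(r"<answer>(.*?)</answer>", response, flags=re.DOTALL)
    pred = match.group(1).strip().lower() if match else ""
    return 1.0 if pred == label.strip().lower() else 0.0

def composite_reward(response: str, label: str, soft_score: float = 0.0,
                     w_fmt: float = 0.1, w_acc: float = 0.8, w_soft: float = 0.1) -> float:
    """Weighted combination of format, accuracy, and an optional soft extension score."""
    return (w_fmt * format_reward(response)
            + w_acc * accuracy_reward(response, label)
            + w_soft * soft_score)
```

In emg_soft.py, the soft extension component is produced by the OpenAI API rather than passed in as a precomputed score.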
We will release numeric-modality comparison resources for comparing the image-based and numeric-input configurations.
Figure: Accuracy and relative token cost of TimeMaster (SFT+RL) using image-based and numeric inputs across six tasks.
Impact of Input Modality.
To explore the effect of input modality, we compare TimeMaster using visual inputs (line plots) with a numeric-input variant that processes raw tokenized values, as shown in the figure above. Both models share the same Qwen2.5-3B architecture and are trained under an identical two-stage pipeline (SFT + RL), differing only in input format.
- Visual inputs enable more robust reasoning. As shown, visual inputs consistently yield higher accuracy than numeric inputs. This is because numeric inputs impose a heavy symbolic burden on the model, often leading to hallucinations and fragmented reasoning that undermine performance. In contrast, visual representations capture global temporal structures, such as trends, peaks, and rhythmic patterns, that closely mirror the diagnostic strategies employed by human experts in fields such as ECG and EMG analysis.
- Visual inputs offer superior token efficiency, aligned with findings from DeepSeek-OCR. Numeric sequences scale linearly with length and often consume 5x more tokens than visual representations. For example, a 4k-point sequence can yield over 40k tokens, more than 88x its visual counterpart, whereas our visual inputs maintain a fixed size regardless of sequence length, greatly improving scalability.
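To get a rough sense of this gap yourself, you can tokenize a serialized numeric sequence with the base model's tokenizer. This is a hedged sketch, not part of the released code; exact counts depend on number formatting and the tokenizer version.

```python
# Illustrate how a numeric time series grows in token count when serialized as text.
# Assumes the Hugging Face transformers library with access to the Qwen2.5-VL tokenizer.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-VL-3B-Instruct")

# A synthetic 4,000-point series, serialized the way a numeric-input variant would receive it.
series = [round((0.1 * i) % 7.3, 3) for i in range(4000)]
numeric_prompt = ", ".join(str(v) for v in series)

print("numeric tokens:", len(tok(numeric_prompt).input_ids))
# A rendered line plot of the same series costs a roughly fixed number of visual tokens,
# independent of how many points it contains.
```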
If TimeMaster helps your research, we would appreciate it if you could cite our work:
@article{zhang2025timemaster,
title={TimeMaster: Training Time-Series Multimodal LLMs to Reason via Reinforcement Learning},
author={Zhang, Junru and Feng, Lang and Guo, Xu and Wu, Yuhan and Dong, Yabo and Xu, Duanqing},
journal={arXiv preprint arXiv:2506.13705},
year={2025}
}
We thank the veRL project for foundational RL infrastructure and the Qwen2-VL-Finetune project for support in SFT.


