ACM WSDM 2025
Please let us know if you find a mistake or have any suggestions!
If you find this resource helpful, please consider starring this repository and citing our research.
```bibtex
@inproceedings{cheng2025instructime,
  title={Instructime: Advancing time series classification with multimodal language modeling},
  author={Cheng, Mingyue and Chen, Yiheng and Liu, Qi and Liu, Zhiding and Luo, Yucong and Chen, Enhong},
  booktitle={Proceedings of the Eighteenth ACM International Conference on Web Search and Data Mining},
  pages={792--800},
  year={2025}
}
```
InstructTime is a multimodal language model for time series classification that bridges the gap between time series data and natural language understanding.
| Resource | Link |
|---|---|
| 🤗 Dataset | zhjai/InstructTime |
| 🤗 Base Model | openai-community/gpt2 |
| 📄 Paper | ACM Digital Library |
The following table shows the mapping between dataset names used in the code and their corresponding domains:
| Code Name | Domain | Description |
|---|---|---|
| sleep | EEG | Electroencephalogram (Sleep Stage) |
| geo / ecg | ECG | Electrocardiogram |
| dev | FD | Fault Detection (Industrial Equipment) |
| har | HAR | Human Activity Recognition |
| whale | RWC | Real World Computing (Whale Sound) |
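When scripting experiments across datasets, it can help to resolve these code names programmatically. A small helper mapping, taken directly from the table above (the dict itself is illustrative and not part of the codebase):

```python
# Mapping from --dataset code names to (domain, description), per the table above.
DATASET_DOMAINS = {
    "sleep": ("EEG", "Electroencephalogram (Sleep Stage)"),
    "geo":   ("ECG", "Electrocardiogram"),
    "ecg":   ("ECG", "Electrocardiogram"),
    "dev":   ("FD",  "Fault Detection (Industrial Equipment)"),
    "har":   ("HAR", "Human Activity Recognition"),
    "whale": ("RWC", "Real World Computing (Whale Sound)"),
}

domain, desc = DATASET_DOMAINS["geo"]
print(domain)  # ECG
```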
- Python 3.9+
- PyTorch 2.1+
- CUDA (recommended)
```bash
# Clone the repository
git clone https://github.com/your-repo/InstructTime.git
cd InstructTime

# Install dependencies
pip install -r requirements.txt

# Download the GPT-2 base model from Hugging Face (required)
# Option 1: using huggingface-cli
huggingface-cli download openai-community/gpt2 --local-dir ./gpt2

# Option 2: using git lfs
git lfs install
git clone https://huggingface.co/openai-community/gpt2 ./gpt2
```

First, train the VQ-VAE-based time series tokenizer for each domain.
Parameters for each dataset (format: d_model, n_embed, wave_length):
| Dataset | d_model | n_embed | wave_length |
|---|---|---|---|
| ECG (geo) | 64 | 128 | 40 |
| EEG (sleep) | 64 | 256 | 25 |
| FD (dev) | 64 | 512 | 40 |
| HAR | 64 | 256 | 1 |
| RWC (whale) | 64 | 384 | 32 |
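In this table, d_model is the encoder's latent width and n_embed the codebook size; wave_length appears to act as the temporal compression factor (an assumption here), so larger values yield fewer tokens per series. The heart of the tokenizer is the nearest-codebook lookup; a NumPy sketch of that step (the actual model lives in TStokenizer/model.py; shapes and names are illustrative):

```python
import numpy as np

def quantize(z: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Map each d_model-dim encoding in z, shape (T, d_model), to the index
    of its nearest codebook entry, shape (n_embed, d_model), by Euclidean
    distance. Returns discrete token ids in [0, n_embed)."""
    # (T, n_embed) squared distances via broadcasting
    d2 = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return d2.argmin(axis=1)

rng = np.random.default_rng(0)
codebook = rng.normal(size=(256, 64))  # n_embed=256, d_model=64 (HAR/EEG setting)
# Encodings that sit very close to codebook entries 3, 17, and 42
z = codebook[[3, 17, 42]] + 0.01 * rng.normal(size=(3, 64))
print(quantize(z, codebook))  # -> [ 3 17 42]
```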
```bash
cd TStokenizer

# Train tokenizer for HAR dataset
python main.py \
    --save_path ../vqvae/HAR \
    --dataset har \
    --data_path ../datasets/HAR \
    --device cuda:0 \
    --d_model 64 \
    --n_embed 256 \
    --wave_length 1

# Train tokenizer for EEG (sleep) dataset
python main.py \
    --save_path ../vqvae/EEG \
    --dataset sleep \
    --data_path ../datasets/EEG \
    --device cuda:0 \
    --d_model 64 \
    --n_embed 256 \
    --wave_length 25

# Train tokenizer for ECG (geo) dataset
python main.py \
    --save_path ../vqvae/ECG \
    --dataset geo \
    --data_path ../datasets/ECG \
    --device cuda:0 \
    --d_model 64 \
    --n_embed 128 \
    --wave_length 40

# Train tokenizer for FD (dev) dataset
python main.py \
    --save_path ../vqvae/FD \
    --dataset dev \
    --data_path ../datasets/FD \
    --device cuda:0 \
    --d_model 64 \
    --n_embed 512 \
    --wave_length 40

# Train tokenizer for RWC (whale) dataset
python main.py \
    --save_path ../vqvae/RWC \
    --dataset whale \
    --data_path ../datasets/RWC \
    --device cuda:0 \
    --d_model 64 \
    --n_embed 384 \
    --wave_length 32
```

Next, perform cross-domain pretraining using all five domains jointly.
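Conceptually, the `mix` setting interleaves examples from all five domains into one shuffled pretraining stream. A toy sketch of the idea (run_pretrain_universal.py handles this internally; the function and names here are illustrative):

```python
import random

def mix_domains(datasets: dict, seed: int = 42) -> list:
    """Interleave (domain, example) pairs from several domains into a single
    shuffled stream, as joint cross-domain pretraining requires."""
    stream = [(name, ex) for name, data in datasets.items() for ex in data]
    random.Random(seed).shuffle(stream)
    return stream

toy = {"har": [0, 1], "sleep": [0, 1, 2], "geo": [0]}
stream = mix_domains(toy)
print(len(stream))  # 6 examples drawn from all three toy domains
```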
```bash
cd ..  # back to the project root

python run_pretrain_universal.py \
    --dataset mix \
    --model_path ./gptmodel \
    --data_root ./datasets \
    --vqvae_root ./vqvae \
    --device cuda:0 \
    --epochs 15 \
    --batch_size 16 \
    --lr 5e-5
```

After pretraining, fine-tune using the pretrained model.
```bash
# Fine-tune on the mixed (all-domain) dataset
python run_truth_loss.py \
    --dataset mix \
    --model_path ./gptmodel \
    --load_model_path ./gptmodel/no_frozen/run_0/best_model \
    --data_root ./datasets \
    --vqvae_root ./vqvae \
    --device cuda:0 \
    --epochs 15 \
    --batch_size 16 \
    --lr 1e-5 \
    --adapt

# Fine-tune on a single domain (e.g. HAR)
python run_truth_loss.py \
    --dataset har \
    --model_path ./gptmodel/har \
    --load_model_path ./gptmodel/no_frozen/run_0/best_model \
    --data_root ./datasets \
    --vqvae_root ./vqvae \
    --device cuda:0 \
    --epochs 15 \
    --batch_size 16 \
    --lr 1e-5 \
    --adapt
```

An example instruction prompt (EEG domain) and its target output:

You will be receiving electroencephalogram (EEG) related signals.
Electroencephalogram signals: <BET><TS Tokens><EET>
The sleep patterns include waking up, rapid eye movement sleep, and sleep stages one through four, as well as periods of movement and unidentified stages.
Select one of the eight previously mentioned sleep patterns and report on the person's sleep using the provided information.
The person's sleep pattern is waking up
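A prompt of this shape can be assembled programmatically around the tokenizer's output. A sketch following the example above (the helper name and the token placeholder are hypothetical; the `<BET>`/`<EET>` markers come from the example):

```python
def build_eeg_prompt(ts_tokens: str) -> str:
    """Assemble an EEG instruction prompt around tokenized signals,
    following the example prompt format shown above."""
    return (
        "You will be receiving electroencephalogram (EEG) related signals.\n"
        f"Electroencephalogram signals: <BET>{ts_tokens}<EET>\n"
        "The sleep patterns include waking up, rapid eye movement sleep, "
        "and sleep stages one through four, as well as periods of movement "
        "and unidentified stages.\n"
        "Select one of the eight previously mentioned sleep patterns and "
        "report on the person's sleep using the provided information."
    )

prompt = build_eeg_prompt("<t12><t87><t3>")
print(prompt.splitlines()[1])  # Electroencephalogram signals: <BET><t12><t87><t3><EET>
```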
```
InstructTime/
├── TStokenizer/                 # Time series tokenizer (VQ-VAE)
│   ├── main.py                  # Tokenizer training script
│   ├── model.py                 # VQ-VAE model
│   └── ...
├── datasets/                    # Dataset directory
├── vqvae/                       # Trained tokenizer checkpoints
├── gpt2/                        # GPT-2 base model
├── run_pretrain_universal.py    # Cross-domain pretraining script
├── run_truth_loss.py            # Supervised fine-tuning script
├── multidataset.py              # Dataset processing
├── multimodel.py                # Model definition
├── args.py                      # Argument parser
├── metrics.py                   # Evaluation metrics
└── requirements.txt             # Dependencies
```
This project is for research purposes only.