News: This survey has been ACCEPTED to the Lecture Style Tutorials Track of KDD 2025 as a HALF-DAY tutorial!
[ๆถๅบไบบไธญๆ่งฃ่ฏป] [ๅๅ็็ฎๆณ็ฌ่ฎฐไธญๆ่งฃ่ฏป] [ๆทฑๅบฆๅพๅญฆไน ไธๅคงๆจกๅLLMไธญๆ่งฃ่ฏป] [QuantMLไธญๆ่งฃ่ฏป]
This is the official repository for "Multi-modal Time Series Analysis: A Tutorial and Survey". [Paper]
This repository is maintained by Yushan Jiang and Kanghui Ning from UConn DSIS.
Please consider citing our survey paper if you find it helpful :), and feel free to share this repository with others!
This survey aims to provide a unique and systematic perspective on effectively leveraging cross-modal interactions from relevant real-world contexts to advance multi-modal time series analysis, addressing both foundational principles and practical solutions. Our assessment is threefold:
- Reviewing multi-modal time series data
- Analyzing cross-modal interactions between time series and other modalities (Fusion, Alignment, Transference)
- Demonstrating the impact of multi-modal time series analysis in applications across diverse domains
Figure 1: The Framework of Our Survey

Figure 2: Categorization of cross-modal interaction methods and representative examples

| Domain | Dataset | Modalities |
|---|---|---|
| Healthcare | MIMIC-III[1], MIMIC-IV[2] | TS, Text, Table |
| | ICBHI[3], Coswara[4], KAUH[5], PTB-XL[6], ZuCo[7] | TS, Text |
| | Image-EEG[8] | TS, Image |
| Finance | FNSPID[9], ACL18[10], CIKM18[11], DOW30[12] | TS, Text |
| Multi-domain | MTBench[13], Time-MMD[14], TimeCAP[15], NewsForecast[16], TTC[17], CiK[18], TSQA[19] | TS, Text |
| Retail | VISUELLE[20] | TS, Image, Text |
| IoT | LEMMA-RCA[21] | TS, Text |
| Speech | LRS3[22], VoxCeleb2[23] | TS (Audio), Image |
| Traffic | NYC-taxi, NYC-bike[24] | ST, Text |
| Environment | Terra[25] | ST, Text |
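
As a rough illustration of what a "TS, Text" sample from the datasets above might look like in code, the sketch below pairs a numeric series with timestamped text. The class name `MultiModalSample`, its fields, and the `window` helper are assumptions made for this example; none of the listed datasets is claimed to ship in this layout.

```python
from dataclasses import dataclass, field


@dataclass
class MultiModalSample:
    """Hypothetical container for one time series with attached text."""
    timestamps: list                            # time index of each observation
    values: list                                # numeric observations
    texts: list = field(default_factory=list)   # (timestamp, text) pairs

    def window(self, start, end):
        """Return the values and texts whose timestamp t satisfies start <= t < end."""
        vals = [v for t, v in zip(self.timestamps, self.values) if start <= t < end]
        txts = [s for t, s in self.texts if start <= t < end]
        return vals, txts


sample = MultiModalSample(
    timestamps=[0, 1, 2, 3],
    values=[10.0, 11.0, 9.5, 12.0],
    texts=[(1, "earnings call"), (3, "rate cut")],
)
history_values, history_texts = sample.window(0, 2)
```

Keeping a timestamp on every text item lets both modalities be sliced with the same window, which is one simple way to keep them in step.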
We define three fundamental types of interactions between time series and other modalities: Fusion, Alignment, and Transference. Each can occur at different stages within a framework: Input, Intermediate (i.e., representations or intermediate outputs), and Output.
- Fusion refers to the process of integrating heterogeneous modalities in a way that captures complementary information across diverse sources to improve time series modeling.
- Alignment ensures that the relationships between different modalities are preserved and semantically coherent when integrated into a unified learning framework.
- Transference refers to the process of mapping between different modalities, which allows one modality to be inferred, translated, or synthesized from another.
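
To make the three definitions above more concrete, the sketch below applies each interaction type to toy embeddings in plain Python. It is illustrative only: the function names, the toy vectors, and the text template are assumptions made for this example and are not taken from any surveyed method. Cross-attention stands in for alignment, and a series-to-text summary stands in for transference.

```python
import math


def fuse_add(ts_emb, text_emb):
    """Fusion by addition: element-wise sum of two equally sized embeddings."""
    if len(ts_emb) != len(text_emb):
        raise ValueError("addition needs embeddings of equal size")
    return [a + b for a, b in zip(ts_emb, text_emb)]


def fuse_concat(ts_emb, text_emb):
    """Fusion by concatenation: stack the two embeddings into one vector."""
    return list(ts_emb) + list(text_emb)


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def align_cross_attention(ts_tokens, text_tokens):
    """Alignment by cross-attention: every time series token (query) attends
    over the text tokens, which serve as both keys and values here."""
    dim = len(text_tokens[0])
    aligned = []
    for query in ts_tokens:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in text_tokens
        ]
        weights = softmax(scores)
        aligned.append(
            [sum(w * value[i] for w, value in zip(weights, text_tokens))
             for i in range(dim)]
        )
    return aligned


def transfer_to_text(series, name="series"):
    """Transference: synthesize a short text description from a series."""
    mean = sum(series) / len(series)
    if series[-1] > series[0]:
        trend = "rising"
    elif series[-1] < series[0]:
        trend = "falling"
    else:
        trend = "flat"
    return (f"{name}: {len(series)} steps, min {min(series):.2f}, "
            f"max {max(series):.2f}, mean {mean:.2f}, trend {trend}")


fused = fuse_add([1.0, 2.0], [0.5, -1.0])
aligned = align_cross_attention([[1.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]])
summary = transfer_to_text([1.0, 2.0, 4.0], name="load")
```

In this toy setup the query `[1.0, 0.0]` puts more attention weight on the first text token, which it matches, than on the second.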
Note:
- F: Fusion; A: Alignment; T: Transference
| Method | Modality | Domain | Task | Stage | F | A | T | Interaction Method | Large Model |
|---|---|---|---|---|---|---|---|---|---|
| Time-MMD (NeurIPS 2024) [Code] | TS, Text | General | Forecasting | Output | ? | ? | ? | Addition | Multiple |
| Wang et al. (NeurIPS 2024) [Code] | TS, Text | General | Forecasting | Input | ? | ? | ? | Prompt | LLaMa2, GPT-4 Turbo |
| | | | | Intermediate | ? | ? | ? | Prompt; LLM reasoning | |
| GPT4MTS (AAAI 2024) | TS, Text | General | Forecasting | Intermediate | ? | ? | ? | Addition; Self-attention | GPT-2 |
| TimeCMA (AAAI 2025) [Code] | TS, Text | General | Forecasting | Input | ? | ? | ? | Meta-description | GPT-2 |
| | | | | Intermediate | ? | ? | ? | Addition; Cross-attention | |
| MOAT (2024) | TS, Text | General | Forecasting | Intermediate | ? | ? | ? | Concat.; Self-attention | S-Bert |
| | | | | Output | ? | ? | ? | Offline synthesis | |
| TimeCAP (AAAI 2025) | TS, Text | General | Classification | Input | ? | ? | ? | LLM Generation | Bert, GPT-4 |
| | | | | Intermediate | ? | ? | ? | Concat.; Self-attention, Retrieval | |
| | | | | Output | ? | ? | ? | Addition | |
| TimeXL (NeurIPS 2025) | TS, Text | General | Classification | Intermediate | ? | ? | ? | Concat., Prompt; LLM Reasoning | Bert, S-Bert, GPT-4o |
| | | | Forecasting | Output | ? | ? | ? | Addition | |
| Hybrid-MMF (2024) [Code] | TS, Text | General | Forecasting | Intermediate | ? | ? | ? | Concat. | GPT-4o |
| Time-LLM (ICLR 2024) [Code] | TS, Text | General | Forecasting | Input | ? | ? | ? | Meta-description | LLaMA, GPT-2 |
| | | | | Intermediate | ? | ? | ? | Concat.; Self-attention | |
| Time-VLM (2025) | TS, Text, Image | General | Forecasting | Input | ? | ? | ? | Feat. Imaging, Meta-description | ViLT, CLIP, BLIP-2 |
| | | | | Intermediate | ? | ? | ? | Addition; Gating, Cross-attention | |
| Unitime (WWW 2024) | TS, Text | General | Forecasting | Input | ? | ? | ? | Meta-description | GPT-2 |
| | | | | Intermediate | ? | ? | ? | Concat.; Self-attention | |
| TESSA (2024) | TS, Text | General | Annotation | Intermediate | ? | ? | ? | Prompt; RL; LLM Generation | GPT-4o |
| InstrucTime (WSDM 2025) [Code] | TS, Text | General | Classification | Intermediate | ? | ? | ? | Concat.; Self-attention | GPT-2 |
| MATMCD (2024) | TS, Text, Graph | General | Causal Discovery | Intermediate | ? | ? | ? | Prompt; LLM Reasoning; Supervision | Multiple |
| STG-LLM (2024) | ST, Text | General | Forecasting | Intermediate | ? | ? | ? | Concat.; Self-attention | GPT-2 |
| TableTime (2024) [Code] | TS, Text | General | Classification | Input | ? | ? | ? | Prompt; Reformulate | Multiple |
| ContextFormer (2024) | TS, Table | General | Forecasting | Intermediate | ? | ? | ? | Addition; Cross-attention | No |
| Time-MQA (2025) [Code] | TS, Text | General | Multiple | Input | ? | ? | ? | Prompt | Multiple |
| MAN-SF (EMNLP 2020) | TS, Text, Graph | Finance | Classification | Intermediate | ? | ? | ? | Bilinear; Graph Convolution | USE |
| Bamford et al. (ICAIF 2023) | TS, Text | Finance | Retrieval | Intermediate | ? | ? | ? | Supervision | S-bert |
| | TS, Image | | | Output | ? | ? | ? | | |
| Chen et al. (2023) | TS, Text, Graph | Finance | Classification | Input | ? | ? | ? | LLM Generation | ChatGPT |
| | | | | Intermediate | ? | ? | ? | Concat.; Graph Convolution | |
| Xie et al. (2023) | TS, Text | Finance | Classification | Input | ? | ? | ? | Prompt | ChatGPT |
| Yu et al. (EMNLP 2023) | TS, Text | Finance | Forecasting | Input | ? | ? | ? | Prompt | GPT-4, Open LLaMA |
| MedTsLLM (2024) [Code] | TS, Text, Table | Healthcare | Multiple | Intermediate | ? | ? | ? | Concat.; Self-attention | Llama2 |
| RespLLM (2024) [Code] | TS (Audio), Text | Healthcare | Classification | Intermediate | ? | ? | ? | Addition, Self-attention | OpenBioLLM-8B |
| METS (2023) | TS, Text | Healthcare | Classification | Output | ? | ? | ? | Contrastive | ClinicalBert |
| Wang et al. (AAAI 2022) | TS, Text | Healthcare | Classification | Intermediate | ? | ? | ? | Supervision | Bart, Bert, RoBerta |
| EEG2TEXT (BigData 2024) | TS, Text | Healthcare | Generation | Output | ? | ? | ? | Self-supervision, Supervision | Bart |
| MEDHMP (EMNLP 2023) [Code] | TS, Text | Healthcare | Classification | Intermediate | ? | ? | ? | Concat.; Self-attention, Contrastive | ClinicalT5 |
| Deznabi et al. (ACL 2021) [Code] | TS, Text | Healthcare | Classification | Intermediate | ? | ? | ? | Concat. | Bio+Clinical Bert |
| Niu et al. (2023) | TS, Text | Healthcare | Classification | Intermediate | ? | ? | ? | Concat.; Cross-attention | BioBERT |
| Yang et al. (EMNLP 2021) [Code] | TS, Text | Healthcare | Classification | Intermediate | ? | ? | ? | Concat., Addition; Gating | ClinicalBERT |
| Liu et al. (2023) [Code] | TS, Text | Healthcare | Classification, Regression | Input | ? | ? | ? | Prompt | PaLM |
| xTP-LLM (2024) [Code] | ST, Text | Traffic | Forecasting | Input | ? | ? | ? | Prompt; Meta-description | Llama2-7B-chat |
| UrbanGPT (2024) [Code] | ST, Text | Traffic | Forecasting | Input | ? | ? | ? | Prompt; Meta-description | Vicuna-7B |
| CityGPT (2024) [Code] | ST, Text | Mobility | Multiple | Input | ? | ? | ? | Prompt | Multiple |
| MULAN (WWW 2024) | TS, Text, Graph | IoT | Causal Discovery | Intermediate | ? | ? | ? | Addition; Contrastive; Supervision | No |
| MIA (2023) | TS, Image | IoT | Anomaly Detection | Intermediate | ? | ? | ? | Addition; Cross-attention, Gating | No |
| Ekambaram et al. (KDD 2020) [Code] | TS, Image, Text | Retail | Forecasting | Intermediate | ? | ? | ? | Concat.; Self & Cross-attention | No |
| Skenderi et al. (2024) [Code] | TS, Image, Text | Retail | Forecasting | Intermediate | ? | ? | ? | Concat.; Cross-attention | No |
| VIMTS (BigData 2022) | ST, Image | Environment | Imputation | Intermediate | ? | ? | ? | Concat.; Supervision | No |
| LITE (2024) [Code] | ST, Text, Image | Environment | Forecasting | Intermediate | ? | ? | ? | Concat.; Self-attention | LLaMA-2-7b |
| AV-HuBERT (ICLR 2022) [Code] | TS (Audio), Image | Speech | Classification | Intermediate | ? | ? | ? | Concat.; Self-attention | HuBert |
| SpeechGPT (EMNLP 2023) [Code] | TS (Audio), Text | Speech | Generation | Intermediate | ? | ? | ? | Concat.; Self-attention | LLaMA-13B |
| LA-GCN (2023) [Code] | ST, Text | Vision | Classification | Intermediate | ? | ? | ? | Supervision | Bert |

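
The Stage column in the table above records where in a pipeline the modalities interact. As a hedged illustration of the two ends of that range, the sketch below contrasts an input-stage interaction (folding textual context and a series into one prompt) with an output-stage interaction (adding two forecasts with a weight). The function names, the prompt wording, and the `weight` parameter are hypothetical and do not reproduce the implementation of any method listed in the table.

```python
def build_prompt(series, context, horizon):
    """Input stage: combine textual context and the series history in one prompt."""
    history = ", ".join(f"{x:.1f}" for x in series)
    return (
        f"Context: {context}\n"
        f"History: {history}\n"
        f"Predict the next {horizon} values."
    )


def fuse_outputs(ts_forecast, text_forecast, weight=0.5):
    """Output stage: weighted addition of two forecasts of equal length.

    `weight` is the share given to the text-informed forecast (an assumption
    of this sketch; a real system might learn it instead).
    """
    if len(ts_forecast) != len(text_forecast):
        raise ValueError("forecasts must cover the same horizon")
    return [
        (1 - weight) * a + weight * b
        for a, b in zip(ts_forecast, text_forecast)
    ]


prompt = build_prompt([1.0, 2.5], "holiday week", horizon=2)
combined = fuse_outputs([10.0, 12.0], [14.0, 8.0], weight=0.25)
```

Intermediate-stage interactions sit between these two: they combine learned representations rather than raw inputs or final predictions.
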
@misc{jiang2025multimodal,
title={Multi-modal Time Series Analysis: A Tutorial and Survey},
author={Yushan Jiang and Kanghui Ning and Zijie Pan and Xuyang Shen and Jingchao Ni and Wenchao Yu and Anderson Schneider and Haifeng Chen and Yuriy Nevmyvaka and Dongjin Song},
year={2025},
eprint={2503.13709},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2503.13709},
}

