PRISM: Prosody-Integrated Multi-Agent Reasoning Framework for Empathetic Spoken Dialogue
Empathetic spoken dialogue systems require not only semantically appropriate responses but also emotionally aligned prosodic expression. Existing cascade pipelines often discard rich acoustic cues during speech-to-text conversion, while end-to-end speech models lack interpretable control over emotion and knowledge integration.
PRISM addresses these limitations through a multi-agent framework that decouples speech perception, response generation, and speech synthesis into coordinated components. The framework introduces a prosody-to-language translation mechanism to stabilize large language model reasoning and supports on-demand invocation of external knowledge tools for empathetic dialogue generation.
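To make the prosody-to-language idea concrete, here is a minimal sketch of how acoustic measurements could be turned into a short text description for an LLM prompt. This is not PRISM's implementation: the `ProsodyFeatures` fields, the threshold values, and the prompt wording are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class ProsodyFeatures:
    """Hypothetical utterance-level acoustic summary (not PRISM's actual schema)."""
    mean_pitch_hz: float
    energy_db: float
    speech_rate_sps: float  # syllables per second


def _bucket(value: float, low: float, high: float, labels: tuple) -> str:
    """Map a continuous value onto one of three coarse labels."""
    if value < low:
        return labels[0]
    if value > high:
        return labels[2]
    return labels[1]


def describe_prosody(f: ProsodyFeatures) -> str:
    """Translate numeric prosody features into a plain-language sentence.

    The thresholds are illustrative placeholders, not tuned values.
    """
    pitch = _bucket(f.mean_pitch_hz, 140.0, 220.0, ("low", "moderate", "high"))
    energy = _bucket(f.energy_db, -30.0, -15.0, ("quiet", "moderate", "loud"))
    rate = _bucket(f.speech_rate_sps, 3.0, 5.5, ("slow", "moderate", "fast"))
    return (
        f"The speaker uses {pitch} pitch, {energy} volume, "
        f"and a {rate} speaking rate."
    )


def build_prompt(transcript: str, f: ProsodyFeatures) -> str:
    """Combine the transcript with the prosody description for the LLM."""
    return f'User said: "{transcript}"\nProsody: {describe_prosody(f)}'
```

Discrete labels give the language model a small, stable vocabulary to reason over, which is one plausible reading of how a text description could stabilize reasoning compared with raw acoustic values.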

    git clone https://github.com/yourname/PRISM.git
    cd PRISM
    conda create -n prism python=3.10
    conda activate prism
    pip install -r requirements.txt

Experiments are conducted on public empathetic dialogue datasets.
Please download the datasets from their official sources before training and evaluation:
- TOOL-ED: https://github.com/caohy123/EKTC
- AvaMERG: https://huggingface.co/datasets/ZhangHanXD/AvaMERG
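The two downloads can be scripted along these lines. This is a sketch under assumptions: the `data/` target directory and the `fetch` helper are hypothetical, and it presumes `git` is on the path and that `huggingface_hub` is installed (it may not be listed in `requirements.txt`).

```python
import subprocess
from pathlib import Path

# Sources as listed above; the "kind" field is a hypothetical convention
# used only by this sketch to choose the download method.
DATASETS = {
    "TOOL-ED": {"kind": "git", "url": "https://github.com/caohy123/EKTC"},
    "AvaMERG": {"kind": "hf_dataset", "repo_id": "ZhangHanXD/AvaMERG"},
}


def fetch(name: str, root: str = "data") -> Path:
    """Download one dataset into <root>/<name> and return that path."""
    spec = DATASETS[name]
    target = Path(root) / name
    target.parent.mkdir(parents=True, exist_ok=True)
    if spec["kind"] == "git":
        # Clone the GitHub repository that hosts the dataset.
        subprocess.run(["git", "clone", spec["url"], str(target)], check=True)
    else:
        # Imported lazily so the module loads without huggingface_hub installed.
        from huggingface_hub import snapshot_download

        snapshot_download(
            repo_id=spec["repo_id"],
            repo_type="dataset",
            local_dir=str(target),
        )
    return target
```

Calling `fetch("TOOL-ED")` and `fetch("AvaMERG")` would place both datasets under `data/`. Check each source's own instructions first, since a dataset may require accepting terms or additional preprocessing.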
For speech synthesis, we employ StyleTTS2 as the backbone TTS model.
StyleTTS2 can be obtained from:
