This repo contains PyTorch implementation of the following ICLR 2026 Oral paper:
UALM: Unified Audio Language Model for Understanding, Generation, and Reasoning
Authors: Jinchuan Tian*, Sang-gil Lee*, Zhifeng Kong*, Sreyan Ghosh, Arushi Goel, Chao-Han Huck Yang, Wenliang Dai, Zihan Liu, Hanrong Ye, Shinji Watanabe, Mohammad Shoeybi, Bryan Catanzaro, Rafael Valle, Wei Ping
UALM is an advanced Audio-Language Model that unifies text and audio tasks including: text problem solving, audio understanding, text-to-audio generation, and multimodal reasoning across modalities. UALM matches the quality of state-of-the-art specialized models in each task, and is the first demonstration of cross-modal generative reasoning in the audio research domain.
Data preparation is perhaps the most complicated part of launching this model. Since there are too many files in audio-text modeling, we implemented a custom tarball-based storage (similar to WebDatasets) that is suitable for efficient audio storage and loading.
We have three types of data: text-only, text-to-audio, and audio-understanding.
(1) Get raw data (most difficult)
- Text-only: we collect data from NVIDIA's text datasets: dataset 1 and dataset 2. Each row of the
jsonlis{"input": [{"role": "user", "content": "Question"}], "output": "Output"}. - Audio understanding: follow Audio Flamingo 3 datasets (see here). We need a list of
jsonfiles in the format of
[
{
"id": "audio_id",
"sound": "/abs/path/to/wav",
"duration": 10.0,
"conversations": [{"from": "human", "value": "<sound>\nQuestion"}, {"from": "gpt", "value": "Answer"}]
},
...- Text-to-audio: follow ETTA datasets (see here). We need a list of
jsonlfiles in the format of
{"location": "/abs/path/to/wav", "start_time":0.0, "end_time":10.0, "duration":10.0, "caption": "Caption", "sample_rate": 22050}(2) Process raw data into shards and tar files. No need to process text data in this step.
- Audio understanding: fill in
ualm/tools/object_storage_manifest/manifest_config_examples/config_AF.yaml - Text-to-audio generation: fill in
ualm/tools/object_storage_manifest/manifest_config_examples/config_ETTA.yaml
Remember to fill in all paths properly whenever there is /path/to/... in the yaml. Then run
python batch_create_manifests.py --config manifest_config_examples/config_AF.yaml
python batch_create_manifests.py --config manifest_config_examples/config_ETTA.yaml(3) Prepare UALM manifest for each experiment
- Make a symlink
recipes/ualm_all_task/ualm/.tmpto map to.tmp/. - Fill in yaml files in
ualm/tools/tar_to_ualm_manifest_converter/manifest_config_examples. You could (and should) put every thing into a single yaml (somanifests.trainis a list of all datasets you want to train on). - Then run
python tools/tar_to_ualm_manifest_converter/convert_tar_to_ualm_manifest.py \
--config tools/tar_to_ualm_manifest_converter/manifest_config_examples/config_NAME.yaml \
--output-dir .tmp/manifest_NAME
(1) Go to the directory.
cd recipes/ualm_all_task/ualmNote that ualm_all_task is the experiment name that you can change but do not change the ualm name after it. This experiment name is ideal for managing experiments with major differences (e.g. a TTA-only model vs a multi-task model).
(2) Train the model
bash launch.shNote that exp_dir in launch.sh is a separate experiment name that the previous name. It is ideal to distinguish different training parameters such as number of nodes.
(3) Inference
bash inference.shSee Dockerfile for exact docker image creation. Alternatively, below is the environment based on Conda.
(1) Ensure you have a valid Python environment.
(2) Install Pytorch. A newer version is appreciated.
pip install torch==2.7.1 torchvision==0.22.1 torchaudio==2.7.1 --index-url https://download.pytorch.org/whl/cu128(3) Install dependencies
pip install -r requirement.txt(4) Install Flash attention. Recommend to build from source
# from pre-built wheel
pip install flash-attn --no-build-isolation
# or from source
git clone https://github.com/Dao-AILab/flash-attention.git
cd flash-attention
python setup.py install(5) Install TorchCodec.
pip install torchcodec --index-url=https://download.pytorch.org/whl/cu129- Current training config points to the Qwen2.5-Omni-7B tokenizer for audio inputs. You could switch to AF-Whisper (Audio Flamingo 3 audio encoder) similar to this thread, store it to a HF checkpoint, and change
encoder_hf_model_tagvalue to/path/to/huggingface_cache/AF-Whisper.
@inproceedings{
tian2026ualm,
title={{UALM}: Unified Audio Language Model for Understanding, Generation and Reasoning},
author={Jinchuan Tian and Sang-gil Lee and Zhifeng Kong and Sreyan Ghosh and Arushi Goel and Chao-Han Huck Yang and Wenliang Dai and Zihan Liu and Hanrong Ye and Shinji Watanabe and Mohammad Shoeybi and Bryan Catanzaro and Rafael Valle and Wei Ping},
booktitle={The Fourteenth International Conference on Learning Representations},
year={2026},
url={https://openreview.net/forum?id=TsdlOjcQNu}
}
The code structure is based on ESPNet.