UALM

PyTorch Implementation of UALM

This repo contains PyTorch implementation of the following ICLR 2026 Oral paper:

UALM: Unified Audio Language Model for Understanding, Generation, and Reasoning

Authors: Jinchuan Tian*, Sang-gil Lee*, Zhifeng Kong*, Sreyan Ghosh, Arushi Goel, Chao-Han Huck Yang, Wenliang Dai, Zihan Liu, Hanrong Ye, Shinji Watanabe, Mohammad Shoeybi, Bryan Catanzaro, Rafael Valle, Wei Ping

UALM is an advanced Audio-Language Model that unifies text and audio tasks including: text problem solving, audio understanding, text-to-audio generation, and multimodal reasoning across modalities. UALM matches the quality of state-of-the-art specialized models in each task, and is the first demonstration of cross-modal generative reasoning in the audio research domain.

Data Preparation

Data preparation is perhaps the most complicated part of launching this model. Since there are too many files in audio-text modeling, we implemented a custom tarball-based storage (similar to WebDatasets) that is suitable for efficient audio storage and loading.

We have three types of data: text-only, text-to-audio, and audio-understanding.

(1) Get raw data (most difficult)

Text-only: we collect data from NVIDIA's text datasets: dataset 1 and dataset 2. Each row of the jsonl is {"input": [{"role": "user", "content": "Question"}], "output": "Output"}.
Audio understanding: follow Audio Flamingo 3 datasets (see here). We need a list of json files in the format of

[
  {
    "id": "audio_id",
    "sound": "/abs/path/to/wav",
    "duration": 10.0,
    "conversations": [{"from": "human", "value": "<sound>\nQuestion"}, {"from": "gpt", "value": "Answer"}]
  },
  ...

Text-to-audio: follow ETTA datasets (see here). We need a list of jsonl files in the format of

{"location": "/abs/path/to/wav", "start_time":0.0, "end_time":10.0, "duration":10.0, "caption": "Caption", "sample_rate": 22050}

(2) Process raw data into shards and tar files. No need to process text data in this step.

Audio understanding: fill in ualm/tools/object_storage_manifest/manifest_config_examples/config_AF.yaml
Text-to-audio generation: fill in ualm/tools/object_storage_manifest/manifest_config_examples/config_ETTA.yaml

Remember to fill in all paths properly whenever there is /path/to/... in the yaml. Then run

python batch_create_manifests.py --config manifest_config_examples/config_AF.yaml
python batch_create_manifests.py --config manifest_config_examples/config_ETTA.yaml

(3) Prepare UALM manifest for each experiment

Make a symlink recipes/ualm_all_task/ualm/.tmp to map to .tmp/.
Fill in yaml files in ualm/tools/tar_to_ualm_manifest_converter/manifest_config_examples. You could (and should) put every thing into a single yaml (so manifests.train is a list of all datasets you want to train on).
Then run

python tools/tar_to_ualm_manifest_converter/convert_tar_to_ualm_manifest.py \
    --config tools/tar_to_ualm_manifest_converter/manifest_config_examples/config_NAME.yaml \
    --output-dir .tmp/manifest_NAME

Launch the training and inference:

(1) Go to the directory.

cd recipes/ualm_all_task/ualm

Note that ualm_all_task is the experiment name that you can change but do not change the ualm name after it. This experiment name is ideal for managing experiments with major differences (e.g. a TTA-only model vs a multi-task model).

(2) Train the model

bash launch.sh

Note that exp_dir in launch.sh is a separate experiment name that the previous name. It is ideal to distinguish different training parameters such as number of nodes.

(3) Inference

bash inference.sh

Environment

See Dockerfile for exact docker image creation. Alternatively, below is the environment based on Conda.

Local Installation (miniconda)

(1) Ensure you have a valid Python environment.

(2) Install Pytorch. A newer version is appreciated.

pip install torch==2.7.1 torchvision==0.22.1 torchaudio==2.7.1 --index-url https://download.pytorch.org/whl/cu128

(3) Install dependencies

pip install -r requirement.txt

(4) Install Flash attention. Recommend to build from source

# from pre-built wheel
pip install flash-attn --no-build-isolation 
# or from source
git clone https://github.com/Dao-AILab/flash-attention.git
cd flash-attention
python setup.py install

(5) Install TorchCodec.

pip install torchcodec --index-url=https://download.pytorch.org/whl/cu129

Additional Notes

Current training config points to the Qwen2.5-Omni-7B tokenizer for audio inputs. You could switch to AF-Whisper (Audio Flamingo 3 audio encoder) similar to this thread, store it to a HF checkpoint, and change encoder_hf_model_tag value to /path/to/huggingface_cache/AF-Whisper.

Citation

@inproceedings{
    tian2026ualm,
    title={{UALM}: Unified Audio Language Model for Understanding, Generation and Reasoning},
    author={Jinchuan Tian and Sang-gil Lee and Zhifeng Kong and Sreyan Ghosh and Arushi Goel and Chao-Han Huck Yang and Wenliang Dai and Zihan Liu and Hanrong Ye and Shinji Watanabe and Mohammad Shoeybi and Bryan Catanzaro and Rafael Valle and Wei Ping},
    booktitle={The Fourteenth International Conference on Learning Representations},
    year={2026},
    url={https://openreview.net/forum?id=TsdlOjcQNu}
}

Code Reference

The code structure is based on ESPNet.

Name		Name	Last commit message	Last commit date
parent directory ..
dataloader		dataloader
models		models
recipes/ualm_all_task/ualm		recipes/ualm_all_task/ualm
scripts		scripts
tools		tools
trainer		trainer
utils		utils
Dockerfile		Dockerfile
LICENSE		LICENSE
README.md		README.md
requirement.txt		requirement.txt
try_dataloader.py		try_dataloader.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

PyTorch Implementation of UALM

Data Preparation

Launch the training and inference:

Environment

Local Installation (miniconda)

Additional Notes

Citation

Code Reference

FilesExpand file tree

UALM

Directory actions

More options

Directory actions

More options

Latest commit

History

UALM

Folders and files

parent directory

README.md

PyTorch Implementation of UALM

Data Preparation

Launch the training and inference:

Environment

Local Installation (miniconda)

Additional Notes

Citation

Code Reference