Sampling, the technique of reusing pieces of existing audio tracks to create new music content, is a very common practice in modern music production. In this paper, we tackle the challenging task of automatic sample identification, that is, detecting such sampled content and retrieving the material from which it originates. To do so, we adopt a self-supervised learning approach that leverages a multi-track dataset to create positive pairs of artificial mixes, and we design a novel contrastive learning objective. We show that such a method significantly outperforms previous state-of-the-art baselines, that it is robust to various genres, and that it scales well when increasing the number of noise songs in the reference database. In addition, we extensively analyze the contribution of the different components of our training pipeline and highlight, in particular, the need for high-quality separated stems for this task.
Alain Riou, Joan Serrà, & Yuki Mitsufuji.
A. Riou, J. Serrà, and Y. Mitsufuji (2025). Automatic Music Sample Identification with Multi-Track Contrastive Learning. arXiv:2510.11507.
[arxiv] [checkpoint]
For inference, the only required dependencies are PyTorch, Hydra, NumPy, SciPy, and tqdm. First, install them following their respective procedures. Then, clone and install this repository:
git clone https://github.com/sony/sampleid.git
cd sampleid
pip install -e .

Then, to use the available checkpoint in your own Python project, type the following:
import torch
from sampleid import SampleID
sampleid_model = SampleID.load_checkpoint()
x = torch.randn(3, 16000 * 5) # 5 seconds of audio at 16kHz (batch of 3 mono signals)
with torch.inference_mode():
    embeddings = sampleid_model(x, audio=True)

print(embeddings.shape)  # should be (batch_size, 1, embed_dim)

You can use a custom checkpoint by setting SampleID.load_checkpoint(ckpt_path="path/to/ckpt"). Otherwise, the checkpoint available on Zenodo will be loaded.
Note: the model always averages embeddings over time, regardless of the length of the audio input. If you want to compute embeddings per audio chunk of a full song, the chunks should be extracted manually beforehand (you can then process them in parallel as a batch).
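For example, here is a minimal sketch of such chunking using torch.Tensor.unfold. The 5-second chunk length, the 50% overlap, and the random 3-minute signal are arbitrary choices for illustration, not values prescribed by the model:

import torch
from sampleid import SampleID

sampleid_model = SampleID.load_checkpoint()

# Hypothetical full song: 3 minutes of mono audio at 16 kHz.
sample_rate = 16000
song = torch.randn(180 * sample_rate)

# Split into 5-second chunks with 50% overlap (illustrative values).
chunk_size = 5 * sample_rate
hop_size = chunk_size // 2
chunks = song.unfold(0, chunk_size, hop_size)  # (num_chunks, chunk_size)

# Process all chunks as a single batch; each chunk yields one time-averaged embedding.
with torch.inference_mode():
    chunk_embeddings = sampleid_model(chunks, audio=True)  # (num_chunks, 1, embed_dim)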
- Clone the repository:

git clone https://github.com/sony/sampleid.git

- Install all the dependencies.

- This model is intended to be trained on multi-track data. Before training, we start by pre-computing the activation masks and the list of sources for each audio file:
python src/compute_activations.py <path/to/your/dataset> <path/to/metadata>
By default, the metadata path is `data/`. You can define a validation and a test set by changing the flags at the beginning of the file.

By default, the initial dataset is expected to have the following structure:

your_dataset/
├── part1/
|   ├── song1/
|   |   ├── bass.wav
|   |   ├── lead.wav
|   |   ...
|   |   └── piano.wav
|   ├── song2/
|   |   ├── guitar1.wav
|   |   ├── guitar2.wav
|   |   └── drums + perc.wav
|   ...
├── part2/
|   ...
└── valid/
    └── whatever_song/
        ├── vocals.wav
        ├── back.wav
        ├── drums.wav
        └── fx.wav

If it is different but you don't want to touch your dataset, or if you want to filter out specific instruments, etc., update the `list_wav_files` function or the definition of `all_song_dirs` at the beginning of `main`.
- Write a `mydata.yaml` file in `configs/data` with the appropriate paths, following the given example.

- Train!
python src/train.py data=mydata model=resnet50 logger=csv
To use another logger, just replace `logger=csv` with `logger=tensorboard`, `logger=wandb`, etc.
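For intuition only, the sketch below shows the general idea behind training on multi-track data: two artificial mixes are built from disjoint random subsets of a song's stems and treated as a positive pair under a standard InfoNCE loss. The function names, the uniform stem split, and the plain InfoNCE formulation are illustrative assumptions; the actual pipeline (activation masks, audio effects, and the paper's contrastive objective) is implemented in src/ and differs in its details.

import torch
import torch.nn.functional as F

def make_positive_pair(stems: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    # stems: (num_stems, num_samples) waveforms of a single song.
    # Mix two disjoint random subsets of the stems into two "views" of the same song.
    perm = torch.randperm(stems.shape[0])
    half = stems.shape[0] // 2
    mix_a = stems[perm[:half]].sum(dim=0)
    mix_b = stems[perm[half:]].sum(dim=0)
    return mix_a, mix_b

def info_nce(z_a: torch.Tensor, z_b: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    # z_a, z_b: (batch, embed_dim) embeddings of the two mixes of each song.
    # Matching indices are positives; all other pairs in the batch act as negatives.
    z_a = F.normalize(z_a, dim=-1)
    z_b = F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.T / temperature    # (batch, batch) cosine similarities
    targets = torch.arange(z_a.shape[0])  # positives lie on the diagonal
    return F.cross_entropy(logits, targets)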
This repository builds upon the lightning-hydra-template which, as its name suggests, relies on PyTorch Lightning for training and Hydra for handling configurations. We refer to the corresponding documentation for more information.
Folder names are (hopefully) self-explanatory: configs are recursively defined in `configs/` while source code is implemented in `src/`. `scripts/` contains mostly evaluation scripts that are called either by the user or from the training process.
Within src, most folder names are clear. Just a few remarks:
- There are both `models/` and `networks/`: `models/` contains the logic (training loops, loss functions) while `networks/` contains the neural architectures (ResNet, etc.).
- Things implemented in `callbacks/` are excluded from the computation of the signature (see below). Therefore, nothing affecting the final results of the training should be implemented here.
- For most audio effects, we use the great pedalboard library. In `src/data/pedalboard.py`, we implement a subclass of the original class that enables randomizing the parameters of the effects on-the-fly, making it quite handy to use. An example of the syntax is provided in `configs/data/moisesdb.yaml`.
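For reference, here is a minimal sketch of what on-the-fly parameter randomization with pedalboard can look like. The effect chain, parameter ranges, and function name are illustrative and do not reproduce the actual class in src/data/pedalboard.py:

import random
import numpy as np
from pedalboard import Gain, Pedalboard, Reverb

def random_board() -> Pedalboard:
    # Re-sample the effect parameters every time a board is built.
    return Pedalboard([
        Gain(gain_db=random.uniform(-6.0, 6.0)),
        Reverb(room_size=random.uniform(0.1, 0.9)),
    ])

# Apply a freshly randomized chain to 5 seconds of mono audio at 16 kHz.
sample_rate = 16000
audio = np.random.randn(5 * sample_rate).astype(np.float32)
processed = random_board()(audio, sample_rate)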
For launching and managing different experiments, we use Dora.
We refer to Dora's documentation for advanced usage, but we provide the basic commands here.
To start training your model locally, instead of `python src/train.py`, you can type `dora run`.
If you are working on a SLURM-based cluster, type:
dora launch -p <partition_name> -g <num_gpus> data=mydata model=resnet50 [whatever hydra args...]

An interesting aspect of Dora is that it generates a hashed signature for each experiment based on its config. Checkpoints are stored according to these signatures, and they are also used as the default experiment name/id in Weights & Biases if you are using it. This means that the same command is used for starting a new experiment, restarting a failed or timed-out one, and checking the final results of a finished run. Signatures can be re-injected into YAML configs using ${dora:xp.sig}.
Bonus: When you are debugging, it does not create dozens of empty log folders.
A drawback is that you cannot do dirty hacks by overwriting YAML configs on-the-fly: it will mess up EVERYTHING.
Instead, do this:

- If you want to change a few parameters, use the Hydra command line.
- If you want to change many things, create a `configs/experiment/newconf.yaml` with the new options, then add `experiment=newconf` at the end of your command.
- If you want a slight variant of a previous xp with just one (or a few) different parameters, use Dora's `-f` option: `dora run/launch -f <previous xp sig> param=new_value`
Here is the performance of our model compared to the previous state of the art:
| Model | mAP | HR@1 | HR@10 |
|---|---|---|---|
| Cheston et al. (2025) | 0.441 | - | - |
| Bhattacharjee et al. (2025) | 0.442 | 0.155 | 0.191 |
| Ours | 0.603 | 0.587 | 0.733 |
| Ours + Top-5 retrieval | 0.622 | 0.600 | 0.747 |
Additional results are provided in the paper.
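As a reminder, HR@k counts a query as a hit when its ground-truth reference appears among its top-k most similar entries in the reference database. Below is a minimal sketch of this metric, assuming cosine similarity between L2-normalized embeddings; it is not the evaluation code from scripts/:

import torch
import torch.nn.functional as F

def hit_rate_at_k(queries: torch.Tensor, references: torch.Tensor,
                  ground_truth: torch.Tensor, k: int = 10) -> float:
    # queries:      (num_queries, embed_dim) query embeddings.
    # references:   (num_refs, embed_dim) reference database embeddings.
    # ground_truth: (num_queries,) index of the correct reference for each query.
    sims = F.normalize(queries, dim=-1) @ F.normalize(references, dim=-1).T
    topk = sims.topk(k, dim=-1).indices                      # (num_queries, k)
    hits = (topk == ground_truth.unsqueeze(-1)).any(dim=-1)  # (num_queries,)
    return hits.float().mean().item()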
@article{RiouSampleID,
  author  = {Alain Riou and Joan Serrà and Yuki Mitsufuji},
  title   = {Automatic Music Sample Identification with Multi-Track Contrastive Learning},
  journal = {arXiv preprint arXiv:2510.11507},
  url     = {https://arxiv.org/abs/2510.11507},
  year    = {2025}
}