SpotSound: Enhancing Large Audio-Language Models with Fine-Grained Temporal Grounding
Model Summary
SpotSound equips Large Audio-Language Models (ALMs) with fine-grained temporal grounding. Built on top of Audio Flamingo 3, it takes a natural language query and pinpoints the start and end timestamps of the matching acoustic event within a long, untrimmed audio recording.
This model is particularly effective for "needle-in-a-haystack" audio retrieval tasks, where short target sounds are embedded within complex background noise.
Usage / Quick Start
To use SpotSound for inference, you need to download both the base Audio Flamingo 3 model and the SpotSound checkpoint.
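As a rough sketch of the download step, the snippet below fetches both repositories with `huggingface_hub.snapshot_download`. It assumes that the `huggingface_hub` package is installed, that the SpotSound weights are hosted under the `Loie/SpotSound` repository id of this model page, and that `nvidia/audio-flamingo-3-hf` is the base-model variant the inference script expects. The local directory names are illustrative choices, not paths required by the repository.

```python
from pathlib import Path

# Repo ids come from this model page; the local directories are hypothetical.
DOWNLOADS = {
    "nvidia/audio-flamingo-3-hf": Path("ckpt/audioflamingo3"),
    "Loie/SpotSound": Path("ckpt/spotsound"),
}


def download_all(downloads=DOWNLOADS):
    """Download each repository snapshot and return {repo_id: local_path}."""
    # Imported lazily so the mapping above can be inspected without the package.
    from huggingface_hub import snapshot_download

    paths = {}
    for repo_id, local_dir in downloads.items():
        local_dir.mkdir(parents=True, exist_ok=True)
        paths[repo_id] = snapshot_download(repo_id=repo_id, local_dir=str(local_dir))
    return paths
```

Calling `download_all()` needs network access and enough disk space for both models. The returned paths can then be passed to the inference script.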
1. Installation
First, clone the official SpotSound GitHub repository. Then, from the repository root, create the conda environment and install the dependencies:
conda create -n SpotSound python=3.10
conda activate SpotSound
pip install torch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 --index-url https://download.pytorch.org/whl/cu121
pip install -r requirements.txt
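To confirm that the environment matches the pinned versions, a small check along the following lines may help. It is a sketch that uses only the standard library; the `check_versions` helper is a hypothetical name, not part of the SpotSound repository.

```python
from importlib import metadata

# Versions pinned by the pip command above.
PINNED = {"torch": "2.1.2", "torchvision": "0.16.2", "torchaudio": "2.1.2"}


def check_versions(pinned=PINNED):
    """Return {package: (expected, installed_or_None, matches)}."""
    report = {}
    for name, expected in pinned.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        # CUDA wheels carry a local suffix such as "2.1.2+cu121";
        # compare only the public part of the version string.
        matches = installed is not None and installed.split("+")[0] == expected
        report[name] = (expected, installed, matches)
    return report


for name, (expected, installed, ok) in check_versions().items():
    status = "ok" if ok else "MISMATCH"
    print(f"{name}: expected {expected}, found {installed} [{status}]")
```

A mismatch does not necessarily mean inference will fail, but it is a reasonable first thing to rule out when debugging import or CUDA errors.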
2. Run Inference
You can run inference directly from the command line using the provided script in the GitHub repository. Specify the path to the downloaded Audio Flamingo 3 model, your SpotSound checkpoint, the target audio file, and your text query.
export CUDA_VISIBLE_DEVICES=0
python inference.py \
--pretrain_model path_to_audioflamingo3 \
--checkpoint ckpt/spotsound \
--audio_path data/audio.wav \
--query "dog barking"
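When several queries need to be run against the same recording, the command above can be driven from Python. The wrapper below is a minimal sketch: it uses only the four flags shown above, and the helper names `build_command` and `run_queries` are hypothetical. The script's output format is not documented here, so the wrapper returns the raw standard output of each run rather than parsed timestamps.

```python
import os
import subprocess
import sys


def build_command(pretrain_model, checkpoint, audio_path, query, script="inference.py"):
    """Assemble the argument list for the inference script shown above."""
    return [
        sys.executable, script,
        "--pretrain_model", pretrain_model,
        "--checkpoint", checkpoint,
        "--audio_path", audio_path,
        "--query", query,
    ]


def run_queries(pretrain_model, checkpoint, audio_path, queries, gpu="0"):
    """Run the script once per query and return {query: captured stdout}."""
    # Mirror `export CUDA_VISIBLE_DEVICES=0` without touching the parent environment.
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=gpu)
    outputs = {}
    for query in queries:
        cmd = build_command(pretrain_model, checkpoint, audio_path, query)
        result = subprocess.run(cmd, env=env, capture_output=True, text=True, check=True)
        outputs[query] = result.stdout
    return outputs
```

For example, `run_queries("path_to_audioflamingo3", "ckpt/spotsound", "data/audio.wav", ["dog barking", "siren"])` launches the script twice, where "siren" is only an illustrative second query. Each call reloads the model, so this approach trades speed for simplicity.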
Citation
If you use SpotSound or our benchmark in your research, please cite our paper:
@article{sun2026spotsound,
title={SpotSound: Enhancing Large Audio-Language Models with Fine-Grained Temporal Grounding},
author={Sun, Luoyi and Zhou, Xiao and Li, Zeqian and Zhang, Ya and Wang, Yanfeng and Xie, Weidi},
journal={arXiv preprint arXiv:2604.13023},
year={2026}
}
Acknowledgements
This project builds upon several excellent open-source efforts, notably Audio Flamingo 3, which serves as its base model.
Contact
For any questions or issues, please contact: [email protected].
Model tree for Loie/SpotSound
Base model: nvidia/audio-flamingo-3-hf