SpotSound: Enhancing Large Audio-Language Models with Fine-Grained Temporal Grounding

Model Summary

SpotSound enhances Large Audio-Language Models (ALMs) with fine-grained temporal grounding. Built on top of Audio Flamingo 3, it pinpoints the start and end timestamps of specific acoustic events within long, untrimmed audio recordings, given a natural language query.

This model is particularly effective for "needle-in-a-haystack" audio retrieval tasks, where short target sounds are embedded within complex background noise.
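
To make the task concrete: a grounding prediction is a (start, end) interval in seconds, and a common way to compare it with a reference interval is temporal intersection-over-union (IoU). The sketch below is illustrative only; the function name `temporal_iou` is hypothetical, and this card does not state which metric SpotSound is evaluated with.

```python
def temporal_iou(pred, target):
    """Intersection-over-union of two (start, end) intervals, in seconds."""
    p_start, p_end = pred
    t_start, t_end = target
    if p_end < p_start or t_end < t_start:
        raise ValueError("each interval must satisfy start <= end")

    # Overlap is zero when the intervals are disjoint.
    intersection = max(0.0, min(p_end, t_end) - max(p_start, t_start))
    union = (p_end - p_start) + (t_end - t_start) - intersection
    return intersection / union if union > 0 else 0.0


# A 3 s prediction that overlaps a 3 s reference by 2 s.
print(temporal_iou((12.0, 15.0), (13.0, 16.0)))  # 0.5
```

Short target events make this measure strict: for a 1-second sound, a boundary error of half a second already removes a large share of the overlap.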

Usage / Quick Start

To use SpotSound for inference, you need to download both the base Audio Flamingo 3 model and the SpotSound checkpoint.

1. Installation

First, clone the official SpotSound GitHub repository, then set up the environment from the repository root:

conda create -n SpotSound python=3.10
conda activate SpotSound
pip install torch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 --index-url https://download.pytorch.org/whl/cu121
pip install -r requirements.txt
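
After installation, it can help to confirm that the pinned PyTorch packages are the ones actually present in the environment. The following is a minimal sketch using only the Python standard library; the helper name `check_pins` is hypothetical and not part of the SpotSound repository. It assumes that CUDA wheels may report a local version suffix such as `+cu121`, which is stripped before comparing.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the pip command above.
EXPECTED = {"torch": "2.1.2", "torchvision": "0.16.2", "torchaudio": "2.1.2"}


def check_pins(expected):
    """Return {package: (installed_version_or_None, matches_pin)}."""
    report = {}
    for name, pinned in expected.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None
        # Compare only the public version, ignoring any "+cu121"-style suffix.
        ok = installed is not None and installed.split("+")[0] == pinned
        report[name] = (installed, ok)
    return report


if __name__ == "__main__":
    for name, (installed, ok) in check_pins(EXPECTED).items():
        status = "OK" if ok else "MISMATCH"
        print(f"{name}: installed={installed} expected={EXPECTED[name]} [{status}]")
```

Run it inside the activated SpotSound environment; a package reported as `None` is not installed there.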

2. Run Inference

You can run inference directly from the command line using the provided script in the GitHub repository. Specify the path to the downloaded Audio Flamingo 3 model, your SpotSound checkpoint, the target audio file, and your text query.

export CUDA_VISIBLE_DEVICES=0
python inference.py \
    --pretrain_model path_to_audioflamingo3 \
    --checkpoint ckpt/spotsound \
    --audio_path data/audio.wav \
    --query "dog barking"
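
To run several queries against the same recording, the command above can be wrapped in a small Python driver. This is a sketch under stated assumptions: it uses only the four flags shown above, the helper names `build_command` and `run_queries` are hypothetical, and because the output format of `inference.py` is not documented here, the raw standard output is returned unparsed.

```python
import subprocess
import sys


def build_command(pretrain_model, checkpoint, audio_path, query, script="inference.py"):
    """Build the argument list for one inference call (same flags as above)."""
    return [
        sys.executable, script,
        "--pretrain_model", pretrain_model,
        "--checkpoint", checkpoint,
        "--audio_path", audio_path,
        "--query", query,
    ]


def run_queries(pretrain_model, checkpoint, audio_path, queries):
    """Run the script once per query and collect its raw stdout per query."""
    outputs = {}
    for query in queries:
        cmd = build_command(pretrain_model, checkpoint, audio_path, query)
        # check=True raises CalledProcessError if the script exits with an error.
        result = subprocess.run(cmd, capture_output=True, text=True, check=True)
        outputs[query] = result.stdout
    return outputs


if __name__ == "__main__":
    results = run_queries(
        "path_to_audioflamingo3", "ckpt/spotsound", "data/audio.wav",
        ["dog barking", "door slamming"],
    )
    for query, text in results.items():
        print(f"=== {query} ===\n{text}")
```

The child process inherits the current environment, so `CUDA_VISIBLE_DEVICES` set in the shell still applies. Each call starts a new process and presumably reloads the model, so this pattern suits a handful of queries rather than large-scale evaluation.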

Citation

If you use SpotSound or our benchmark in your research, please cite our paper:

@article{sun2026spotsound,
    title={SpotSound: Enhancing Large Audio-Language Models with Fine-Grained Temporal Grounding},
    author={Sun, Luoyi and Zhou, Xiao and Li, Zeqian and Zhang, Ya and Wang, Yanfeng and Xie, Weidi},
    journal={arXiv preprint arXiv:2604.13023},
    year={2026}
}