CLAP HTSAT Fused
laion/clap-htsat-fused
published Feb 2023 · updated Jan 2026
CLAP HTSAT Fused is a contrastive language-audio pretraining model that performs zero-shot audio classification, text-to-audio retrieval, and audio/text feature extraction.
specs
| Task | Zero-shot audio classification, text-to-audio retrieval, audio/text feature extraction |
| Architecture | HTSAT audio encoder + RoBERTa text encoder with feature fusion and keyword-to-caption augmentation |
| Parameters | Not specified |
about this model
LAION-CLAP (Contrastive Language-Audio Pretraining) with HTSAT-fused audio encoder is a zero-shot audio classification model that processes audio inputs of variable lengths and jointly learns representations from audio and natural language descriptions. It was trained on the LAION-Audio-630K dataset, which comprises 633,526 audio-text pairs totaling 4,325.39 hours of audio drawn from eight sources (Freesound, Epidemic Sound, Audiostock, BBC Sound Effects, Free To Use Sounds, Sonniss Game Effects, We Sound Effects, and Paramount Motion Sound Effects). The model uses HTSAT as its audio encoder and RoBERTa as its text encoder, and incorporates feature fusion and keyword-to-caption augmentation to improve performance on variable-length inputs.
Key capabilities
- Zero-shot audio classification: The authors report state-of-the-art results at the time of publication, with no task-specific fine-tuning required.
- Supervised audio classification: When fine-tuned, the model is reported to reach performance comparable to dedicated supervised models.
- Text-to-audio retrieval: The authors report leading results on retrieval benchmarks. Because audio and text share one embedding space, natural language queries can be matched against an audio collection.
- Variable-length audio support: A feature fusion mechanism lets the model process audio clips of differing durations.
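The zero-shot capability above can be tried locally with the Hugging Face `transformers` pipeline for this checkpoint. The following is a minimal sketch, not a reference implementation: the helper names (`build_classifier`, `classify`, `best_label`) are hypothetical, and the waveform is assumed to be a mono 1-D NumPy array already resampled to the rate the checkpoint expects (assumed here to be 48 kHz).

```python
from typing import Dict, List, Sequence

MODEL_ID = "laion/clap-htsat-fused"


def build_classifier():
    """Create the zero-shot audio classification pipeline.

    The import is done lazily so the other helpers work without
    transformers installed. Model weights are downloaded on first use.
    """
    from transformers import pipeline

    return pipeline(task="zero-shot-audio-classification", model=MODEL_ID)


def classify(classifier, waveform, candidate_labels: Sequence[str]) -> List[Dict]:
    """Score a waveform against free-text candidate labels.

    `waveform` is assumed to be a mono 1-D NumPy array at the sampling
    rate the checkpoint expects (assumption: 48 kHz).
    """
    return classifier(waveform, candidate_labels=list(candidate_labels))


def best_label(results: List[Dict]) -> str:
    """Pick the highest-scoring label from a list of {"score", "label"} dicts."""
    return max(results, key=lambda item: item["score"])["label"]
```

A typical call would be `classifier = build_classifier()` followed by `best_label(classify(classifier, waveform, ["dog barking", "vacuum cleaner"]))`. Keeping the pipeline object separate from the scoring call means the model is loaded once and reused across clips.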
Training data
The LAION-Audio-630K dataset includes both publicly released subsets (BBC Sound Effects, Epidemic Sound, Audiostock, and Freesound) and licensed subsets (Free To Use Sounds, Sonniss Game Effects, We Sound Effects, and Paramount Motion) that are not publicly distributed. The Freesound component is available in two variants: a full version and a version with overlapping samples removed from common evaluation benchmarks (ESC50, FSD50K, Urbansound8K, Clotho).
Benchmark results
As reported in the accompanying paper (Wu et al., 2022), the model achieves state-of-the-art results on zero-shot audio classification and is competitive with non-zero-shot models on supervised audio classification. The paper also reports leading text-to-audio retrieval results on standard evaluation datasets.
best for
- Zero-shot classification of audio events (e.g., dog barking, vacuum cleaner)
- Text-to-audio retrieval: finding audio clips matching a natural language query
- Extracting audio and text embeddings for downstream multimodal tasks
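Text-to-audio retrieval with embeddings usually comes down to ranking clips by similarity to the query. The sketch below shows that ranking step with cosine similarity, using only the standard library. The 3-dimensional vectors and file names are made-up placeholders. In practice the vectors would come from the model's audio and text encoders (for example `ClapModel.get_audio_features` and `ClapModel.get_text_features` in `transformers`) and have many more dimensions.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_clips(
    query_embedding: Sequence[float],
    clip_embeddings: Dict[str, Sequence[float]],
    top_k: int = 3,
) -> List[Tuple[str, float]]:
    """Return the top_k (clip name, similarity) pairs, best match first."""
    scored = [
        (name, cosine_similarity(query_embedding, embedding))
        for name, embedding in clip_embeddings.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy placeholder embeddings (hypothetical values, not model output).
clips = {
    "dog_bark.wav": [0.9, 0.1, 0.0],
    "vacuum.wav": [0.1, 0.9, 0.1],
    "rain.wav": [0.0, 0.2, 0.9],
}
# Stands in for the text embedding of a query such as "a dog barking".
query = [0.8, 0.2, 0.1]

ranking = rank_clips(query, clips, top_k=2)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

For a large collection, the audio embeddings would typically be computed once and stored, so that each new text query needs only one text-encoder pass plus the ranking step.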
FAQ
Input: a raw audio waveform (as a NumPy array) and a list of candidate label strings.
Output: a list of dictionaries with keys "score" and "label", ranked by confidence.
API access: send a POST request to the OpenAI-compatible endpoint with your API key and the audio data. See the gigarouter documentation for details.
Architecture: HTSAT as the audio encoder and RoBERTa as the text encoder, combined with a feature fusion mechanism.
Training data: LAION-Audio-630K, a collection of 633,526 audio-text pairs.
We're benchmarking and onboarding CLAP HTSAT Fused as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.