
CLAP HTSAT Fused

laion/clap-htsat-fused

published Feb 2023 · updated Jan 2026

CLAP HTSAT Fused is a contrastive language-audio pretraining model that performs zero-shot audio classification, text-to-audio retrieval, and audio/text feature extraction.

status

coming soon

API providers

downloads / mo

13.3M

license

apache-2.0

specs

Task	Zero-shot audio classification, text-to-audio retrieval, audio/text feature extraction
Architecture	HTSAT audio encoder + RoBERTa text encoder with feature fusion and keyword-to-caption augmentation
Parameters	Not specified

about this model

LAION-CLAP (Contrastive Language-Audio Pretraining) with the HTSAT-fused audio encoder is a contrastive model that learns joint representations of audio and natural language descriptions and accepts audio inputs of variable length. It was trained on the LAION-Audio-630K dataset, which comprises 633,526 audio-text pairs totaling 4,325.39 hours of audio drawn from eight sources (Freesound, Epidemic Sound, Audiostock, BBC Sound Effects, Free To Use Sounds, Sonniss Game Effects, We Sound Effects, and Paramount Motion Sound Effects). The model uses HTSAT as its audio encoder and RoBERTa as its text encoder. It incorporates feature fusion to handle variable-length inputs and keyword-to-caption augmentation to enhance performance.

Key capabilities

Zero-shot audio classification: The accompanying paper reports state-of-the-art zero-shot results at the time of publication, with no task-specific fine-tuning required.
Supervised audio classification: When fine-tuned, the model reaches performance comparable to dedicated supervised models.
Text-to-audio retrieval: The paper reports superior results on retrieval benchmarks, which makes the model suitable for natural language queries against audio databases.
Variable-length audio support: The feature fusion mechanism allows the model to process audio clips of varying duration.
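
In a contrastive model of this kind, zero-shot classification amounts to comparing one audio embedding against the text embeddings of the candidate labels. The sketch below illustrates only that scoring step, using hand-written toy vectors in place of real encoder outputs. The function names, the vectors, and the `logit_scale` value are assumptions made for illustration, not values taken from the model.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def zero_shot_scores(audio_emb, label_embs, logit_scale=10.0):
    """Softmax over scaled cosine similarities, one probability per label.

    logit_scale is a hypothetical temperature; a trained model learns its own.
    """
    logits = {
        label: logit_scale * cosine(audio_emb, emb)
        for label, emb in label_embs.items()
    }
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {label: math.exp(value - peak) for label, value in logits.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}

# Toy embeddings standing in for encoder outputs (not real model values).
audio_emb = [0.9, 0.1, 0.0]
label_embs = {
    "dog barking": [1.0, 0.0, 0.0],
    "vacuum cleaner": [0.0, 1.0, 0.0],
}
scores = zero_shot_scores(audio_emb, label_embs)
best = max(scores, key=scores.get)
```

Because the labels are just text, the candidate set can be changed at inference time without retraining, which is what makes the approach zero-shot.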

Training data

The LAION-Audio-630K dataset includes both publicly released subsets (BBC Sound Effects, Epidemic Sound, Audiostock, and Freesound) and licensed subsets (Free To Use Sounds, Sonniss Game Effects, We Sound Effects, and Paramount Motion Sound Effects) that are not publicly distributed. The Freesound component is available in two variants: a full version, and a version from which samples that overlap with common evaluation benchmarks (ESC-50, FSD50K, UrbanSound8K, Clotho) have been removed.

Benchmark results

As reported in the accompanying paper (Wu et al., 2022), the model achieves state-of-the-art results on zero-shot audio classification tasks and performs competitively on supervised audio classification when compared to non-zero-shot models. The paper also reports superior text-to-audio retrieval performance on its evaluation datasets.

best for

· Zero-shot classification of audio events (e.g., dog barking, vacuum cleaner)
· Text-to-audio retrieval: finding audio clips matching a natural language query
· Extracting audio and text embeddings for downstream multimodal tasks
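
Once audio and text embeddings share one space, text-to-audio retrieval reduces to a nearest-neighbor ranking. The following is a minimal sketch of that ranking step, assuming the embeddings have already been extracted. The clip names, the vectors, and the `retrieve` helper are hypothetical and stand in for real model outputs.

```python
import math

def normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def retrieve(query_emb, audio_index, top_k=2):
    """Return the top_k clip ids ranked by cosine similarity to the query."""
    q = normalize(query_emb)
    scored = []
    for clip_id, emb in audio_index.items():
        sim = sum(a * b for a, b in zip(q, normalize(emb)))
        scored.append((sim, clip_id))
    scored.sort(reverse=True)  # highest similarity first
    return [clip_id for _, clip_id in scored[:top_k]]

# Hypothetical precomputed audio embeddings keyed by clip id.
audio_index = {
    "rain.wav": [0.1, 0.9, 0.2],
    "dog.wav": [0.95, 0.05, 0.1],
    "traffic.wav": [0.4, 0.4, 0.8],
}
query_emb = [1.0, 0.0, 0.1]  # stand-in for the text embedding of a query
results = retrieve(query_emb, audio_index)
```

For a large audio collection, the linear scan here would typically be replaced by an approximate nearest-neighbor index; the ranking criterion stays the same.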

FAQ

What input format does the zero-shot audio classification pipeline expect?

Raw audio waveform (as a numpy array) and a list of candidate label strings.

What output does the model produce for zero-shot classification?

A list of dictionaries with keys "score" and "label", ranked by confidence.
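
The input and output formats described in the two answers above match the Hugging Face `transformers` zero-shot audio classification pipeline. The sketch below assumes `transformers` with a PyTorch backend is installed and that the model weights can be downloaded. The `classify` and `top_label` helpers and the example scores are illustrative, not part of any library.

```python
def classify(audio, candidate_labels, model_id="laion/clap-htsat-fused"):
    """Run zero-shot audio classification on one waveform.

    `audio` is a 1-D numpy waveform, assumed to be already resampled to the
    rate the model's feature extractor expects. Returns a list of
    {"score": float, "label": str} dictionaries.
    """
    # Imported lazily because transformers is a heavy optional dependency.
    from transformers import pipeline

    classifier = pipeline(task="zero-shot-audio-classification", model=model_id)
    return classifier(audio, candidate_labels=candidate_labels)

def top_label(results):
    """Pick the label with the highest score from the pipeline output."""
    return max(results, key=lambda r: r["score"])["label"]

# Shape of the output described above (these scores are made up).
example_output = [
    {"score": 0.98, "label": "dog barking"},
    {"score": 0.02, "label": "vacuum cleaner"},
]
```

A call such as `classify(waveform, ["dog barking", "vacuum cleaner"])` would return a list shaped like `example_output`, and `top_label` then selects the most likely label.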

How can I call this model via the gigarouter API?

This model is not yet live on gigarouter. Once it is available, you will send a POST request to the OpenAI-compatible endpoint with your API key and the audio data. See the gigarouter documentation for details.

What is the architecture of CLAP HTSAT Fused?

It uses HTSAT as the audio encoder and RoBERTa as the text encoder, combined with a feature fusion mechanism.

What dataset was this model trained on?

It was trained on LAION-Audio-630K, a collection of 633,526 audio-text pairs.

not yet live

We're benchmarking and onboarding CLAP HTSAT Fused as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.