
clip-vit-base-patch32

openai/clip-vit-base-patch32

A widely used open zero-shot image classification model, with 22.3M downloads a month. gigarouter is benchmarking it and preparing to host it as an OpenAI-compatible API.

status

coming soon

API providers

downloads / mo

22.3M

about this model

The openai/clip-vit-base-patch32 model performs zero-shot image classification by computing the similarity between an image and a set of natural language text descriptions. It uses a ViT-B/32 Transformer as an image encoder and a masked self-attention Transformer as a text encoder. The two encoders are trained with a contrastive loss that maximizes the similarity of matching (image, text) pairs, using publicly available image-caption data (including YFCC100M and web crawls). OpenAI developed the model as a research output to study robustness and generalization in computer vision.
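To make the contrastive objective concrete, here is a minimal sketch in plain Python. It assumes the symmetric cross-entropy formulation over a batch similarity matrix and a fixed temperature of 0.07 (in the real model the temperature is a learned parameter). The function names and the toy two-dimensional embeddings are hypothetical and are not part of any gigarouter or OpenAI API.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def log_softmax(row):
    """Numerically stable log-softmax of a list of floats."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric cross-entropy over an N x N cosine-similarity matrix.

    Pair i (image_embs[i], text_embs[i]) is the positive example;
    every other pairing in the batch acts as a negative.
    The fixed temperature is an assumption made for this sketch.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    # logits[r][c] = scaled cosine similarity of image r and text c
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
        for img in imgs
    ]
    # Image-to-text direction: each row should peak on the diagonal.
    loss_i2t = -sum(log_softmax(logits[k])[k] for k in range(n)) / n
    # Text-to-image direction: each column should peak on the diagonal.
    columns = [[logits[r][c] for r in range(n)] for c in range(n)]
    loss_t2i = -sum(log_softmax(columns[k])[k] for k in range(n)) / n
    return (loss_i2t + loss_t2i) / 2


# Toy batch: matched pairs should score a much lower loss than swapped pairs.
images = [[1.0, 0.0], [0.0, 1.0]]
matched_texts = [[1.0, 0.0], [0.0, 1.0]]
swapped_texts = [[0.0, 1.0], [1.0, 0.0]]
print(contrastive_loss(images, matched_texts))
print(contrastive_loss(images, swapped_texts))
```

Training pushes the loss down by pulling each image toward its own caption and away from the other captions in the batch, which is what later allows text labels to act as classifiers.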

Key Strengths

Enables classification over arbitrary label sets without task-specific training: provide the candidate labels as text.
Evaluated on a wide range of benchmarks including ImageNet, CIFAR-10/100, Food101, SUN397, Stanford Cars, FGVC Aircraft, DTD, MNIST, SVHN, MSCOCO, and many others (see the CLIP paper for full results).
Shows strong cross-modal understanding for tasks from OCR to texture recognition.
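The "provide candidate labels as text" step can be sketched as follows. This is an illustration under stated assumptions, not the hosted API: the embeddings are made-up three-dimensional vectors standing in for encoder outputs, the "a photo of a {}" prompt template and the logit scale of 100 are assumptions, and the helper names are hypothetical.

```python
import math


def build_prompts(labels, template="a photo of a {}"):
    """Turn bare class names into text prompts (the template is an assumption)."""
    return [template.format(label) for label in labels]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_probs(image_emb, label_embs, logit_scale=100.0):
    """Softmax over scaled image-text similarities.

    label_embs maps each label to the embedding of its text prompt.
    logit_scale is an assumed constant for this sketch.
    """
    labels = list(label_embs)
    logits = [logit_scale * cosine(image_emb, label_embs[l]) for l in labels]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return {label: e / total for label, e in zip(labels, exps)}


# Hypothetical embeddings; a real pipeline would get these from the encoders.
print(build_prompts(["cat", "dog"]))
image_emb = [0.9, 0.1, 0.2]
label_embs = {
    "cat": [1.0, 0.0, 0.1],
    "dog": [0.0, 1.0, 0.1],
}
probs = zero_shot_probs(image_emb, label_embs)
print(max(probs, key=probs.get), probs)
```

Changing the label set only changes the text side of the comparison, so no retraining is involved. That is also why results depend on how the labels are phrased.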

Best For

Zero-shot image classification and research into model robustness, bias, and generalization. Due to variability in performance across different class taxonomies, the model is best suited for controlled, research-oriented use cases where the taxonomy is fixed and thoroughly tested.

Notable Benchmark Results

In evaluations on the FairFace dataset (as reported in the model card):

Gender classification accuracy exceeded 96% across all racial groups, with “Middle Eastern” highest (98.4%) and “White” lowest (96.5%).
Racial classification accuracy averaged ~93%.
Age classification accuracy averaged ~63%.

Limitations

CLIP struggles with fine-grained classification and with counting objects. Performance can shift significantly depending on how the candidate classes are constructed. The model card also reports significant disparities by race and gender in tests for denigration harms (e.g., classifying images of people into crime-related categories). As the authors note, any deployed use case requires thorough domain-specific testing.
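One simple form of domain-specific testing is to break accuracy down by subgroup rather than report a single average. The sketch below assumes you already have (group, true label, predicted label) records from your own evaluation set. The function names and the sample records are hypothetical and do not reproduce any reported result.

```python
from collections import defaultdict


def accuracy_by_group(records):
    """Compute accuracy separately for each group.

    records: iterable of (group, true_label, predicted_label) tuples.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for group, truth, pred in records:
        total[group] += 1
        correct[group] += int(truth == pred)
    return {group: correct[group] / total[group] for group in total}


def accuracy_gap(per_group):
    """Difference between the best and worst group accuracy."""
    return max(per_group.values()) - min(per_group.values())


# Made-up records purely for illustration.
records = [
    ("group_a", "x", "x"),
    ("group_a", "x", "y"),
    ("group_b", "x", "x"),
    ("group_b", "y", "y"),
]
per_group = accuracy_by_group(records)
print(per_group, accuracy_gap(per_group))
```

Running the same breakdown for each candidate label set shows whether a change in taxonomy widens or narrows the gap between groups.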

not yet live

We're benchmarking and onboarding clip-vit-base-patch32 as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.