
clip-vit-large-patch14-336

openai/clip-vit-large-patch14-336

A popular open zero-shot image classification model, with 3.4M downloads a month. Gigarouter is benchmarking it and plans to host it as an OpenAI-compatible API.

status

coming soon

API providers

downloads / mo

3.4M

about this model

The openai/clip-vit-large-patch14-336 model is a large-scale vision-language model (CLIP) designed for zero-shot image classification. It uses a ViT-L/14 architecture with a 336x336 pixel input resolution.
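As a quick worked example of what the name encodes: assuming the standard ViT scheme, in which the image is cut into non-overlapping square patches and a single class token is prepended, a 336x336 input with 14-pixel patches gives a 24x24 patch grid. The helper name `vit_sequence_length` is illustrative only and is not part of any library.

```python
def vit_sequence_length(image_size: int, patch_size: int, class_tokens: int = 1) -> int:
    """Number of tokens a ViT encoder sees for a square input image.

    Assumes non-overlapping square patches plus `class_tokens` extra tokens.
    """
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size
    return patches_per_side * patches_per_side + class_tokens


# 336 / 14 = 24 patches per side, so 24 * 24 = 576 patches plus 1 class token.
print(vit_sequence_length(336, 14))  # 577
# For comparison, the same patch size at 224x224 gives 16 * 16 + 1 tokens.
print(vit_sequence_length(224, 14))  # 257
```

The higher input resolution more than doubles the sequence length relative to 224x224, which is the main cost of the 336 variant at inference time.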

Capabilities

This model performs zero-shot image classification by matching an input image against a set of natural language labels. It does not require task-specific fine-tuning; classification is performed by computing the cosine similarity between image and text embeddings.
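The scoring step described above can be sketched in a few lines. This is a minimal illustration using hand-written 3-dimensional toy vectors in place of real CLIP embeddings; the function names and the `logit_scale` value of 100 are assumptions made for the example (the real model applies its own learned temperature), and nothing here calls the model itself.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def zero_shot_classify(image_embedding, label_embeddings, logit_scale=100.0):
    """Return {label: probability} from cosine similarities.

    logit_scale is an assumed temperature, not a value read from the model.
    """
    labels = list(label_embeddings)
    sims = [cosine_similarity(image_embedding, label_embeddings[l]) for l in labels]
    probs = softmax([logit_scale * s for s in sims])
    return dict(zip(labels, probs))


# Toy stand-ins for embeddings produced by the image and text encoders.
image_embedding = [0.9, 0.1, 0.2]
label_embeddings = {
    "a photo of a cat": [1.0, 0.0, 0.1],
    "a photo of a dog": [0.1, 1.0, 0.0],
    "a photo of a car": [0.0, 0.2, 1.0],
}

probs = zero_shot_classify(image_embedding, label_embeddings)
best = max(probs, key=probs.get)
print(best)  # a photo of a cat
```

Because the labels are ordinary text, changing the label set only means encoding new strings; no retraining is involved.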

Training Details

According to the model card, the model was trained from scratch on an unspecified dataset.
Optimizer: not specified
Training precision: float32
Framework versions: Transformers 4.21.3, TensorFlow 2.8.2, Tokenizers 0.12.1

Use Cases

Best suited for zero-shot image classification across arbitrary label sets, open-domain visual recognition, and multimodal retrieval tasks where pre-defined categories are not available.
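For the retrieval use case, the same embeddings can rank a gallery of images against a text query. The sketch below assumes the image embeddings have already been computed and uses toy 3-dimensional vectors as stand-ins; `rank_images`, the file names, and the query vector are all hypothetical.

```python
import math


def normalize(v):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def rank_images(query_embedding, image_embeddings, top_k=2):
    """Return the ids of the top_k images most similar to the query."""
    q = normalize(query_embedding)
    scored = []
    for image_id, embedding in image_embeddings.items():
        e = normalize(embedding)
        score = sum(a * b for a, b in zip(q, e))
        scored.append((score, image_id))
    scored.sort(reverse=True)  # highest similarity first
    return [image_id for _, image_id in scored[:top_k]]


# Toy gallery: in practice these would come from the image encoder.
gallery = {
    "beach.jpg": [0.9, 0.1, 0.0],
    "forest.jpg": [0.1, 0.9, 0.1],
    "city.jpg": [0.0, 0.2, 0.9],
}
# Toy stand-in for the text embedding of a query such as "a sunny beach".
query = [0.8, 0.3, 0.0]

results = rank_images(query, gallery)
print(results)  # ['beach.jpg', 'forest.jpg']
```

In a real deployment the gallery embeddings would typically be normalized once and stored, so that each query costs only one text encoding plus dot products.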

Benchmark Results

The model card provides no benchmark results; performance on standard zero-shot benchmarks such as ImageNet is not reported.

Hosted API

Gigarouter plans to host this model as a managed, OpenAI-compatible API. Once it is live, users will be able to send images and candidate text labels via REST calls without managing infrastructure or installing dependencies.

not yet live

We're benchmarking and onboarding clip-vit-large-patch14-336 as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.