clip-vit-large-patch14-336
openai/clip-vit-large-patch14-336
A popular open zero-shot image classification model, with 3.4M downloads a month. Gigarouter benchmarks and hosts it as an OpenAI-compatible API.
About This Model
The openai/clip-vit-large-patch14-336 model is a large-scale vision-language model (CLIP) designed for zero-shot image classification. Its image encoder is a ViT-L/14 (a Large Vision Transformer that splits the image into 14x14 pixel patches) operating at a 336x336 pixel input resolution.
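As a quick arithmetic check of what the model name implies, the sketch below computes how many patches a 336x336 input produces at a patch size of 14. The helper name `vit_token_count` is hypothetical, and the extra class token is an assumption based on the standard ViT design; the model card itself does not state a sequence length.

```python
def vit_token_count(image_size: int, patch_size: int, class_token: bool = True) -> int:
    """Count the tokens a ViT image encoder sees for a square input.

    Hypothetical helper for illustration only; not part of any library.
    """
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size   # 336 // 14 = 24
    num_patches = patches_per_side ** 2           # 24 * 24 = 576
    # Assumption: one extra class token is prepended, as in the standard ViT.
    return num_patches + (1 if class_token else 0)


print(vit_token_count(336, 14, class_token=False))  # 576 image patches
print(vit_token_count(336, 14))                     # 577 tokens with class token
```

For comparison, the same arithmetic at a 224x224 input gives 16 patches per side, so the higher resolution more than doubles the number of patches the encoder processes.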
Capabilities
This model performs zero-shot image classification by matching an input image against a set of natural language labels. It does not require task-specific fine-tuning; classification is performed by computing the cosine similarity between image and text embeddings.
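The matching step described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the model's actual implementation: the three-dimensional embeddings are made-up toy values standing in for real image and text encoder outputs, `zero_shot_classify` is a hypothetical helper, and the `scale` of 100 is an assumed temperature used to sharpen the softmax.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def zero_shot_classify(image_embedding, label_embeddings, scale=100.0):
    """Return a label -> probability mapping (hypothetical helper).

    `scale` is an assumed temperature, not a value taken from the model card.
    """
    labels = list(label_embeddings)
    sims = [cosine_similarity(image_embedding, label_embeddings[name]) for name in labels]
    probs = softmax([scale * s for s in sims])
    return dict(zip(labels, probs))


# Toy embeddings: placeholders for what the image and text encoders would produce.
image_embedding = [0.9, 0.1, 0.2]
label_embeddings = {
    "a photo of a cat": [1.0, 0.0, 0.1],
    "a photo of a dog": [0.0, 1.0, 0.1],
    "a photo of a car": [0.1, 0.1, 1.0],
}

probs = zero_shot_classify(image_embedding, label_embeddings)
best = max(probs, key=probs.get)
print(best)
```

Because the label set is just a dictionary of text embeddings, changing the candidate classes only requires embedding new label strings; no retraining is involved.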
Training Details
- Trained from scratch on an unspecified dataset.
- Optimizer: none listed
- Training precision: float32
- Framework versions: Transformers 4.21.3, TensorFlow 2.8.2, Tokenizers 0.12.1
Use Cases
Best suited for zero-shot image classification across arbitrary label sets, open-domain visual recognition, and multimodal retrieval tasks where pre-defined categories are not available.
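For the retrieval use case, the same embeddings can be used in the other direction: one text query is scored against a gallery of image embeddings and the best matches are returned. The sketch below assumes the embeddings have already been computed; the file names, vectors, and the `top_k_images` helper are all hypothetical.

```python
import heapq
import math


def l2_normalize(vector):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]


def top_k_images(query_embedding, gallery, k=2):
    """Return the k (score, image_id) pairs most similar to the query.

    Hypothetical helper; `gallery` maps image ids to precomputed embeddings.
    """
    query = l2_normalize(query_embedding)
    scored = []
    for image_id, embedding in gallery.items():
        unit = l2_normalize(embedding)
        score = sum(q * e for q, e in zip(query, unit))
        scored.append((score, image_id))
    # nlargest returns the results sorted from highest to lowest score.
    return heapq.nlargest(k, scored)


# Toy gallery: placeholders for image embeddings produced by the image encoder.
gallery = {
    "beach.jpg": [0.9, 0.1, 0.0],
    "forest.jpg": [0.1, 0.9, 0.1],
    "city.jpg": [0.0, 0.2, 0.9],
}
# Placeholder for the text embedding of a query such as "a sunny shoreline".
query_embedding = [0.8, 0.3, 0.0]

results = top_k_images(query_embedding, gallery, k=2)
for score, image_id in results:
    print(f"{image_id}: {score:.3f}")
```

A linear scan like this is fine for small galleries; larger collections would typically normalize the embeddings once up front and use an approximate nearest-neighbor index instead.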
Benchmark Results
No benchmark results are provided in the model card. Performance on standard zero-shot benchmarks (e.g., ImageNet) is not reported.
Hosted API
Gigarouter offers this model as a managed, OpenAI-compatible API: users send images and candidate text labels via REST calls without managing infrastructure or installing dependencies.
We're currently benchmarking and onboarding clip-vit-large-patch14-336. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.