blip-image-captioning-base
Salesforce/blip-image-captioning-base
A popular open image-to-text model, with 1.9M downloads a month. gigarouter benchmarks and hosts it as an OpenAI-compatible API.
about this model
Salesforce/blip-image-captioning-base is an image-to-text model that generates natural language captions for images. It uses a Vision Transformer (ViT) base backbone and is pretrained on the COCO dataset.
Key capabilities
The model supports both conditional and unconditional image captioning. It is built on the BLIP (Bootstrapping Language-Image Pre-training) framework, which unifies vision-language understanding and generation tasks. BLIP improves performance by bootstrapping captions from noisy web data: a captioner generates synthetic captions and a filter removes low-quality ones.
Best for
- Image captioning (conditional and unconditional)
- Image-text retrieval
- Visual question answering (VQA)
- Zero-shot transfer to video-language tasks
Benchmark results (from the original paper)
| Task | Improvement |
|---|---|
| Image-text retrieval | +2.7% average recall@1 |
| Image captioning | +2.8% CIDEr |
| VQA | +1.6% VQA score |
Example output
The model can generate captions such as “a woman sitting on the beach with her dog” from an input image. (Example image source: demo image.)
Hosted on gigarouter as a managed API, this model provides a single endpoint for image captioning, retrieval, and related vision-language tasks.
We're benchmarking and onboarding blip-image-captioning-base as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.