blip-image-captioning-large
Salesforce/blip-image-captioning-large
A popular open image-to-text model, with 752.9K downloads a month. gigarouter benchmarks and hosts it as an OpenAI-compatible API.
about this model
Salesforce/blip-image-captioning-large is an image-to-text model that generates descriptive captions for images. It uses a ViT large backbone and is pretrained on the COCO dataset. As part of the BLIP framework, it applies bootstrapping—a captioner generates synthetic captions and a filter removes noisy ones—to effectively leverage noisy web data.
Capabilities
The model supports both conditional and unconditional image captioning. It is designed for unified vision-language understanding and generation tasks, including image-text retrieval, image captioning, and visual question answering (VQA).
Performance
- Image-text retrieval: +2.7% improvement in average recall@1
- Image captioning: +2.8% improvement in CIDEr score
- VQA: +1.6% improvement in VQA score
BLIP also demonstrates strong zero-shot generalization to video-language tasks.
Best for
Developers seeking a single model that excels across multiple vision-language benchmarks, particularly for image captioning and retrieval. The model is hosted as a managed API on gigarouter, accessible via OpenAI-compatible endpoints.
We're benchmarking and onboarding blip-image-captioning-large as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.