models / image-to-text · coming soon

blip-image-captioning-large

Salesforce/blip-image-captioning-large

A popular open image-to-text model, with 752.9K downloads a month. gigarouter benchmarks and hosts it as an OpenAI-compatible API.

est. price

~$0.094

/ 1k images · estimated, set at launch

API providers

downloads / mo

752.9K

license

bsd-3-clause

about this model

Salesforce/blip-image-captioning-large is an image-to-text model that generates descriptive captions for images. It uses a ViT large backbone and is pretrained on the COCO dataset. As part of the BLIP framework, it applies bootstrapping—a captioner generates synthetic captions and a filter removes noisy ones—to effectively leverage noisy web data.

Capabilities

The model supports both conditional and unconditional image captioning. It is designed for unified vision-language understanding and generation tasks, including image-text retrieval, image captioning, and visual question answering (VQA).

Performance

Image-text retrieval: +2.7% improvement in average recall@1
Image captioning: +2.8% improvement in CIDEr score
VQA: +1.6% improvement in VQA score

BLIP also demonstrates strong zero-shot generalization to video-language tasks.

Best for

Developers seeking a single model that excels across multiple vision-language benchmarks, particularly for image captioning and retrieval. The model is hosted as a managed API on gigarouter, accessible via OpenAI-compatible endpoints.

not yet live

We're benchmarking and onboarding blip-image-captioning-large as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.