models / image-to-text · coming soon

blip-image-captioning-base

Salesforce/blip-image-captioning-base

A popular open image-to-text model, with 1.9M downloads a month. gigarouter benchmarks and hosts it as an OpenAI-compatible API.

status

coming soon

API providers

downloads / mo

1.9M

license

bsd-3-clause

about this model

Salesforce/blip-image-captioning-base is an image-to-text model that generates natural language captions for images. It uses a Vision Transformer (ViT) base backbone and is pretrained on the COCO dataset.

Key capabilities

The model supports both conditional and unconditional image captioning. It is built on the BLIP (Bootstrapping Language-Image Pre-training) framework, which unifies vision-language understanding and generation tasks. BLIP improves performance by bootstrapping captions from noisy web data: a captioner generates synthetic captions and a filter removes low-quality ones.

Best for

Image captioning (conditional and unconditional)
Image-text retrieval
Visual question answering (VQA)
Zero-shot transfer to video-language tasks

Benchmark results (from the original paper)

Task	Improvement
Image-text retrieval	+2.7% average recall@1
Image captioning	+2.8% CIDEr
VQA	+1.6% VQA score

Example output

The model can generate captions such as “a woman sitting on the beach with her dog” from an input image. (Example image source: demo image.)

Hosted on gigarouter as a managed API, this model provides a single endpoint for image captioning, retrieval, and related vision-language tasks.

not yet live

We're benchmarking and onboarding blip-image-captioning-base as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.