models / vision-language · coming soon

Florence-2-base

microsoft/Florence-2-base

A popular open vision-language model, with 2.6M downloads a month. gigarouter benchmarks and hosts it as an OpenAI-compatible API.

est. price

~$0.094

/ 1k images · estimated, set at launch

API providers

downloads / mo

2.6M

license

mit

about this model

Florence-2-base is a vision-language model (VLM) that processes images and text prompts to perform captioning, object detection, segmentation, OCR, and region-level tasks. It uses a sequence-to-sequence architecture trained on the FLD-5B dataset (5.4 billion annotations across 126 million images). As a hosted API on gigarouter, it accepts OpenAI-compatible API calls—no local installation required.

Key Capabilities

Generate captions (basic, detailed, highly detailed)
Detect objects and propose regions
Perform phrase grounding and dense region captioning
OCR with and without region output
Referential expression comprehension and segmentation

Zero-Shot Performance (0.23B params)

Benchmark	Metric	Score
COCO Caption	CIDEr	133.0
NoCaps	CIDEr	118.7
TextCaps	CIDEr	70.1
COCO Detection	mAP	34.7
Flickr30k	R@1	83.6
RefCOCO val	Accuracy	53.9
RefCOCO+ val	Accuracy	51.5
RefCOCOg val	Accuracy	66.3

Fine-Tuned Variant Available

The microsoft/Florence-2-base-ft model (also hosted) improves on downstream tasks, achieving 140.0 CIDEr on COCO Caption and 79.7% accuracy on VQAv2 test-dev, among other gains.

Ideal Use Cases

Multimodal search and retrieval
Image understanding for accessibility (captioning, OCR)
Visual question answering and robotic perception
Scene understanding with object detection and segmentation

not yet live

We're benchmarking and onboarding Florence-2-base as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.