Florence-2-base
microsoft/Florence-2-base
A popular open vision-language model, with 2.6M downloads a month. gigarouter benchmarks and hosts it as an OpenAI-compatible API.
about this model
Florence-2-base is a vision-language model (VLM) that processes images and text prompts to perform captioning, object detection, segmentation, OCR, and region-level tasks. It uses a sequence-to-sequence architecture trained on the FLD-5B dataset (5.4 billion annotations across 126 million images). As a hosted API on gigarouter, it accepts OpenAI-compatible API calls—no local installation required.
Key Capabilities
- Generate captions (basic, detailed, highly detailed)
- Detect objects and propose regions
- Perform phrase grounding and dense region captioning
- OCR with and without region output
- Referential expression comprehension and segmentation
Zero-Shot Performance (0.23B params)
| Benchmark | Metric | Score |
|---|---|---|
| COCO Caption | CIDEr | 133.0 |
| NoCaps | CIDEr | 118.7 |
| TextCaps | CIDEr | 70.1 |
| COCO Detection | mAP | 34.7 |
| Flickr30k | R@1 | 83.6 |
| RefCOCO val | Accuracy | 53.9 |
| RefCOCO+ val | Accuracy | 51.5 |
| RefCOCOg val | Accuracy | 66.3 |
Fine-Tuned Variant Available
The microsoft/Florence-2-base-ft model (also hosted) improves on downstream tasks, achieving 140.0 CIDEr on COCO Caption and 79.7% accuracy on VQAv2 test-dev, among other gains.
Ideal Use Cases
- Multimodal search and retrieval
- Image understanding for accessibility (captioning, OCR)
- Visual question answering and robotic perception
- Scene understanding with object detection and segmentation
We're benchmarking and onboarding Florence-2-base as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.