skip to content
gigarouter gigarouter
models / vision-language · coming soon

Florence-2-base

microsoft/Florence-2-base

A popular open vision-language model, with 2.6M downloads a month. gigarouter benchmarks and hosts it as an OpenAI-compatible API.

est. price
~$0.094
/ 1k images · estimated, set at launch
API providers
0
downloads / mo
2.6M
license
mit

about this model

Florence-2-base is a vision-language model (VLM) that processes images and text prompts to perform captioning, object detection, segmentation, OCR, and region-level tasks. It uses a sequence-to-sequence architecture trained on the FLD-5B dataset (5.4 billion annotations across 126 million images). As a hosted API on gigarouter, it accepts OpenAI-compatible API calls—no local installation required.

Key Capabilities

  • Generate captions (basic, detailed, highly detailed)
  • Detect objects and propose regions
  • Perform phrase grounding and dense region captioning
  • OCR with and without region output
  • Referential expression comprehension and segmentation

Zero-Shot Performance (0.23B params)

BenchmarkMetricScore
COCO CaptionCIDEr133.0
NoCapsCIDEr118.7
TextCapsCIDEr70.1
COCO DetectionmAP34.7
Flickr30kR@183.6
RefCOCO valAccuracy53.9
RefCOCO+ valAccuracy51.5
RefCOCOg valAccuracy66.3

Fine-Tuned Variant Available

The microsoft/Florence-2-base-ft model (also hosted) improves on downstream tasks, achieving 140.0 CIDEr on COCO Caption and 79.7% accuracy on VQAv2 test-dev, among other gains.

Ideal Use Cases

  • Multimodal search and retrieval
  • Image understanding for accessibility (captioning, OCR)
  • Visual question answering and robotic perception
  • Scene understanding with object detection and segmentation
not yet live

We're benchmarking and onboarding Florence-2-base as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.