skip to content
gigarouter gigarouter
models / image-to-text · coming soon

kosmos-2-patch14-224

microsoft/kosmos-2-patch14-224

A popular open image-to-text model, with 166.7K downloads a month. gigarouter benchmarks and hosts it as an OpenAI-compatible API.

est. price
~$0.626
/ 1k images · estimated, set at launch
API providers
0
downloads / mo
166.7K
license
mit

about this model

Microsoft Kosmos-2 is a multimodal large language model for image-to-text tasks that grounds its outputs to specific regions in an input image. It can generate captions, answer visual questions, and perform referring expression comprehension—all while linking phrases to bounding box coordinates. This allows models to not only describe what is in an image but also indicate where each described entity is located.

Key capabilities

  • Multimodal grounding: Phrase grounding and referring expression comprehension, where the model localizes objects mentioned in text.
  • Grounded visual question answering (VQA): Answers questions about an image and returns bounding boxes for relevant objects.
  • Referring expression generation: Given an object region (bounding box), produces a natural language description of that object.
  • Grounded image captioning: Generates brief or detailed captions with spatially aligned entity boxes.

Example output

An image of a snowman warming himself by a fire.

Given the image above, Kosmos-2 can produce the caption “An image of a snowman warming himself by a fire.” and output bounding boxes for “a snowman” and “a fire”.

Kosmos-2 is designed for applications that require fine-grained visual understanding with spatial grounding, such as interactive image analysis, grounded dialogue systems, and visual reasoning. As a hosted API on gigarouter, the model is available through an OpenAI-compatible endpoint, eliminating the need for local setup or GPU infrastructure.

not yet live

We're benchmarking and onboarding kosmos-2-patch14-224 as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.