skip to content
gigarouter gigarouter
models / image-to-text · coming soon

granite-vision-3.3-2b

ibm-granite/granite-vision-3.3-2b

A popular open image-to-text model, with 343.3K downloads a month. gigarouter benchmarks and hosts it as an OpenAI-compatible API.

est. price
~$0.626
/ 1k images · estimated, set at launch
API providers
0
downloads / mo
343.3K
license
apache-2.0

about this model

granite-vision-3.3-2b is an image-to-text model purpose-built for visual document understanding, enabling automated extraction of content from tables, charts, infographics, plots, diagrams, and general images. It is a compact, efficient vision-language model fine-tuned from a Granite large language model (granite-3.1-2b-instruct) with a SigLIP2 vision encoder and a two-layer MLP connector. The model is optimized for enterprise applications requiring high-accuracy OCR, document QA, and chart analysis, while also supporting general visual question answering.

Key Capabilities

  • Document understanding: tables, charts, diagrams, infographics, and multi-page documents (up to 8 pages with recommended image resizing).
  • Experimental features: image segmentation, doctags generation (structured text from document images), and multi-page QA.
  • Enhanced safety alignment compared to prior Granite vision models.

Benchmark Performance

Evaluated on standard document and general vision benchmarks using the llms-eval framework:

BenchmarkGranite-vision-3.1-2b-previewGranite-vision-3.2-2bGranite-vision-3.3-2b
Document benchmarks
ChartQA0.860.870.87
DocVQA0.880.890.91
TextVQA0.760.780.80
AI2D0.780.760.77
InfoVQA0.630.640.68
OCRBench0.750.770.79
LiveXiv VQA v20.610.610.61
LiveXiv TQA v20.550.570.52
Other benchmarks
MMMU0.350.370.37
VQAv20.810.780.79
RealWorldQA0.650.630.63
VizWiz VQA0.640.630.62
OK VQA0.570.560.55

Safety Evaluations

Safety alignment scores on RTVLM and VLGuard (higher is better, scale 0-10):

RTVLMPoliticsRacialJailbreakMislead
Granite-vision-3.3-2b8.08.17.58.0
VLGuardUnsafe Images (Unsafe)Safe Images + Unsafe Instructions
Granite-vision-3.3-2b8.49.3

Model Details

  • Architecture: SigLIP2 vision encoder → two-layer MLP connector → granite-3.1-2b-instruct (128k context). Trained with LLaVA-style multi-layer features and AnyRes denser grid resolution.
  • Input: English text and images (PNG, JPEG).
  • License: Apache 2.0. Released June 11, 2025.
  • Paper: Granite Vision: a lightweight, open-source multimodal model for enterprise Intelligence (describes v3.2; v3.3 shares technical underpinnings with enhancements).
not yet live

We're benchmarking and onboarding granite-vision-3.3-2b as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.