models / image-to-text · coming soon

granite-vision-3.3-2b

ibm-granite/granite-vision-3.3-2b

A popular open image-to-text model, with 343.3K downloads a month. gigarouter benchmarks and hosts it as an OpenAI-compatible API.

est. price

~$0.626

/ 1k images · estimated, set at launch

API providers

downloads / mo

343.3K

license

apache-2.0

about this model

granite-vision-3.3-2b is an image-to-text model purpose-built for visual document understanding, enabling automated extraction of content from tables, charts, infographics, plots, diagrams, and general images. It is a compact, efficient vision-language model fine-tuned from a Granite large language model (granite-3.1-2b-instruct) with a SigLIP2 vision encoder and a two-layer MLP connector. The model is optimized for enterprise applications requiring high-accuracy OCR, document QA, and chart analysis, while also supporting general visual question answering.

Key Capabilities

Document understanding: tables, charts, diagrams, infographics, and multi-page documents (up to 8 pages with recommended image resizing).
Experimental features: image segmentation, doctags generation (structured text from document images), and multi-page QA.
Enhanced safety alignment compared to prior Granite vision models.

Benchmark Performance

Evaluated on standard document and general vision benchmarks using the llms-eval framework:

Benchmark	Granite-vision-3.1-2b-preview	Granite-vision-3.2-2b	Granite-vision-3.3-2b
Document benchmarks
ChartQA	0.86	0.87	0.87
DocVQA	0.88	0.89	0.91
TextVQA	0.76	0.78	0.80
AI2D	0.78	0.76	0.77
InfoVQA	0.63	0.64	0.68
OCRBench	0.75	0.77	0.79
LiveXiv VQA v2	0.61	0.61	0.61
LiveXiv TQA v2	0.55	0.57	0.52
Other benchmarks
MMMU	0.35	0.37	0.37
VQAv2	0.81	0.78	0.79
RealWorldQA	0.65	0.63	0.63
VizWiz VQA	0.64	0.63	0.62
OK VQA	0.57	0.56	0.55

Safety Evaluations

Safety alignment scores on RTVLM and VLGuard (higher is better, scale 0-10):

RTVLM	Politics	Racial	Jailbreak	Mislead
Granite-vision-3.3-2b	8.0	8.1	7.5	8.0

VLGuard	Unsafe Images (Unsafe)	Safe Images + Unsafe Instructions
Granite-vision-3.3-2b	8.4	9.3

Model Details

Architecture: SigLIP2 vision encoder → two-layer MLP connector → granite-3.1-2b-instruct (128k context). Trained with LLaVA-style multi-layer features and AnyRes denser grid resolution.
Input: English text and images (PNG, JPEG).
License: Apache 2.0. Released June 11, 2025.
Paper: Granite Vision: a lightweight, open-source multimodal model for enterprise Intelligence (describes v3.2; v3.3 shares technical underpinnings with enhancements).

not yet live

We're benchmarking and onboarding granite-vision-3.3-2b as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.