granite-vision-3.3-2b
ibm-granite/granite-vision-3.3-2b
A popular open image-to-text model, with 343.3K downloads a month. gigarouter benchmarks and hosts it as an OpenAI-compatible API.
about this model
granite-vision-3.3-2b is an image-to-text model purpose-built for visual document understanding, enabling automated extraction of content from tables, charts, infographics, plots, diagrams, and general images. It is a compact, efficient vision-language model fine-tuned from a Granite large language model (granite-3.1-2b-instruct) with a SigLIP2 vision encoder and a two-layer MLP connector. The model is optimized for enterprise applications requiring high-accuracy OCR, document QA, and chart analysis, while also supporting general visual question answering.
Key Capabilities
- Document understanding: tables, charts, diagrams, infographics, and multi-page documents (up to 8 pages with recommended image resizing).
- Experimental features: image segmentation, doctags generation (structured text from document images), and multi-page QA.
- Enhanced safety alignment compared to prior Granite vision models.
Benchmark Performance
Evaluated on standard document and general vision benchmarks using the llms-eval framework:
| Benchmark | Granite-vision-3.1-2b-preview | Granite-vision-3.2-2b | Granite-vision-3.3-2b |
|---|---|---|---|
| Document benchmarks | |||
| ChartQA | 0.86 | 0.87 | 0.87 |
| DocVQA | 0.88 | 0.89 | 0.91 |
| TextVQA | 0.76 | 0.78 | 0.80 |
| AI2D | 0.78 | 0.76 | 0.77 |
| InfoVQA | 0.63 | 0.64 | 0.68 |
| OCRBench | 0.75 | 0.77 | 0.79 |
| LiveXiv VQA v2 | 0.61 | 0.61 | 0.61 |
| LiveXiv TQA v2 | 0.55 | 0.57 | 0.52 |
| Other benchmarks | |||
| MMMU | 0.35 | 0.37 | 0.37 |
| VQAv2 | 0.81 | 0.78 | 0.79 |
| RealWorldQA | 0.65 | 0.63 | 0.63 |
| VizWiz VQA | 0.64 | 0.63 | 0.62 |
| OK VQA | 0.57 | 0.56 | 0.55 |
Safety Evaluations
Safety alignment scores on RTVLM and VLGuard (higher is better, scale 0-10):
| RTVLM | Politics | Racial | Jailbreak | Mislead |
|---|---|---|---|---|
| Granite-vision-3.3-2b | 8.0 | 8.1 | 7.5 | 8.0 |
| VLGuard | Unsafe Images (Unsafe) | Safe Images + Unsafe Instructions |
|---|---|---|
| Granite-vision-3.3-2b | 8.4 | 9.3 |
Model Details
- Architecture: SigLIP2 vision encoder → two-layer MLP connector → granite-3.1-2b-instruct (128k context). Trained with LLaVA-style multi-layer features and AnyRes denser grid resolution.
- Input: English text and images (PNG, JPEG).
- License: Apache 2.0. Released June 11, 2025.
- Paper: Granite Vision: a lightweight, open-source multimodal model for enterprise Intelligence (describes v3.2; v3.3 shares technical underpinnings with enhancements).
We're benchmarking and onboarding granite-vision-3.3-2b as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.