models / vision-language · coming soon

Qwen2.5-VL-3B-Instruct

Qwen/Qwen2.5-VL-3B-Instruct

A popular open vision-language model, with 5.3M downloads a month. gigarouter benchmarks and hosts it as an OpenAI-compatible API.

est. price

~$0.626

/ 1k images · estimated, set at launch

API providers

downloads / mo

5.3M

about this model

Gigarouter hosts Qwen2.5-VL-3B-Instruct, a 3-billion-parameter vision-language model (VLM) designed for multimodal understanding and visual agent tasks. It builds on Qwen2-VL with enhancements in text/chart analysis, long-video comprehension, visual grounding, and structured output generation.

Key Capabilities

Visual understanding: Recognizes objects, texts, charts, icons, graphics, and layouts within images.
Agentic use: Functions as a visual agent capable of directing tools, computer use, and phone use.
Long video comprehension: Processes videos over one hour and pinpoints relevant segments for specific events.
Visual localization: Generates bounding boxes or points for objects with JSON coordinate outputs.
Structured outputs: Extracts content from invoices, forms, tables, and similar documents.

Architecture Highlights

Utilizes dynamic resolution and dynamic FPS sampling for video understanding, with updated mRoPE for temporal alignment. The vision encoder integrates window attention, SwiGLU, and RMSNorm for efficiency.

Benchmark Performance

Selected results on image, video, and agent benchmarks:

Benchmark	Score
MMMU (val)	53.1
DocVQA (test)	93.9
InfoVQA (test)	77.1
MathVista (testmini)	62.3
VideoMME (no subs)	67.6
MLVU	68.2
ScreenSpot	55.5
AndroidWorld (SR)	90.8
MobileMiniWob++ (SR)	67.9

For detailed comparisons with peer models (InternVL2.5-4B, Qwen2-VL-7B), see the full model card. Further information: Blog and GitHub.

not yet live

We're benchmarking and onboarding Qwen2.5-VL-3B-Instruct as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.