Qwen2.5-VL-7B-Instruct
Qwen/Qwen2.5-VL-7B-Instruct
A popular open vision-language model, with 9.8M downloads a month. gigarouter benchmarks and hosts it as an OpenAI-compatible API.
about this model
Qwen2.5-VL-7B-Instruct is a vision-language model hosted on gigarouter as an OpenAI-compatible API. It processes images, videos, and text to perform tasks such as visual question answering, document understanding, visual localization, and structured output generation.
Capabilities
- Visual understanding — recognizes objects, text, charts, icons, and layouts within images.
- Agentic behavior — can reason and dynamically direct tools, enabling computer and phone use.
- Long video comprehension — understands videos over 1 hour and can pinpoint specific events with temporal localization.
- Visual localization — outputs bounding boxes or points for objects, with stable JSON for coordinates and attributes.
- Structured outputs — extracts structured data from invoices, forms, and tables for finance and commerce applications.
Architecture
The model extends dynamic resolution to the temporal dimension via dynamic FPS sampling and updates mRoPE with absolute time alignment. Its vision encoder uses window attention, SwiGLU, and RMSNorm, aligned with the Qwen2.5 LLM backbone.

Benchmark Performance
Selected results on standard benchmarks:
| Benchmark | Score |
|---|---|
| DocVQA (test) | 95.7 |
| ChartQA (test) | 87.3 |
| MathVista (testmini) | 68.2 |
| OCRBench | 864 |
| Video-MME (w/ subs) | 71.6 |
| MVBench | 69.6 |
| ScreenSpot | 84.7 |
| Android Control (Low EM) | 93.7 |
On image benchmarks, Qwen2.5-VL-7B outperforms comparable models (InternVL2.5-8B, GPT-4o-mini, Qwen2-VL-7B) on tasks including MMMU-Pro, DocVQA, InfoVQA, ChartQA, MMVet, MathVista, and OCRBench. For video, it surpasses its predecessor on MVBench, PerceptionTest, and Video-MME. Agent benchmarks confirm strong performance in screen grounding and mobile control.
We're benchmarking and onboarding Qwen2.5-VL-7B-Instruct as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.