skip to content
gigarouter gigarouter
models / vision-language · coming soon

Qwen2-VL-2B-Instruct

Qwen/Qwen2-VL-2B-Instruct

A popular open vision-language model, with 3.6M downloads a month. gigarouter benchmarks and hosts it as an OpenAI-compatible API.

est. price
~$0.626
/ 1k images · estimated, set at launch
API providers
0
downloads / mo
3.6M
license
apache-2.0

about this model

Qwen2-VL-2B-Instruct is a vision-language model (VLM) that processes images, multilingual text within images, and videos up to 20 minutes, and can be integrated with devices such as mobile phones and robots for visual-based decision-making.

Key Strengths

  • State-of-the-art understanding across diverse image resolutions and aspect ratios, enabled by Naive Dynamic Resolution.
  • Multimodal Rotary Position Embedding (M-ROPE) captures 1D text, 2D image, and 3D video positional information for enhanced multimodal processing.
  • Supports text in multiple languages (European languages, Japanese, Korean, Arabic, Vietnamese, etc.) inside images.
  • Capable of long-form video understanding for QA, dialog, and content creation.
Diagram of Naive Dynamic Resolution Diagram of Multimodal Rotary Position Embedding (M-ROPE)

Benchmark Performance

BenchmarkQwen2-VL-2BInternVL2-2BMiniCPM-V 2.0
MMMU41.136.338.2
DocVQA90.186.9-
InfoVQA65.558.9-
ChartQA73.576.2-
TextVQA79.773.4-
OCRBench794781605
RealWorldQA62.957.355.8
MMBench-EN74.973.269.1
MMVet49.539.741.0
HallBench41.738.036.1

Video benchmarks (Qwen2-VL-2B only): MVBench 63.2, PerceptionTest 53.9, EgoSchema 54.9, Video-MME 55.6/60.4.

This model is hosted as a managed, OpenAI-compatible API on gigarouter, eliminating setup overhead. Ideal for document analysis, visual question answering, video understanding, and multilingual OCR tasks.

not yet live

We're benchmarking and onboarding Qwen2-VL-2B-Instruct as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.