skip to content
gigarouter gigarouter
models / vision-language · coming soon

UI-AGILE 3B

KDEGroup/UI-AGILE-3B

published Aug 2025 · updated Apr 2026

UI-AGILE 3B is a vlm model that enhances GUI agents through reinforcement learning and precise inference-time grounding, based on Qwen2.5-VL.

est. price
~$0.626
/ 1k images · estimated, set at launch
API providers
0
downloads / mo
5
license
mit

specs

TaskGUI Agent Grounding
ArchitectureQwen2.5-VL
Parameters8.29B
LicenseMIT

about this model

UI-AGILE-3B is a vision-language model for GUI grounding that incorporates the UI-AGILE framework’s training and inference enhancements, built on Qwen2.5-VL. The model is designed to address common challenges in multimodal GUI agents, including ineffective rewards, visual noise, and reasoning inefficiency.

Training Enhancements

The training pipeline introduces a continuous reward function that incentivizes high-precision grounding, a “Simple Thinking” reward to balance planning depth with execution speed and grounding accuracy, and a cropping-based resampling strategy that mitigates the sparse reward problem. Without cropping-based resampling, 19.1% of training samples would provide no learning signal; in the first epoch only 61.8% of training steps are fully successful on initial attempt. Trained on approximately 9k samples for 2 epochs, the model is released under the MIT license and has been accepted to CVPR 2026 Findings.

Inference Enhancements

At inference, decomposed grounding with selection dramatically improves grounding accuracy on high-resolution displays by breaking the image into smaller, manageable parts. Standard grounding on the UI-AGILE-7B variant completes ScreenSpot-Pro in 30 minutes; the decomposed grounding stage takes 35 minutes, and VLM-based selection adds 4 minutes.

Benchmark Performance

The UI-AGILE-7B variant achieves state-of-the-art grounding performance on ScreenSpot-Pro and ScreenSpot-v2. Using both training and inference enhancements, it yields a 23% grounding accuracy improvement over the best baseline on ScreenSpot-Pro. The UI-AGILE-3B model applies the same framework to offer efficient GUI grounding in a smaller parameter footprint.

best for

FAQ

What is the base architecture of UI-AGILE 3B?

It is based on Qwen2.5-VL.

What license is the model released under?

MIT license.

How does the decomposed grounding method improve accuracy?

It breaks high-resolution images into smaller parts during inference, dramatically improving grounding accuracy on high-resolution displays.

Which benchmarks does UI-AGILE 3B excel at?

It achieves state-of-the-art grounding performance on ScreenSpot-Pro and ScreenSpot-v2.

How can I use this model via the gigarouter API?

Use the gigarouter OpenAI-compatible endpoint with an API key.

not yet live

We're benchmarking and onboarding UI-AGILE 3B as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.

related vision-language models

compare all →