UI-AGILE 3B

KDEGroup/UI-AGILE-3B

published Aug 2025 · updated Apr 2026

UI-AGILE 3B is a vlm model that enhances GUI agents through reinforcement learning and precise inference-time grounding, based on Qwen2.5-VL.

est. price

~$0.626

/ 1k images · estimated, set at launch

API providers

downloads / mo

license

mit

specs

Task	GUI Agent Grounding
Architecture	Qwen2.5-VL
Parameters	8.29B
License	MIT

about this model

UI-AGILE-3B is a vision-language model for GUI grounding that incorporates the UI-AGILE framework’s training and inference enhancements, built on Qwen2.5-VL. The model is designed to address common challenges in multimodal GUI agents, including ineffective rewards, visual noise, and reasoning inefficiency.

Training Enhancements

The training pipeline introduces a continuous reward function that incentivizes high-precision grounding, a “Simple Thinking” reward to balance planning depth with execution speed and grounding accuracy, and a cropping-based resampling strategy that mitigates the sparse reward problem. Without cropping-based resampling, 19.1% of training samples would provide no learning signal; in the first epoch only 61.8% of training steps are fully successful on initial attempt. Trained on approximately 9k samples for 2 epochs, the model is released under the MIT license and has been accepted to CVPR 2026 Findings.

Inference Enhancements

At inference, decomposed grounding with selection dramatically improves grounding accuracy on high-resolution displays by breaking the image into smaller, manageable parts. Standard grounding on the UI-AGILE-7B variant completes ScreenSpot-Pro in 30 minutes; the decomposed grounding stage takes 35 minutes, and VLM-based selection adds 4 minutes.

Benchmark Performance

The UI-AGILE-7B variant achieves state-of-the-art grounding performance on ScreenSpot-Pro and ScreenSpot-v2. Using both training and inference enhancements, it yields a 23% grounding accuracy improvement over the best baseline on ScreenSpot-Pro. The UI-AGILE-3B model applies the same framework to offer efficient GUI grounding in a smaller parameter footprint.

best for

·High-precision GUI grounding on high-resolution displays
·Automated GUI agent tasks like clicking or selecting UI elements from natural language instructions
·Improving grounding accuracy in complex GUI environments

FAQ

What is the base architecture of UI-AGILE 3B?

It is based on Qwen2.5-VL.

What license is the model released under?

MIT license.

How does the decomposed grounding method improve accuracy?

It breaks high-resolution images into smaller parts during inference, dramatically improving grounding accuracy on high-resolution displays.

Which benchmarks does UI-AGILE 3B excel at?

It achieves state-of-the-art grounding performance on ScreenSpot-Pro and ScreenSpot-v2.

How can I use this model via the gigarouter API?

Use the gigarouter OpenAI-compatible endpoint with an API key.

not yet live

We're benchmarking and onboarding UI-AGILE 3B as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.

related vision-language models

compare all →

Qwen2.5-VL-7B-Instruct

9.8M dl/mo

Qwen3.6-35B-A3B-FP8

6.2M dl/mo

Qwen2.5-VL-3B-Instruct

5.3M dl/mo

gemma-4-26B-A4B-it-AWQ-4bit