UI-AGILE 3B
KDEGroup/UI-AGILE-3B
published Aug 2025 · updated Apr 2026
UI-AGILE 3B is a vlm model that enhances GUI agents through reinforcement learning and precise inference-time grounding, based on Qwen2.5-VL.
specs
| Task | GUI Agent Grounding |
| Architecture | Qwen2.5-VL |
| Parameters | 8.29B |
| License | MIT |
about this model
UI-AGILE-3B is a vision-language model for GUI grounding that incorporates the UI-AGILE framework’s training and inference enhancements, built on Qwen2.5-VL. The model is designed to address common challenges in multimodal GUI agents, including ineffective rewards, visual noise, and reasoning inefficiency.
Training Enhancements
The training pipeline introduces a continuous reward function that incentivizes high-precision grounding, a “Simple Thinking” reward to balance planning depth with execution speed and grounding accuracy, and a cropping-based resampling strategy that mitigates the sparse reward problem. Without cropping-based resampling, 19.1% of training samples would provide no learning signal; in the first epoch only 61.8% of training steps are fully successful on initial attempt. Trained on approximately 9k samples for 2 epochs, the model is released under the MIT license and has been accepted to CVPR 2026 Findings.
Inference Enhancements
At inference, decomposed grounding with selection dramatically improves grounding accuracy on high-resolution displays by breaking the image into smaller, manageable parts. Standard grounding on the UI-AGILE-7B variant completes ScreenSpot-Pro in 30 minutes; the decomposed grounding stage takes 35 minutes, and VLM-based selection adds 4 minutes.
Benchmark Performance
The UI-AGILE-7B variant achieves state-of-the-art grounding performance on ScreenSpot-Pro and ScreenSpot-v2. Using both training and inference enhancements, it yields a 23% grounding accuracy improvement over the best baseline on ScreenSpot-Pro. The UI-AGILE-3B model applies the same framework to offer efficient GUI grounding in a smaller parameter footprint.
best for
- ·High-precision GUI grounding on high-resolution displays
- ·Automated GUI agent tasks like clicking or selecting UI elements from natural language instructions
- ·Improving grounding accuracy in complex GUI environments
FAQ
It is based on Qwen2.5-VL.
MIT license.
It breaks high-resolution images into smaller parts during inference, dramatically improving grounding accuracy on high-resolution displays.
It achieves state-of-the-art grounding performance on ScreenSpot-Pro and ScreenSpot-v2.
Use the gigarouter OpenAI-compatible endpoint with an API key.
We're benchmarking and onboarding UI-AGILE 3B as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.