GTA1 7B
Salesforce/GTA1-7B
published Oct 2025 · updated Oct 2025
GTA1 7B is a vision-language model optimized for GUI grounding and agent tasks, using reinforcement learning (GRPO) to produce accurate click predictions from screenshots.
specs
| Task | GUI Grounding & Agent |
| Architecture | Qwen2-VL-based |
| Parameters | 7B |
about this model
GTA1-7B is a vision-language model for GUI grounding and agent task execution, trained using reinforcement learning (GRPO) to optimize for successful clicks on interface elements rather than verbose chain-of-thought reasoning. It is part of the GTA1 family that achieves state-of-the-art results on both grounding and agent benchmarks.
Grounding Performance
Evaluated on three standard benchmarks, GTA1 (7B) outperforms all open-source models at its size and competes with much larger proprietary systems:
| Model | ScreenSpot-V2 | ScreenSpotPro | OSWORLD-G | OSWORLD-G-Refined |
|---|---|---|---|---|
| GTA1 (7B) | 93.4 | 55.5 | 60.1 | 68.8 |
| OpenCUA (7B) | 92.3 | 50.0 | 55.3 | 68.3 |
| UI-TARS-1.5* (7B) | 89.7 | 42.0 | 52.8 | 64.2 |
On the ScreenSpotPro leaderboard (applications under 12B parameters), GTA1-7B achieves a micro-average of 55.5 and is ranked #10 overall. Per-application results include: Android Studio macOS 48.8, AutoCAD Windows 41.2, Blender Windows 57.7, DaVinci Resolve macOS 56.8.
Agent Task Execution
When paired with a planner (e.g., o3 or GPT-5), GTA1-7B achieves strong results on end-to-end agent benchmarks:
| Agent model | OSWorld | OSWorld-Verified |
|---|---|---|
| GTA1-7B-2507 w/ o3 | 45.2 | 53.1 |
| GTA1-7B-2507 w/ GPT-5 | — | 61.0 |
| UI-TARS-1.5-7B | 26.9 | 27.4 |
On WindowsAgentArena, GTA1-7B-2507 with o3 achieves a 47.9% success rate (100 steps), and with GPT-5 reaches 49.2%.
The model is hosted as a managed, OpenAI-compatible API on gigarouter, requiring no local installation or infrastructure.
best for
- ·Automated UI testing with precise click targeting
- ·GUI agent task execution on desktop and web environments
FAQ
The model takes an image (screenshot) and a text instruction, such as "click start". It outputs a predicted click coordinate (x, y).
It is fine-tuned using GRPO (Group Relative Policy Optimization), which rewards any click within the target bounding box rather than forcing exact center prediction.
Use the gigarouter OpenAI-compatible endpoint with your API key, passing an image and instruction in the standard chat completion format.
We're benchmarking and onboarding GTA1 7B as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.