[2026.5]: 🤩 Our training dataset GUI-C2-4K is released on HuggingFace.
[2026.5]: 🤩 Our paper is released on arXiv.
Existing agentic reinforcement learning methods for GUI grounding are limited at two levels. At the data level, current approaches typically treat all training samples equally, even though a sample's training value to the baseline model varies with its difficulty. Ignoring this variation can greatly reduce training efficiency or even cause training to collapse. At the strategy level, existing frameworks struggle to balance cropping larger regions, which preserves context, against cropping smaller ones, which reduces redundancy; this tension is inherent to tool-augmented grounding agents. In addition, overly complex decision-making is difficult for small models and significantly increases inference time. To address the data-level issue, we propose GUI-D, a data mining and difficulty scoring pipeline that identifies training-worthy samples by testing them and assigns each a difficulty score that guides its weight in subsequent training. To address the strategy-level issue, we propose GUI-C2, which employs an area-gated coarse-to-fine refinement mechanism that progressively narrows the visual field using model-internal uncertainty signals, adaptively preserving context for large targets while increasing precision for small ones. Improvement-aware stage rewards reinforce this mechanism by ensuring that each refinement step genuinely improves grounding. We also simplify the decision-making process, which greatly reduces the added inference time. Extensive experiments show that our method achieves state-of-the-art performance, with GUI-C2-3B and GUI-C2-7B reaching average scores of 46.4 and 50.8 on ScreenSpot-Pro. Code will be made publicly available.
GUI-D is a lightweight data curation pipeline that removes low-value GUI grounding samples and assigns each remaining sample a difficulty score for GRPO training.
Figure 2: Overview of GUI-D data curation pipeline.
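The filtering and scoring idea behind GUI-D can be illustrated with a minimal sketch. The exact testing procedure is not described here, so everything below is an assumption made for illustration: the function `score_samples`, the use of repeated baseline rollouts, the two drop rules, and the `1 - pass_rate` difficulty score are all hypothetical, not the authors' implementation.

```python
from dataclasses import dataclass


@dataclass
class ScoredSample:
    sample_id: str
    difficulty: float  # in [0, 1]; higher means harder for the baseline


def score_samples(hits_by_sample, drop_always_solved=True, drop_never_solved=True):
    """Filter samples and assign a difficulty score from baseline rollouts.

    hits_by_sample maps a sample id to a list of booleans, one per baseline
    rollout, telling whether the predicted click landed inside the target.
    Both drop rules and the 1 - pass_rate score are assumptions of this sketch.
    """
    kept = []
    for sample_id, hits in hits_by_sample.items():
        if not hits:
            continue  # no rollouts recorded, nothing to score
        pass_rate = sum(hits) / len(hits)
        if drop_always_solved and pass_rate == 1.0:
            continue  # baseline already solves it: little training signal
        if drop_never_solved and pass_rate == 0.0:
            continue  # never solved: possibly mislabeled or out of reach
        kept.append(ScoredSample(sample_id, 1.0 - pass_rate))
    return kept


if __name__ == "__main__":
    rollouts = {
        "easy": [True, True, True, True],
        "hard": [True, False, False, False],
        "unsolved": [False, False, False, False],
    }
    for sample in score_samples(rollouts):
        print(sample.sample_id, sample.difficulty)
```

In a setup like this, the resulting difficulty score could later be used to weight each sample during GRPO training, which is the role the pipeline description assigns to it.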
GUI-C2 reduces the decision-making process to predicting the size of the target bounding box, which serves as both a refinement trigger and a dense supervision signal.
Figure 3: Overall framework of GUI-C2. The left panel illustrates the coarse-to-fine policy and reward design. The right panel shows difficulty-aware GRPO and auxiliary refinement rewards.
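To make the "box size as refinement trigger" idea concrete, here is a minimal sketch of an area-gated decision step. It is an illustration under stated assumptions, not the released implementation: the function names, the `area_threshold` of 1% of the image, and the fixed `crop_ratio` of 0.5 are all hypothetical, and the model-internal uncertainty signals mentioned in the abstract are not modeled.

```python
def area_ratio(box, image_size):
    """Fraction of the image covered by box = (x1, y1, x2, y2) in pixels."""
    x1, y1, x2, y2 = box
    width, height = image_size
    return max(0, x2 - x1) * max(0, y2 - y1) / (width * height)


def crop_window(box, image_size, crop_ratio):
    """Window of crop_ratio * image size centered on the box, clamped to the image."""
    x1, y1, x2, y2 = box
    width, height = image_size
    crop_w, crop_h = width * crop_ratio, height * crop_ratio
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    left = min(max(cx - crop_w / 2, 0), width - crop_w)
    top = min(max(cy - crop_h / 2, 0), height - crop_h)
    return (left, top, left + crop_w, top + crop_h)


def next_action(box, image_size, area_threshold=0.01, crop_ratio=0.5):
    """Return ("crop", window) for small predicted targets, else ("click", center).

    area_threshold and crop_ratio are illustrative values, not taken from the paper.
    """
    if area_ratio(box, image_size) < area_threshold:
        return "crop", crop_window(box, image_size, crop_ratio)
    x1, y1, x2, y2 = box
    return "click", ((x1 + x2) / 2, (y1 + y2) / 2)


if __name__ == "__main__":
    screen = (1000, 1000)
    print(next_action((100, 100, 140, 120), screen))  # small target: crop first
    print(next_action((200, 200, 600, 500), screen))  # large target: click directly
```

Repeating this step on the cropped view until the predicted box is large enough, or a maximum number of crop stages is reached, gives a coarse-to-fine loop in the spirit of the description above.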
Comparison of different agent models on ScreenSpot-Pro. Bold and underlined results indicate the best and second-best performance, respectively. Each of the six categories (CAD, Scientific, Creative, Development, Office, and OS) is split into icon-type and text-type columns.
| Model | CAD icon | CAD text | Scientific icon | Scientific text | Creative icon | Creative text | Develop. icon | Develop. text | Office icon | Office text | OS icon | OS text | Avg. |
| Proprietary Models | |||||||||||||
| GPT-4o | 2.0 | 0.0 | 1.3 | 0.0 | 1.0 | 0.0 | 2.1 | 0.0 | 1.1 | 0.0 | 0.0 | 0.0 | 0.8 |
| Claude Computer Use | 14.5 | 3.7 | 22.0 | 3.9 | 25.9 | 3.4 | 33.9 | 15.8 | 30.1 | 16.3 | 11.0 | 4.5 | 17.1 |
| General Open-source Models | |||||||||||||
| Qwen2.5-VL-3B | 9.1 | 7.3 | 22.1 | 1.4 | 26.8 | 2.1 | 38.2 | 7.3 | 33.9 | 15.1 | 10.3 | 1.1 | 16.1 |
| Qwen2.5-VL-7B | 16.8 | 1.6 | 46.8 | 4.1 | 35.9 | 7.7 | 49.3 | 7.3 | 52.5 | 20.8 | 37.4 | 6.7 | 26.8 |
| GUI-Specific Models (SFT+RL) | |||||||||||||
| CogAgent-18B | 7.1 | 3.1 | 14.9 | 0.7 | 9.6 | 0.0 | 22.2 | 1.8 | 13.0 | 0.0 | 5.6 | 0.0 | 7.7 |
| OS-Atlas-7B | 12.2 | 4.7 | 33.1 | 1.4 | 28.8 | 2.8 | 37.5 | 7.3 | 33.9 | 5.7 | 27.1 | 4.5 | 18.9 |
| ShowUI-2B | 2.5 | 0.0 | 16.9 | 1.4 | 9.1 | 0.0 | 13.2 | 7.3 | 15.3 | 7.5 | 10.3 | 2.2 | 7.7 |
| UGround-7B | 14.2 | 1.6 | 26.6 | 2.1 | 27.3 | 2.8 | 31.9 | 2.7 | 31.6 | 11.3 | 17.8 | 0.0 | 16.5 |
| UGround-V1-7B | 15.8 | 1.2 | 51.9 | 2.8 | 47.5 | 9.7 | 57.6 | 14.5 | 60.5 | 13.2 | 38.3 | 7.9 | 31.1 |
| UI-TARS-2B | 17.8 | 4.7 | 47.4 | 4.1 | 42.9 | 6.3 | 56.9 | 17.3 | 50.3 | 17.0 | 21.5 | 5.6 | 27.7 |
| UI-TARS-7B | 20.8 | 9.4 | 58.4 | 12.4 | 50.0 | 9.1 | 63.9 | 31.8 | 63.3 | 20.8 | 30.8 | 16.9 | 35.7 |
| InfiGUI-R1-3B | 33.0 | 14.1 | 51.3 | 12.4 | 44.9 | 7.0 | 58.3 | 20.0 | 65.5 | 28.3 | 43.9 | 12.4 | 35.7 |
| GUI-Specific Models (RL Only) | |||||||||||||
| UI-R1-3B | 11.2 | 6.3 | 22.7 | 4.1 | 27.3 | 3.5 | 42.4 | 11.8 | 32.2 | 11.3 | 13.1 | 4.5 | 17.8 |
| GUI-R1-3B | 26.4 | 7.8 | 33.8 | 4.8 | 40.9 | 5.6 | 61.8 | 17.3 | 53.6 | 17.0 | 28.1 | 5.6 | 30.2 |
| GUI-R1-7B | 23.9 | 6.3 | 49.4 | 4.8 | 38.9 | 8.4 | 55.6 | 11.8 | 58.7 | 26.4 | 42.1 | 16.9 | 32.4 |
| SE-GUI-3B | 38.1 | 12.5 | 55.8 | 7.6 | 47.0 | 4.9 | 61.8 | 16.4 | 59.9 | 24.5 | 40.2 | 12.4 | 35.9 |
| GUI-G1-3B | 39.6 | 9.4 | 50.7 | 10.3 | 36.6 | 11.9 | 61.8 | 30.0 | 67.2 | 32.1 | 23.5 | 10.6 | 37.1 |
| GUI-Eyes-3B | 48.2 | 9.4 | 70.8 | 12.4 | 56.6 | 13.3 | 69.4 | 19.1 | 75.7 | 24.5 | 59.8 | 20.2 | 44.8 |
| GUI-C2-3B (Ours) | 43.7 | 21.9 | 72.1 | 19.3 | 56.6 | 14.7 | 68.8 | 21.8 | 74.6 | 37.7 | 57.9 | 27.0 | 46.4 |
Table 1: Performance comparison on ScreenSpot-Pro. CAD, Scientific, Creative, Development, Office, and OS are sub-datasets; icon/text columns denote the target type within each sub-dataset.
Comparison of model performance on ScreenSpot (SS) and ScreenSpot-v2 (SSV2). Bold and underlined results indicate the best and second-best performance, respectively.
| Model | Train Samples | SS Mobile | SS Desktop | SS Web | SS Avg. | SSV2 Mobile | SSV2 Desktop | SSV2 Web | SSV2 Avg. |
| Proprietary Models | |||||||||
| GPT-4o | - | 21.9 | 17.8 | 9.4 | 18.8 | 22.5 | 22.2 | 12.4 | 20.1 |
| General Open-source Models | |||||||||
| Qwen2-VL-7B | - | 50.3 | 40.4 | 27.4 | 42.9 | 39.4 | 50.1 | 27.7 | 39.8 |
| Qwen2.5-VL-3B | - | - | - | - | 55.5 | 55.5 | 44.0 | 39.1 | 46.9 |
| Qwen2.5-VL-7B | - | - | - | - | 84.7 | 92.8 | 78.4 | 85.4 | 86.5 |
| GUI-Specific Models | |||||||||
| CogAgent-18B | 222M | 57.8 | 31.6 | 40.1 | 47.4 | 50.6 | 51.6 | 54.1 | 52.8 |
| SeeClick-7B | 1M | 68.1 | 48.8 | 41.8 | 53.4 | 51.8 | 65.5 | 40.7 | 53.9 |
| UGround-7B | 10M | 75.9 | 75.8 | 78.3 | 73.3 | 74.3 | 74.9 | 78.6 | 76.3 |
| ShowUI-2B | 256K | 84.8 | 70.8 | 76.2 | 75.1 | 70.0 | 85.1 | 73.3 | 77.3 |
| OS-Atlas-4B | 13M | 56.2 | 74.9 | 69.9 | 68.5 | 74.9 | 56.9 | 70.7 | 68.5 |
| OS-Atlas-7B | 13M | 85.0 | 78.8 | 84.5 | 82.5 | 78.3 | 85.5 | 83.8 | 83.3 |
| Aguvis-7B | 1M | 86.9 | 82.4 | 84.7 | 84.4 | 89.6 | 86.8 | 84.9 | 87.3 |
| UI-TARS-2B | 2M | 85.0 | 81.4 | 79.8 | 82.3 | 87.9 | 81.4 | 82.9 | 84.7 |
| GUI-C2-3B (Ours) | 4.6K | 85.5 | 87.1 | 85.1 | 85.8 | 88.2 | 88.6 | 86.7 | 87.8 |
Table 2: Comparison on ScreenSpot and ScreenSpot-v2. Bold = best, underline = second best.
Performance comparison on ScreenSpot-Pro. All methods use Qwen2/2.5-VL-7B as the base model.
| Model | ScreenSpot-Pro Avg. |
| Qwen2.5-VL-7B | 26.8 |
| GUI-R1-7B | 32.4 |
| JEDI-7B | 39.5 |
| GUI-Actor-7B | 44.6 |
| SE-GUI-7B | 47.3 |
| GUI-G2-7B | 47.5 |
| OpenCUA-7B | 50.0 |
| GTA1-7B | 50.1 |
| GUI-C2-7B (Ours) | 50.8 |
Table 3: GUI-C2-7B on ScreenSpot-Pro.
Ablation experiments on ScreenSpot-Pro (GUI-C2-3B).
| Variant | Inference Time (s) | Avg. |
| GUI-C2 (Full) | 3.05 | 46.4 |
| w/o tool use | 1.50 | 37.0 |
| w/o coarse-to-fine policy | 2.42 | 41.4 |
| w/o difficulty-aware | 3.05 | 43.7 |
| w/ self-action-decide | 24.81 | 43.5 |
| adaptive crop ratio | 3.05 | 43.8 |
| w/ self-action-ratio-decide | 32.71 | 43.4 |
Table 4: Ablation experiments on ScreenSpot-Pro.
Figure 4: Hyperparameter study on ScreenSpot-Pro.
Figure 5: Comparison of different maximum allowable crop stages on GUI-C2-3B.
From left to right, the three examples show our test results under the two-crop, one-crop, and direct-click actions, respectively. The two-crop and one-crop results are correct, while the direct-click result is incorrect.
Figure 6: Visualization analysis.
Performance comparison of different reward function parameters on ScreenSpot-Pro (GUI-C2-3B).
| λclick | λiou | λfmt | Avg. |
| 0.6 | 0.3 | 0.1 | 46.4 |
| 0.5 | 0.4 | 0.1 | 45.7 |
| 0.7 | 0.2 | 0.1 | 46.1 |
Table 5: Reward function parameter study on ScreenSpot-Pro.
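The table above lists only the weights, so the following sketch shows one plausible way they could combine. It assumes the total reward is a weighted sum of a binary click-hit term, a box IoU term, and a binary format term; this form is inferred from the parameter names and from the fact that each row sums to 1, and the helper names are hypothetical.

```python
def click_hit(point, gt_box):
    """1.0 if the click point lies inside the ground-truth box, else 0.0."""
    x, y = point
    x1, y1, x2, y2 = gt_box
    return 1.0 if x1 <= x <= x2 and y1 <= y <= y2 else 0.0


def box_iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def total_reward(r_click, r_iou, r_fmt, w_click=0.6, w_iou=0.3, w_fmt=0.1):
    """Weighted sum of the three terms; defaults follow the first row of Table 5."""
    return w_click * r_click + w_iou * r_iou + w_fmt * r_fmt


if __name__ == "__main__":
    gt = (5, 0, 15, 10)
    pred_box, pred_point = (0, 0, 10, 10), (7, 5)
    reward = total_reward(click_hit(pred_point, gt), box_iou(pred_box, gt), 1.0)
    print(round(reward, 3))
```

Under this assumed form, shifting weight from the click term to the IoU term (the second row) trades a sparse hit-or-miss signal for a denser localization signal, which matches the small differences in average score reported in the table.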
@misc{li2026guic2coarsetofineguigrounding,
title={GUI-C$^2$: Coarse-to-Fine GUI Grounding via Difficulty-Aware Reinforcement Learning},
author={Junlong Li and Chao Hao and Lap-Pui Chau and Yi Wang},
year={2026},
eprint={2605.30884},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2605.30884},
}