GUI-C2

Coarse-to-Fine GUI Grounding via Difficulty-Aware Reinforcement Learning

†,1Junlong Li, †,1Chao Hao, 1Lap-Pui Chau, *,1Yi Wang

1The Hong Kong Polytechnic University

†Equal contribution, *Corresponding author
Figure 1: Overview of GUI-C2. (a) Random Select vs. (b) GUI-D data curation. (c) Limitations of existing agentic RL methods. (d) Our coarse-to-fine framework. (e) Performance comparison.

🔔News

[2026.5]: 🤩 Our training dataset GUI-C2-4K is released on HuggingFace.
[2026.5]: 🤩 Our paper is released on arXiv.

Abstract

Existing agentic reinforcement learning methods for GUI grounding are limited at two levels. At the data level, current approaches typically treat all training samples equally, even though a sample's training value for the baseline model varies with its difficulty. Overlooking this can greatly reduce training efficiency or even cause training collapse. At the strategy level, existing frameworks struggle to balance cropping larger regions for sufficient context against cropping smaller ones for reduced redundancy, a tension inherent to tool-augmented grounding agents. In addition, overly complex decision-making is difficult for small-parameter models and significantly increases inference time. To address these issues at the data level, we propose GUI-D, a data mining and difficulty scoring pipeline that identifies training-worthy samples through rollout testing and assigns difficulty scores that guide subsequent training weights. At the strategy level, we propose GUI-C2, which employs an area-gated coarse-to-fine refinement mechanism that progressively narrows the visual field using model-internal uncertainty signals. It adaptively preserves context for large targets while amplifying precision for small ones, and it is reinforced by improvement-aware stage rewards that ensure each refinement genuinely advances grounding. We also simplify the decision-making process, which greatly reduces the additional inference time. Extensive experiments show that our method achieves state-of-the-art performance. Code will be publicly available.

GUI-C2 Framework

GUI-D: Difficulty-Aware Data Curation

GUI-D is a lightweight data curation pipeline that removes low-value GUI grounding samples and assigns each remaining sample a difficulty score for GRPO training. It consists of:

  • Instruction & Length Filtering: Retain action-oriented instructions; remove short item-name prompts.
  • Cross-domain Filtering: Remove web-like samples found in non-web domains via OCR detection.
  • Rollout Informativeness Filtering: Remove samples where all 8 rollouts succeed or all fail; keep mixed-outcome samples.
  • Difficulty Scoring: A weighted sum of 8-click success rate, prediction dispersion, localization error, target size, and parsing failures, normalized per-platform.

Figure 2: Overview of GUI-D data curation pipeline.
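
The rollout informativeness filter and the difficulty score described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the released pipeline: the `RolloutStats` fields, the values in `WEIGHTS`, the direction of each term (for example, that smaller targets count as harder), and the min-max normalization are all hypothetical. The source only states that the score is a weighted sum of these five signals, normalized per platform.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class RolloutStats:
    """Per-sample statistics from 8 rollouts (field names are hypothetical)."""
    platform: str
    success_rate: float     # fraction of the 8 clicks that hit the target
    dispersion: float       # spread of predicted points, scaled to [0, 1]
    loc_error: float        # mean distance to the target, scaled to [0, 1]
    target_area: float      # target area as a fraction of the screenshot
    parse_fail_rate: float  # fraction of rollouts whose output failed to parse


# Hypothetical weights; the source does not publish them here.
WEIGHTS = {"success": 0.4, "dispersion": 0.2, "error": 0.2, "size": 0.1, "parse": 0.1}


def is_informative(s: RolloutStats) -> bool:
    """Keep only mixed-outcome samples: not all-success and not all-fail."""
    return 0.0 < s.success_rate < 1.0


def raw_difficulty(s: RolloutStats) -> float:
    """Weighted sum of the five signals; higher means harder (assumed sign)."""
    return (
        WEIGHTS["success"] * (1.0 - s.success_rate)
        + WEIGHTS["dispersion"] * s.dispersion
        + WEIGHTS["error"] * s.loc_error
        + WEIGHTS["size"] * (1.0 - s.target_area)
        + WEIGHTS["parse"] * s.parse_fail_rate
    )


def score_samples(samples):
    """Filter samples, then min-max normalize raw scores within each platform."""
    kept = [s for s in samples if is_informative(s)]
    by_platform = defaultdict(list)
    for s in kept:
        by_platform[s.platform].append(raw_difficulty(s))

    scored = []
    for s in kept:
        lo, hi = min(by_platform[s.platform]), max(by_platform[s.platform])
        raw = raw_difficulty(s)
        # A platform with a single distinct score gets the neutral value 0.5.
        norm = 0.5 if hi == lo else (raw - lo) / (hi - lo)
        scored.append((s, norm))
    return scored
```

Normalizing within each platform keeps one platform with systematically harder screenshots from dominating the weights of the others.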

GUI-C2: Coarse-to-Fine Policy

GUI-C2 simplifies the decision-making process into predicting the size of the target bounding box, which serves as both a refinement trigger and a dense supervision signal. Key components include:

  • Difficulty-aware training weight adjustment: We incorporate sample-wise difficulty scores into the GRPO objective as interval-bounded loss weighting factors. This upweights gradients from challenging instances while capping the maximum weight disparity, which guards against overfitting to noisy difficulty estimates.
  • Area-gated refinement: The model predicts a bounding box on the full screenshot; if the predicted area is smaller than threshold τ1, it crops the region and predicts again (Stage 2); if still smaller than τ2, it crops once more (Stage 3).
  • No explicit thinking: Unlike other agentic frameworks, GUI-C2 removes the thinking stage entirely, reducing inference time significantly.
  • Improvement-aware stage rewards: Each refinement stage receives an auxiliary reward that encourages effective crops and penalizes harmful ones.
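
The area-gated refinement in the list above can be sketched as a short loop. This is an illustration under assumptions rather than the authors' implementation: `predict_box` is a hypothetical stand-in for the model call, the predicted area is measured as a fraction of the current view, the crop is a fixed-ratio window centered on the prediction, and the threshold values are placeholders.

```python
from typing import Callable, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in screenshot pixels


def area_fraction(box: Box, view: Box) -> float:
    """Area of `box` relative to the area of the current `view`."""
    bw, bh = max(0.0, box[2] - box[0]), max(0.0, box[3] - box[1])
    vw, vh = view[2] - view[0], view[3] - view[1]
    return (bw * bh) / (vw * vh)


def crop_around(box: Box, view: Box, crop_ratio: float) -> Box:
    """Window centered on `box`, `crop_ratio` times the view size, kept inside the view."""
    cx, cy = (box[0] + box[2]) / 2, (box[1] + box[3]) / 2
    w, h = (view[2] - view[0]) * crop_ratio, (view[3] - view[1]) * crop_ratio
    x1 = min(max(cx - w / 2, view[0]), view[2] - w)
    y1 = min(max(cy - h / 2, view[1]), view[3] - h)
    return (x1, y1, x1 + w, y1 + h)


def coarse_to_fine(
    predict_box: Callable[[Box], Box],  # hypothetical model call: view -> box
    screen: Box,
    thresholds: Tuple[float, ...] = (0.01, 0.01),  # placeholder tau1, tau2
    crop_ratio: float = 0.5,                       # placeholder crop size
):
    """Return the click point and the number of stages that were run."""
    view = screen
    box = predict_box(view)  # Stage 1: predict on the full screenshot
    stages = 1
    for tau in thresholds:   # tau1 gates Stage 2, tau2 gates Stage 3
        if area_fraction(box, view) >= tau:
            break            # target is large enough: stop refining
        view = crop_around(box, view, crop_ratio)
        box = predict_box(view)
        stages += 1
    click = ((box[0] + box[2]) / 2, (box[1] + box[3]) / 2)
    return click, stages
```

Because the gate is the predicted box area itself, the model makes no separate decision about whether to crop, which is what keeps the extra inference cost small.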

Figure 3: Overall framework of GUI-C2. The left panel illustrates the coarse-to-fine policy and reward design. The right panel shows difficulty-aware GRPO and auxiliary refinement rewards.
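
The two training-side components in the right panel can be sketched as follows. The group-normalized advantage is the standard GRPO form; everything else is an assumption for illustration: the weight interval `[w_min, w_max]`, the linear mapping from difficulty to weight, and the use of an IoU change to decide whether a crop helped. The source states only that weights are interval-bounded and that stage rewards encourage effective crops and penalize harmful ones.

```python
from statistics import mean, pstdev


def difficulty_weight(d: float, w_min: float = 0.5, w_max: float = 1.5) -> float:
    """Map a difficulty score in [0, 1] to a bounded loss weight (hypothetical bounds)."""
    d = min(max(d, 0.0), 1.0)  # clamp noisy scores into [0, 1]
    return w_min + (w_max - w_min) * d


def weighted_group_advantages(rewards, difficulty: float, eps: float = 1e-6):
    """GRPO-style group-normalized advantages, scaled by the sample's weight."""
    mu, sigma = mean(rewards), pstdev(rewards)
    w = difficulty_weight(difficulty)
    return [w * (r - mu) / (sigma + eps) for r in rewards]


def stage_reward(iou_before: float, iou_after: float,
                 bonus: float = 0.1, penalty: float = 0.1) -> float:
    """Auxiliary reward for one refinement stage (assumed IoU-based criterion)."""
    if iou_after > iou_before:
        return bonus       # the crop improved the prediction
    if iou_after < iou_before:
        return -penalty    # the crop made the prediction worse
    return 0.0
```

Bounding the weight means a sample with a badly estimated difficulty can change its gradient scale by at most `w_max / w_min`.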

Key Contributions

  • GUI-D: A data filtering and dynamic weighting pipeline for GRPO that leverages the difficulty metric to optimize sample utilization, improving training efficiency.
  • GUI-C2: An efficient agentic RL framework for coarse-to-fine GUI grounding. By formulating multi-stage cropping as bounding box prediction, it applies distinct learning behaviors to simple and hard samples, reducing decision complexity while improving inference efficiency.
  • By combining dynamic difficulty weighting and coarse-to-fine cropping, our method achieves state-of-the-art performance under comparable conditions (3B) on three commonly used benchmarks, remarkably using only 4,624 training samples.

Leaderboard

ScreenSpot-Pro

Comparison of different agent models on ScreenSpot-Pro. Results marked in bold and underline represent the best and second-best performance. Each category (CAD, Scientific, Creative, Development, Office, OS) is split into an icon column and a text column.

Model CAD icon CAD text Scientific icon Scientific text Creative icon Creative text Develop. icon Develop. text Office icon Office text OS icon OS text Avg.
Proprietary Models
GPT-4o 2.0 0.0 1.3 0.0 1.0 0.0 2.1 0.0 1.1 0.0 0.0 0.0 0.8
Claude Computer Use 14.5 3.7 22.0 3.9 25.9 3.4 33.9 15.8 30.1 16.3 11.0 4.5 17.1
General Open-source Models
Qwen2.5-VL-3B 9.1 7.3 22.1 1.4 26.8 2.1 38.2 7.3 33.9 15.1 10.3 1.1 16.1
Qwen2.5-VL-7B 16.8 1.6 46.8 4.1 35.9 7.7 49.3 7.3 52.5 20.8 37.4 6.7 26.8
GUI-Specific Models (SFT+RL)
CogAgent-18B 7.1 3.1 14.9 0.7 9.6 0.0 22.2 1.8 13.0 0.0 5.6 0.0 7.7
OS-Atlas-7B 12.2 4.7 33.1 1.4 28.8 2.8 37.5 7.3 33.9 5.7 27.1 4.5 18.9
ShowUI-2B 2.5 0.0 16.9 1.4 9.1 0.0 13.2 7.3 15.3 7.5 10.3 2.2 7.7
UGround-7B 14.2 1.6 26.6 2.1 27.3 2.8 31.9 2.7 31.6 11.3 17.8 0.0 16.5
UGround-V1-7B 15.8 1.2 51.9 2.8 47.5 9.7 57.6 14.5 60.5 13.2 38.3 7.9 31.1
UI-TARS-2B 17.8 4.7 47.4 4.1 42.9 6.3 56.9 17.3 50.3 17.0 21.5 5.6 27.7
UI-TARS-7B 20.8 9.4 58.4 12.4 50.0 9.1 63.9 31.8 63.3 20.8 30.8 16.9 35.7
InfiGUI-R1-3B 33.0 14.1 51.3 12.4 44.9 7.0 58.3 20.0 65.5 28.3 43.9 12.4 35.7
GUI-Specific Models (RL Only)
UI-R1-3B 11.2 6.3 22.7 4.1 27.3 3.5 42.4 11.8 32.2 11.3 13.1 4.5 17.8
GUI-R1-3B 26.4 7.8 33.8 4.8 40.9 5.6 61.8 17.3 53.6 17.0 28.1 5.6 30.2
GUI-R1-7B 23.9 6.3 49.4 4.8 38.9 8.4 55.6 11.8 58.7 26.4 42.1 16.9 32.4
SE-GUI-3B 38.1 12.5 55.8 7.6 47.0 4.9 61.8 16.4 59.9 24.5 40.2 12.4 35.9
GUI-G1-3B 39.6 9.4 50.7 10.3 36.6 11.9 61.8 30.0 67.2 32.1 23.5 10.6 37.1
GUI-Eyes-3B 48.2 9.4 70.8 12.4 56.6 13.3 69.4 19.1 75.7 24.5 59.8 20.2 44.8
GUI-C2-3B (Ours) 43.7 21.9 72.1 19.3 56.6 14.7 68.8 21.8 74.6 37.7 57.9 27.0 46.4

Table 1: Performance comparison on ScreenSpot-Pro. CAD, Scientific, Creative, Development, Office, and OS are sub-datasets; icon/text columns denote the target type within each sub-dataset.

ScreenSpot & ScreenSpot-v2

Comparison of model performance on ScreenSpot and ScreenSpot-v2. Results marked in bold and underline represent the best and second-best performance.

Model Train Samples SS Mobile SS Desktop SS Web SS Avg. SSV2 Mobile SSV2 Desktop SSV2 Web SSV2 Avg.
Proprietary Models
GPT-4o - 21.9 17.8 9.4 18.8 22.5 22.2 12.4 20.1
General Open-source Models
Qwen2-VL-7B - 50.3 40.4 27.4 42.9 39.4 50.1 27.7 39.8
Qwen2.5-VL-3B - - - - 55.5 55.5 44.0 39.1 46.9
Qwen2.5-VL-7B - - - - 84.7 92.8 78.4 85.4 86.5
GUI-Specific Models
CogAgent-18B 222M 57.8 31.6 40.1 47.4 50.6 51.6 54.1 52.8
SeeClick-7B 1M 68.1 48.8 41.8 53.4 51.8 65.5 40.7 53.9
UGround-7B 10M 75.9 75.8 78.3 73.3 74.3 74.9 78.6 76.3
ShowUI-2B 256K 84.8 70.8 76.2 75.1 70.0 85.1 73.3 77.3
OS-Atlas-4B 13M 56.2 74.9 69.9 68.5 74.9 56.9 70.7 68.5
OS-Atlas-7B 13M 85.0 78.8 84.5 82.5 78.3 85.5 83.8 83.3
Aguvis-7B 1M 86.9 82.4 84.7 84.4 89.6 86.8 84.9 87.3
UI-TARS-2B 2M 85.0 81.4 79.8 82.3 87.9 81.4 82.9 84.7
GUI-C2-3B (Ours) 4.6K 85.5 87.1 85.1 85.8 88.2 88.6 86.7 87.8

Table 2: Comparison on ScreenSpot and ScreenSpot-v2. Bold = best, underline = second best.

GUI-C2-7B Results

Performance comparison on ScreenSpot-Pro. All methods use Qwen2/2.5-VL-7B as the base model.

Model ScreenSpot-Pro Avg.
Qwen2.5-VL-7B 26.8
GUI-R1-7B 32.4
JEDI-7B 39.5
GUI-Actor-7B 44.6
SE-GUI-7B 47.3
GUI-G2-7B 47.5
OpenCUA-7B 50.0
GTA1-7B 50.1
GUI-C2-7B (Ours) 50.8

Table 3: GUI-C2-7B on ScreenSpot-Pro.

Ablation Study

Ablation experiments on ScreenSpot-Pro (GUI-C2-3B).

Variant Inference Time (s) Avg.
GUI-C2 (Full) 3.05 46.4
w/o tool use 1.50 37.0
w/o coarse-to-fine policy 2.42 41.4
w/o difficulty-aware 3.05 43.7
w/ self-action-decide 24.81 43.5
adaptive crop ratio 3.05 43.8
w/ self-action-ratio-decide 32.71 43.4

Table 4: Ablation experiments on ScreenSpot-Pro.

Hyperparameter Study

Figure 4: Hyperparameter study on ScreenSpot-Pro.

Figure 5: Comparison of different maximum allowable crop stages on GUI-C2-3B.

Visualization Analysis

From left to right, the three examples show our results with two crops, one crop, and a direct click, respectively. The two-crop and one-crop predictions are correct, while the direct-click prediction is incorrect.

Figure 6: Visualization analysis.

Reward Function Parameters

Performance comparison of different reward function parameters on ScreenSpot-Pro (GUI-C2-3B).

λclick λiou λfmt Avg.
0.6 0.3 0.1 46.4
0.5 0.4 0.1 45.7
0.7 0.2 0.1 46.1

Table 5: Reward function parameter study on ScreenSpot-Pro.
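
As a reading aid for the table above, the sketch below shows one plausible way the three coefficients could combine. It is an assumption, not the paper's stated formula: the weighted-sum form, the definition of each term (click inside the ground-truth box, box IoU, output parsed correctly), and the zero reward on parse failure are all hypothetical. Only the coefficient values 0.6, 0.3, and 0.1 come from the table.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def total_reward(pred: Box, gt: Box, parsed_ok: bool,
                 lam_click: float = 0.6, lam_iou: float = 0.3,
                 lam_fmt: float = 0.1) -> float:
    """Assumed weighted sum of click, IoU, and format terms."""
    if not parsed_ok:
        return 0.0  # assumed: unparseable output earns nothing
    cx, cy = (pred[0] + pred[2]) / 2, (pred[1] + pred[3]) / 2
    # Click term: 1 if the center of the predicted box lands inside the target.
    r_click = 1.0 if gt[0] <= cx <= gt[2] and gt[1] <= cy <= gt[3] else 0.0
    r_iou = iou(pred, gt)
    r_fmt = 1.0  # the output parsed, so the format term is satisfied
    return lam_click * r_click + lam_iou * r_iou + lam_fmt * r_fmt
```

Under this reading, the IoU term supplies a dense signal even when the click misses, which matches the description of the predicted box as a dense supervision signal.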

BibTeX


@misc{li2026guic2coarsetofineguigrounding,
      title={GUI-C$^2$: Coarse-to-Fine GUI Grounding via Difficulty-Aware Reinforcement Learning}, 
      author={Junlong Li and Chao Hao and Lap-Pui Chau and Yi Wang},
      year={2026},
      eprint={2605.30884},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2605.30884}, 
}