GUI-C²: Coarse-to-Fine GUI Grounding via Difficulty-Aware Reinforcement Learning

🔔News

[2026.5]: 🤩 Our training dataset GUI-C2-4K is released on Hugging Face.
[2026.5]: 🤩 Our paper is released on arXiv.

Abstract

Existing agentic reinforcement learning methods for GUI grounding are limited at two levels. At the data level, current approaches typically treat all training samples equally, even though a sample's training value to the base model varies with its difficulty. Overlooking this can greatly reduce training efficiency or even cause training collapse. At the strategy level, existing frameworks struggle to balance cropping larger regions for sufficient context against cropping smaller ones for reduced redundancy, a tension inherent to tool-augmented grounding agents. In addition, overly complex decision-making is difficult for small models and significantly increases inference time. To address these issues at the data level, we propose GUI-D, a data mining and difficulty scoring pipeline that identifies training-worthy samples through rollout-based testing and assigns difficulty scores that guide subsequent training weights. At the strategy level, we propose GUI-C², which employs an area-gated coarse-to-fine refinement mechanism that progressively narrows the visual field via model-internal uncertainty signals, adaptively preserving context for large targets while amplifying precision for small ones, reinforced by improvement-aware stage rewards that ensure each refinement genuinely advances grounding. We also simplify the decision-making process to greatly reduce additional inference time. Extensive experiments show that our method achieves state-of-the-art performance. Code will be publicly available.

GUI-D: Difficulty-Aware Data Curation

GUI-D is a lightweight data curation pipeline that removes low-value GUI grounding samples and assigns each remaining sample a difficulty score for GRPO training. It consists of:

Instruction & Length Filtering: Retain action-oriented instructions; remove short item-name prompts.
Cross-domain Filtering: Remove web-like samples found in non-web domains via OCR detection.
Rollout Informativeness Filtering: Remove samples where all 8 rollouts succeed or all fail; keep mixed-outcome samples.
Difficulty Scoring: A weighted sum of the success rate over 8 rollouts, prediction dispersion, localization error, target size, and parsing failures, normalized per platform.
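The last two steps above can be illustrated with a minimal Python sketch. It is not the released GUI-D code: the `RolloutStats` fields, the `WEIGHTS` values, and the use of min-max normalization within each platform are all assumptions made for illustration, since the page only states that the score is a weighted sum normalized per platform.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class RolloutStats:
    """Per-sample statistics gathered from repeated rollouts (hypothetical schema)."""
    sample_id: str
    platform: str
    hits: list[bool]           # click success of each rollout (8 rollouts in the text)
    dispersion: float          # spread of predicted points, assumed scaled to [0, 1]
    loc_error: float           # localization error, assumed scaled to [0, 1]
    target_area_ratio: float   # target area divided by screenshot area
    parse_failures: int        # rollouts whose output could not be parsed


# Hypothetical weights; the actual values are not given on this page.
WEIGHTS = {"fail_rate": 0.4, "dispersion": 0.2, "loc_error": 0.2,
           "small_target": 0.1, "parse_fail": 0.1}


def is_informative(s: RolloutStats) -> bool:
    """Keep only mixed-outcome samples: neither all rollouts succeed nor all fail."""
    return 0 < sum(s.hits) < len(s.hits)


def raw_difficulty(s: RolloutStats) -> float:
    """Weighted sum of the five difficulty signals listed above."""
    n = len(s.hits)
    return (WEIGHTS["fail_rate"] * (1.0 - sum(s.hits) / n)
            + WEIGHTS["dispersion"] * s.dispersion
            + WEIGHTS["loc_error"] * s.loc_error
            + WEIGHTS["small_target"] * (1.0 - s.target_area_ratio)
            + WEIGHTS["parse_fail"] * s.parse_failures / n)


def score_dataset(stats: list[RolloutStats]) -> dict[str, float]:
    """Filter uninformative samples, then min-max normalize scores per platform."""
    kept = [s for s in stats if is_informative(s)]
    by_platform = defaultdict(list)
    for s in kept:
        by_platform[s.platform].append(raw_difficulty(s))
    scores = {}
    for s in kept:
        lo, hi = min(by_platform[s.platform]), max(by_platform[s.platform])
        # A platform with a single distinct score gets a neutral 0.5.
        scores[s.sample_id] = 0.5 if hi == lo else (raw_difficulty(s) - lo) / (hi - lo)
    return scores
```

Normalizing within each platform keeps the scores comparable across platforms whose raw statistics differ in scale, for example small desktop targets versus large mobile targets.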

Figure 2: Overview of GUI-D data curation pipeline.

GUI-C²: Coarse-to-Fine Policy

GUI-C² simplifies the decision-making process into predicting the size of the target bounding box, which serves as both a refinement trigger and a dense supervision signal. Key components include:

Difficulty-aware training weight adjustment: We incorporate per-sample difficulty scores into the GRPO objective as interval-bounded loss weights. This upweights gradients from challenging instances while capping the maximum weight disparity, which guards against overfitting to noisy difficulty estimates.
Area-gated refinement: The model predicts a bounding box on the full screenshot (Stage 1); if the predicted area is smaller than threshold τ₁, it crops the region and predicts again (Stage 2); if the Stage 2 prediction is still smaller than threshold τ₂, it crops once more (Stage 3).
No explicit thinking: Unlike other agentic frameworks, GUI-C² removes the thinking stage entirely, reducing inference time significantly.
Improvement-aware stage rewards: Each refinement stage receives an auxiliary reward that encourages effective crops and penalizes harmful ones.
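The area-gated refinement loop described above can be sketched in Python as follows. This is an illustrative sketch, not the authors' implementation: the `predict` callable stands in for the grounding model, and the threshold values in `taus`, the crop expansion `factor`, and the choice to measure box area relative to the current view are all assumptions.

```python
from typing import Callable, Sequence, Tuple

Box = Tuple[float, float, float, float]   # (x1, y1, x2, y2) in screenshot pixels
# Hypothetical model interface: given the current view (a region of the
# screenshot), return a box in view-relative coordinates within [0, 1].
Predictor = Callable[[Box], Box]


def area(b: Box) -> float:
    return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])


def to_global(view: Box, rel: Box) -> Box:
    """Map a view-relative box back to screenshot coordinates."""
    w, h = view[2] - view[0], view[3] - view[1]
    return (view[0] + rel[0] * w, view[1] + rel[1] * h,
            view[0] + rel[2] * w, view[1] + rel[3] * h)


def expand(box: Box, factor: float, bounds: Box) -> Box:
    """Grow a box around its center by `factor`, clamped to the screenshot."""
    cx, cy = (box[0] + box[2]) / 2, (box[1] + box[3]) / 2
    hw, hh = (box[2] - box[0]) * factor / 2, (box[3] - box[1]) * factor / 2
    return (max(bounds[0], cx - hw), max(bounds[1], cy - hh),
            min(bounds[2], cx + hw), min(bounds[3], cy + hh))


def ground(predict: Predictor, screen: Box,
           taus: Sequence[float] = (0.05, 0.05),   # assumed values for tau_1, tau_2
           factor: float = 3.0):                   # assumed crop expansion
    """Return (click point, final box, number of stages used)."""
    view = screen
    box = to_global(view, predict(view))
    stages = 1
    for tau in taus:
        # Gate: a box that is large relative to the view keeps its context.
        if area(box) / area(view) >= tau:
            break
        view = expand(box, factor, screen)   # crop around the small prediction
        box = to_global(view, predict(view))
        stages += 1
    click = ((box[0] + box[2]) / 2, (box[1] + box[3]) / 2)
    return click, box, stages
```

Because the only decision is a comparison of the predicted box area against a threshold, the loop needs no separate reasoning or tool-selection step, which is in line with the "No explicit thinking" component above.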

Figure 3: Overall framework of GUI-C². The left panel illustrates the coarse-to-fine policy and reward design. The right panel shows difficulty-aware GRPO and auxiliary refinement rewards.
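The difficulty-aware GRPO weighting and the auxiliary refinement reward named in the caption can be sketched as follows. This is a hedged illustration only: the weight interval `[w_min, w_max]`, the linear mapping from difficulty to weight, the `bonus` magnitude, and the use of a scalar "quality" measure per stage are assumptions, not values or formulas taken from the paper.

```python
from statistics import mean, pstdev


def difficulty_weight(d: float, w_min: float = 0.8, w_max: float = 1.2) -> float:
    """Map a difficulty score in [0, 1] to a loss weight bounded to [w_min, w_max].

    The bounded interval caps how much harder samples can dominate the update.
    """
    d = min(1.0, max(0.0, d))
    return w_min + (w_max - w_min) * d


def group_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """GRPO-style advantages: rewards standardized within one rollout group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def weighted_advantages(rewards: list[float], difficulty: float) -> list[float]:
    """Scale every advantage of a sample's rollout group by its difficulty weight."""
    w = difficulty_weight(difficulty)
    return [w * a for a in group_advantages(rewards)]


def stage_bonus(prev_quality: float, new_quality: float, bonus: float = 0.1) -> float:
    """Improvement-aware auxiliary reward for one refinement stage.

    `quality` is any grounding measure where higher is better (for example IoU
    with the target). A crop that improves it is rewarded, one that hurts it is
    penalized, and an unchanged result gets nothing.
    """
    if new_quality > prev_quality:
        return bonus
    if new_quality < prev_quality:
        return -bonus
    return 0.0
```

With these assumed bounds, the hardest sample contributes at most 1.5 times the gradient scale of the easiest one, so a noisy difficulty estimate cannot swing the update arbitrarily.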

Key Contributions

GUI-D: A data filtering and dynamic weighting pipeline for GRPO that leverages the difficulty metric to optimize sample utilization, improving training efficiency.
GUI-C²: An efficient agentic RL framework for coarse-to-fine GUI grounding. By formulating multi-stage cropping as bounding box prediction, it applies distinct learning behaviors to simple and hard samples, reducing decision complexity while improving inference efficiency.
By combining dynamic difficulty weighting and coarse-to-fine cropping, our method achieves state-of-the-art performance among comparable 3B models on three commonly used benchmarks while using only 4,624 training samples.

ScreenSpot-Pro

Comparison of different agent models on ScreenSpot-Pro. Results marked in bold and underline represent the best and second-best performance. Each of the six categories (CAD, Scientific, Creative, Development, Office, and OS) is split into icon-type and text-type columns.

Model	CAD icon	CAD text	Scientific icon	Scientific text	Creative icon	Creative text	Develop. icon	Develop. text	Office icon	Office text	OS icon	OS text	Avg.
Proprietary Models
GPT-4o	2.0	0.0	1.3	0.0	1.0	0.0	2.1	0.0	1.1	0.0	0.0	0.0	0.8
Claude Computer Use	14.5	3.7	22.0	3.9	25.9	3.4	33.9	15.8	30.1	16.3	11.0	4.5	17.1
General Open-source Models
Qwen2.5-VL-3B	9.1	7.3	22.1	1.4	26.8	2.1	38.2	7.3	33.9	15.1	10.3	1.1	16.1
Qwen2.5-VL-7B	16.8	1.6	46.8	4.1	35.9	7.7	49.3	7.3	52.5	20.8	37.4	6.7	26.8
GUI-Specific Models (SFT+RL)
CogAgent-18B	7.1	3.1	14.9	0.7	9.6	0.0	22.2	1.8	13.0	0.0	5.6	0.0	7.7
OS-Atlas-7B	12.2	4.7	33.1	1.4	28.8	2.8	37.5	7.3	33.9	5.7	27.1	4.5	18.9
ShowUI-2B	2.5	0.0	16.9	1.4	9.1	0.0	13.2	7.3	15.3	7.5	10.3	2.2	7.7
UGround-7B	14.2	1.6	26.6	2.1	27.3	2.8	31.9	2.7	31.6	11.3	17.8	0.0	16.5
UGround-V1-7B	15.8	1.2	51.9	2.8	47.5	9.7	57.6	14.5	60.5	13.2	38.3	7.9	31.1
UI-TARS-2B	17.8	4.7	47.4	4.1	42.9	6.3	56.9	17.3	50.3	17.0	21.5	5.6	27.7
UI-TARS-7B	20.8	9.4	58.4	12.4	50.0	9.1	63.9	31.8	63.3	20.8	30.8	16.9	35.7
InfiGUI-R1-3B	33.0	14.1	51.3	12.4	44.9	7.0	58.3	20.0	65.5	28.3	43.9	12.4	35.7
GUI-Specific Models (RL Only)
UI-R1-3B	11.2	6.3	22.7	4.1	27.3	3.5	42.4	11.8	32.2	11.3	13.1	4.5	17.8
GUI-R1-3B	26.4	7.8	33.8	4.8	40.9	5.6	61.8	17.3	53.6	17.0	28.1	5.6	30.2
GUI-R1-7B	23.9	6.3	49.4	4.8	38.9	8.4	55.6	11.8	58.7	26.4	42.1	16.9	32.4
SE-GUI-3B	38.1	12.5	55.8	7.6	47.0	4.9	61.8	16.4	59.9	24.5	40.2	12.4	35.9
GUI-G1-3B	39.6	9.4	50.7	10.3	36.6	11.9	61.8	30.0	67.2	32.1	23.5	10.6	37.1
GUI-Eyes-3B	48.2	9.4	70.8	12.4	56.6	13.3	69.4	19.1	75.7	24.5	59.8	20.2	44.8
GUI-C²-3B (Ours)	43.7	21.9	72.1	19.3	56.6	14.7	68.8	21.8	74.6	37.7	57.9	27.0	46.4

Table 1: Performance comparison on ScreenSpot-Pro. CAD, Scientific, Creative, Development, Office, and OS are sub-datasets; icon/text columns denote the target type within each sub-dataset.

ScreenSpot & ScreenSpot-v2

Comparison of model performance on ScreenSpot and ScreenSpot-v2. Results marked in bold and underline represent the best and second-best performance.

Model	Train Samples	SS Mobile	SS Desktop	SS Web	SS Avg.	SSV2 Mobile	SSV2 Desktop	SSV2 Web	SSV2 Avg.
Proprietary Models
GPT-4o	-	21.9	17.8	9.4	18.8	22.5	22.2	12.4	20.1
General Open-source Models
Qwen2-VL-7B	-	50.3	40.4	27.4	42.9	39.4	50.1	27.7	39.8
Qwen2.5-VL-3B	-	-	-	-	55.5	55.5	44.0	39.1	46.9
Qwen2.5-VL-7B	-	-	-	-	84.7	92.8	78.4	85.4	86.5
GUI-Specific Models
CogAgent-18B	222M	57.8	31.6	40.1	47.4	50.6	51.6	54.1	52.8
SeeClick-7B	1M	68.1	48.8	41.8	53.4	51.8	65.5	40.7	53.9
UGround-7B	10M	75.9	75.8	78.3	73.3	74.3	74.9	78.6	76.3
ShowUI-2B	256K	84.8	70.8	76.2	75.1	70.0	85.1	73.3	77.3
OS-Atlas-4B	13M	56.2	74.9	69.9	68.5	74.9	56.9	70.7	68.5
OS-Atlas-7B	13M	85.0	78.8	84.5	82.5	78.3	85.5	83.8	83.3
Aguvis-7B	1M	86.9	82.4	84.7	84.4	89.6	86.8	84.9	87.3
UI-TARS-2B	2M	85.0	81.4	79.8	82.3	87.9	81.4	82.9	84.7
GUI-C²-3B (Ours)	4.6K	85.5	87.1	85.1	85.8	88.2	88.6	86.7	87.8

Table 2: Comparison on ScreenSpot and ScreenSpot-v2. Bold = best, underline = second best.

GUI-C²-7B Results

Performance comparison on ScreenSpot-Pro. All methods use Qwen2/2.5-VL-7B as the base model.

Model	ScreenSpot-Pro Avg.
Qwen2.5-VL-7B	26.8
GUI-R1-7B	32.4
JEDI-7B	39.5
GUI-Actor-7B	44.6
SE-GUI-7B	47.3
GUI-G2-7B	47.5
OpenCUA-7B	50.0
GTA1-7B	50.1
GUI-C²-7B (Ours)	50.8

Table 3: GUI-C²-7B on ScreenSpot-Pro.

Ablation Study

Ablation experiments on ScreenSpot-Pro (GUI-C²-3B).

Variant	Inference Time (s)	Avg.
GUI-C² (Full)	3.05	46.4
w/o tool use	1.50	37.0
w/o coarse-to-fine policy	2.42	41.4
w/o difficulty-aware	3.05	43.7
w/ self-action-decide	24.81	43.5
adaptive crop ratio	3.05	43.8
w/ self-action-ratio-decide	32.71	43.4

Table 4: Ablation experiments on ScreenSpot-Pro.

Hyperparameter Study

Figure 4: Hyperparameter study on ScreenSpot-Pro.

Figure 5: Comparison of different maximum allowable crop stages on GUI-C²-3B.

Visualization Analysis

From left to right, the three examples show test results under the two-crop, one-crop, and direct-click actions, respectively. The two-crop and one-crop predictions are correct, while the direct-click prediction is incorrect.

Figure 6: Visualization analysis.

Reward Function Parameters

Performance comparison of different reward function parameters on ScreenSpot-Pro (GUI-C²-3B).

λ_click	λ_iou	λ_fmt	Avg.
0.6	0.3	0.1	46.4
0.5	0.4	0.1	45.7
0.7	0.2	0.1	46.1

Table 5: Reward function parameter study on ScreenSpot-Pro.
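To make the roles of λ_click, λ_iou, and λ_fmt concrete, the sketch below combines a click term, an IoU term, and a format term into one scalar reward. The linear combination, the definition of each term, and the zero reward for malformed output are assumptions for illustration; only the three coefficient names and the best-performing setting (0.6, 0.3, 0.1) come from the table.

```python
from typing import Optional, Tuple

Box = Tuple[float, float, float, float]   # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def grounding_reward(pred: Optional[Box], gt: Box,
                     lam_click: float = 0.6,
                     lam_iou: float = 0.3,
                     lam_fmt: float = 0.1) -> float:
    """Assumed form: lam_click * click + lam_iou * IoU + lam_fmt * format.

    `pred` is None when the model output could not be parsed into a box,
    in which case no term is earned.
    """
    if pred is None:
        return 0.0
    cx, cy = (pred[0] + pred[2]) / 2, (pred[1] + pred[3]) / 2
    # Click term: 1 if the center of the predicted box falls inside the target.
    click = 1.0 if gt[0] <= cx <= gt[2] and gt[1] <= cy <= gt[3] else 0.0
    return lam_click * click + lam_iou * iou(pred, gt) + lam_fmt * 1.0
```

Under this assumed form, a well-formed prediction that misses the target entirely still earns λ_fmt, while a perfect box earns the full sum of the three coefficients.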

BibTeX


@misc{li2026guic2coarsetofineguigrounding,
      title={GUI-C$^2$: Coarse-to-Fine GUI Grounding via Difficulty-Aware Reinforcement Learning}, 
      author={Junlong Li and Chao Hao and Lap-Pui Chau and Yi Wang},
      year={2026},
      eprint={2605.30884},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2605.30884}, 
}
