Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
233 changes: 233 additions & 0 deletions presentation.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,233 @@
---
marp: true
theme: default
paginate: true
style: |
section {
font-family: 'Segoe UI', Arial, sans-serif;
font-size: 21px;
padding: 36px 50px;
}
h1 { font-size: 34px; color: #1a1a2e; }
h2 { font-size: 26px; color: #16213e; border-bottom: 2px solid #0f3460; padding-bottom: 6px; margin-bottom: 14px; }
table { font-size: 17px; width: 100%; border-collapse: collapse; }
th { background: #0f3460; color: white; padding: 6px 10px; }
td { padding: 5px 10px; border-bottom: 1px solid #ddd; }
.highlight { background: #e8f4f8; border-left: 4px solid #0f3460; padding: 8px 14px; border-radius: 4px; margin-top: 10px; }
.positive { background: #eafaf1; border-left: 4px solid #27ae60; padding: 8px 14px; border-radius: 4px; margin-top: 10px; }
.warn { background: #fef9e7; border-left: 4px solid #f39c12; padding: 8px 14px; border-radius: 4px; margin-top: 10px; }
code { background: #f4f4f4; padding: 2px 6px; border-radius: 3px; font-size: 17px; }
---

<!-- Slide 1: Title -->
# Cross-Website GUI Grounding
## with Verifiable Reward Optimization

**CSC6129 Reinforcement Learning — Final Project**
Che Cheng · 2026

<br>

**Three core findings:**
1. Hybrid OCR/DOM candidate augmentation is **critical** when candidate structure is semantically informative
2. Point-first grounding **transfers well** across held-out GUI benchmarks
3. Reward-based reranking **helps conditionally** — gains saturate once supervision is already strong

---

<!-- Slide 2: Problem Framing -->
## What is GUI Grounding?

**Task:** Given a screenshot + natural-language instruction → predict the correct UI action

```
Input: [screenshot] + "Click the search button"
Output: click_point: (0.52, 0.13) bbox: [0.45, 0.10, 0.60, 0.17] action_type: click
```

**Scope:** Single-step grounding only — not full browser automation or long-horizon planning.

**Why it's an RL problem** — reward is **verifiable and deterministic:**

| Signal | Meaning |
|---|---|
| Element match | Predicted element = annotated element |
| Click inside target | Click point falls within bounding box |
| IoU ≥ 0.5 | Predicted box overlaps ground truth |
| Action-type correct | `click` / `type` / `select` matches label |
| Invalid format penalty | Penalizes malformed coordinate output |

→ Natural fit for a **contextual bandit**: context = screenshot + instruction, action = grounded UI action

---

<!-- Slide 3: System Architecture -->
## System Architecture

```
Screenshot + Instruction (+ OCR/DOM candidate anchors ← hybrid mode)
┌─────────────────────────┐
│ Stage A · Grounding │ Qwen2.5-VL 3B SFT on Mind2Web
│ click_point + bbox │ candidate_slot supervision (hybrid)
└────────────┬────────────┘
│ k = 4 candidates
┌─────────────────────────┐
│ Stage B · Reranker │ Verifiable reward scoring
│ best-of-k selection │ Learned reward model
└────────────┬────────────┘
Final Action Output
```

**Two research questions:**
1. How strong can Stage A become?
2. When does Stage B reranking still add value on top?

---

<!-- Slide 4: Benchmarks -->
## Three Benchmarks, Three Roles

| Benchmark | Role | Key Purpose |
|---|---|---|
| **Mind2Web** | Primary supervised | Stage A SFT, hybrid augmentation, Stage B reranking, headroom analysis |
| **ScreenSpot-v2** | Primary held-out | Clean transfer evaluation: point-native & dual-path verifier |
| **VisualWebBench** | Supplementary transfer | What transfers vs. what breaks under protocol mismatch |

<br>

**Mind2Web generalization splits** — three levels of difficulty:

| Split | Description |
|---|---|
| `test_task` | Same website, new task instructions |
| `test_website` | Held-out websites (harder) |
| `test_domain` | Held-out domains (hardest) |

---

<!-- Slide 5: Mind2Web Hybrid Stage A -->
## Mind2Web Stage A: Hybrid Candidate Augmentation

**Key design:** Augment screenshot with compact OCR/DOM-style candidate anchors + supervise auxiliary `candidate_slot` target alongside grounding output.

Pure visual Stage A was insufficient (point acc ≈ 0.04 even after bug fix) → structured candidate evidence was essential.

<div class="positive">

**Hybrid Stage A — internal validation jump:**
Point accuracy **0.0375 → 0.7875** &nbsp;·&nbsp; Mean IoU **0.0061 → 0.7314**

</div>

**Official cached subset results:**

| Split | Element Acc | Click-Point Acc | IoU@0.5 | Action Acc |
|---|---:|---:|---:|---:|
| `test_task` | **95.00%** | **95.00%** | **95.00%** | 90.00% |
| `test_website` | 80.00% | 85.00% | 85.00% | **95.00%** |
| `test_domain` | 89.47% | 89.47% | 89.47% | 94.74% |

Average gain over pure visual: **+88 pts** on element / point / IoU@0.5

---

<!-- Slide 6: Stage B Reranking -->
## Stage B: When Does Reranking Help?

**Setup:** Stage A generates k = 4 candidates → learned reward model selects the best one.

**Oracle best-of-k headroom** (upper bound on what reranking can recover):

| Candidate Pool | Oracle Point Gain | Reward Gain |
|---|---:|---:|
| Historical k=4 &nbsp;(weak Stage A) | **+10.17 pts** | 0.0376 |
| Historical k=8 expanded | **+10.17 pts** | 0.0431 |
| **Final hybrid rebuild k=4** | **+5.08 pts** | 0.0983 |

<div class="highlight">

**Key insight:** Weak Stage A left ~10 pts of recoverable headroom in the candidate pool. Strong hybrid Stage A cut that headroom to ~5 pts. **Reranking is a conditional gain, not a universal win.**

</div>

**Design trade-off:** Investing in better Stage A representation delivers more reliable gains than scaling up Stage B reranking.

---

<!-- Slide 7: ScreenSpot-v2 -->
## ScreenSpot-v2: Held-Out Transfer Benchmark

**Debugging step that unblocked evaluation:** Initial results were near-zero — traced to a coordinate-frame mismatch (model predicted in resized-image space; evaluation scored against original-image coordinates). Fixing this produced the first credible held-out baseline.

**Final method comparison:**

| Method | Point Acc | IoU@0.5 | Mean IoU |
|---|---:|---:|---:|
| Reproduced public Qwen baseline | 75.63% | 5.19% | 13.27% |
| Point-native decoupled | 77.36% | 9.67% | 19.12% |
| **Dual-path verifier** | **77.91%** | **17.22%** | **25.20%** |

<div class="positive">

Point-native inference beat the reproduced baseline. Dual-path verifier improved IoU@0.5 by **+12 pts** with only **0.000167 s/image** overhead.

</div>

**Subgroup finding:** Text elements outperform icon elements by **+19.70 pts** point accuracy.

---

<!-- Slide 8: VisualWebBench -->
## VisualWebBench: Sharpening the Transfer Claim

**Protocol difference from Mind2Web:** Anonymous 8-box layout — no semantic candidate labels attached to boxes.

| Method | Official Choice Acc | Point Acc | Mean IoU |
|---|---:|---:|---:|
| Structured screenshot-only | 78.88% | 64.53% | 32.06% |
| **Point-native decoupled** | **87.21%** | **79.46%** | 34.35% |
| Dual-path verifier | 86.82% | 79.65% | 33.12% |
| Mind2Web hybrid transfer | 23.84% | 23.84% | 23.84% |

<div class="warn">

**Mind2Web hybrid transfer collapsed to 23.84%** — candidate slot semantics do not survive an anonymous-box protocol.

</div>

**Refined transfer claim:** Point-first grounding is robust across protocols. Candidate-aware methods require matching benchmark semantics to transfer.

---

<!-- Slide 9: Final Takeaways -->
## Final Takeaways

**What worked well:**

✅ Hybrid OCR/DOM candidate augmentation → **+88 pts** on Mind2Web — decisive, not marginal

✅ Point-first grounding → strong transfer on both ScreenSpot-v2 and VisualWebBench

✅ Dual-path verifier → consistent IoU gain with negligible runtime overhead

<br>

**Key design lessons:**

⚠️ Reward-based reranking is **conditional** — valuable when the baseline leaves recoverable headroom; saturates once Stage A is strong

⚠️ Candidate-aware transfer requires **protocol compatibility** — fails on anonymous-box benchmarks

<br>

<div class="highlight">

**Honest scope:** This is a single-step perception layer, not a complete browser agent.
Correct claim: *"strong representation first, reward as conditional second-stage gain"*

</div>
Binary file added presentation.pptx
Binary file not shown.
122 changes: 122 additions & 0 deletions script.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,122 @@
# Speaking Script — Cross-Website GUI Grounding
10 minutes · estimated time per section in brackets

---

## Slide 1 · Title + Core Thesis (~1 min)

Hi everyone. My project is called Cross-Website GUI Grounding with Verifiable Reward Optimization — this is my final project for CSC6129 Reinforcement Learning.

The core question is: can we train a model to look at a webpage screenshot, read a natural-language instruction, and accurately locate and interact with the right UI element?

We ran experiments across three benchmarks and came away with three main findings. First, hybrid OCR/DOM candidate augmentation is critical for strong performance on Mind2Web — it's not a marginal gain, it's what makes the system work at all. Second, point-first grounding transfers well across held-out benchmarks without needing matched candidate structure. Third, reward-based reranking is a conditional gain — it helps when the baseline is weak enough to leave recoverable headroom, but the benefit shrinks sharply once Stage A becomes strong.

---

## Slide 2 · Problem Framing (~1 min)

Let me define the task clearly.

The input is a screenshot plus a natural-language instruction — something like "click the search button." The output is a predicted click point, a bounding box, and an action type: click, type, or select.

An important scope note: we study single-step grounding only. This is not a full browser agent — there's no multi-step planning, no execution loop, no recovery from errors. We isolate the perception-and-selection layer.

This task is a natural fit for reinforcement learning because the reward is fully verifiable. We can check automatically: did the predicted click point land inside the target element? Does the bounding box overlap with ground truth at IoU ≥ 0.5? Is the action type correct? These are all deterministic checks. So we frame it as a contextual bandit — the screenshot and instruction are the context, the grounded action is the action, and the reward is computed from these rules.

---

## Slide 3 · System Architecture (~1 min)

The system has two stages.

Stage A is the grounding model, built on Qwen2.5-VL 3B. It takes a screenshot and instruction as input, and predicts the click point and bounding box. In hybrid mode, it also receives compact OCR/DOM-style candidate anchors and is supervised on an auxiliary candidate slot target alongside the grounding output.

Stage B is a reranker. Stage A generates k equals 4 candidates, and a learned reward model selects the best one.

The project answers two design questions: how strong can Stage A become, and when does Stage B reranking still add meaningful value on top of that?

---

## Slide 4 · Three Benchmarks (~1 min)

We use three benchmarks, each serving a different role.

Mind2Web is our primary supervised benchmark. We train Stage A here, build the hybrid augmentation, run Stage B candidate generation and reranker training, and analyze recoverable headroom. It has three test splits: test_task tests new tasks on seen websites, test_website tests on unseen websites, and test_domain tests on unseen domains. These three splits let us measure how quickly generalization degrades in open-web settings.

ScreenSpot-v2 is our primary held-out benchmark. It doesn't touch training at all — it's purely for evaluating transfer.

VisualWebBench is a supplementary transfer check. We use it to test whether our conclusions hold under a different benchmark protocol.

---

## Slide 5 · Mind2Web Hybrid Breakthrough (~1.5 min)

The most important result on Mind2Web is the hybrid candidate augmentation breakthrough.

We started with a pure visual Stage A. We fixed a geometry-collapse bug first, so the comparison is fair. Even after that fix, pure visual Stage A only reached an internal point accuracy of 0.04 and a mean IoU of 0.006 — essentially no effective localization. Screenshot-only supervision couldn't give the model enough structured evidence to ground UI elements reliably.

After adding OCR/DOM candidate anchors, the result changed dramatically. Internal point accuracy jumped from 0.04 to 0.79, and mean IoU went from 0.006 to 0.73.

On the official cached subset, the results across three splits are: 95% click-point accuracy on test_task, 85% on test_website, and 89.5% on test_domain. The average gain over pure visual is plus 88 percentage points on element accuracy, click-point accuracy, and IoU@0.5.

This isn't a fine-tuning margin — it's the difference between a system that can and cannot ground UI elements at all. The candidate structure carries semantic evidence that the model simply cannot recover from the screenshot alone.

---

## Slide 6 · Stage B Reranking (~1 min)

The Stage B conclusion is more nuanced and worth being careful about.

We measure oracle best-of-k headroom — that's the theoretical upper bound on what reranking can recover from a candidate pool, assuming a perfect selector.

On the historical weak Stage A, oracle point gain was plus 10 points for both k equals 4 and k equals 8. There was real recoverable headroom in the pool.

After rebuilding the candidate pool on the strong hybrid Stage A, the same k equals 4 oracle gain shrank to plus 5 points — cut in half.

The takeaway is: the stronger Stage A becomes, the fewer recoverable errors remain in the pool, and the less reranking can do. Investing in Stage A representation delivers more reliable gains than scaling up Stage B candidate diversity. Reranking is not a universal win — it's a second-stage mechanism that depends on the baseline leaving room to recover.

---

## Slide 7 · ScreenSpot-v2 (~1.5 min)

ScreenSpot-v2 is our cleanest held-out benchmark, but we hit a serious problem early on — initial results were near zero.

After investigation, we found a coordinate-frame mismatch. The model was predicting coordinates in resized-image space, but the evaluation was scoring against original-image coordinates. The two systems were not aligned. Fixing this mismatch was what made the held-out benchmark meaningful.

After the fix, the method ladder looks like this. Our reproduced plain-Qwen public baseline reached 75.6% point accuracy. Point-native decoupled inference improved that to 77.4%. Adding the dual-path verifier brought it to 77.9% point accuracy and 17.2% IoU@0.5 — a plus 12 point IoU improvement over the baseline.

The verifier overhead is only 0.000167 seconds per image, so it's essentially free in terms of runtime.

One interesting subgroup finding: text elements outperform icon elements by nearly 20 percentage points in point accuracy, which suggests localization on icon-based targets still has room to improve.

---

## Slide 8 · VisualWebBench (~1 min)

VisualWebBench sharpened our transfer claim by revealing where it breaks.

This benchmark uses an anonymous 8-box protocol — no semantic candidate labels, just numbered boxes. That's a different protocol from Mind2Web.

Point-native inference still transferred well here: 87.2% official choice accuracy, about 9 points above structured screenshot-only. Dual-path verifier matched that roughly.

But when we took the Mind2Web hybrid candidate-aware model and applied it directly, performance collapsed to 23.8%. The candidate slot semantics simply don't exist in an anonymous-box protocol — the model received structurally mismatched inputs and failed.

So the refined claim is: point-first inference is robust across protocols because it doesn't rely on semantic candidate labels. Candidate-aware methods require protocol compatibility to transfer.

---

## Slide 9 · Final Takeaways (~1 min)

Let me close with the key lessons.

Three things worked reliably. Hybrid OCR/DOM candidate augmentation delivered plus 88 points on Mind2Web — that's a decisive result, not a marginal one. Point-first grounding transferred well to both ScreenSpot-v2 and VisualWebBench without needing matched candidate structure. The dual-path verifier gave consistent IoU improvement at negligible runtime cost.

Two things were conditional or failed. Reward-based reranking is valuable when the baseline is weak, but saturates after strong Stage A supervision. Candidate-aware methods fail under mismatched benchmark protocols.

The one-sentence summary of this project: strong representation first, reward as a conditional second-stage gain. The project doesn't claim RL always wins — it claims that verifiable reward adds measurable value under specific conditions, and we've been honest about where those conditions stop holding.

Thank you. Happy to take questions.

---

*Total ~950 words · approximately 9–10 minutes at a comfortable presentation pace.*