v0.2.0 — paper #11 companion: web category (10 tasks) + auto-cleanup + 15 reference verifiers
Companion release to paper #11, Grep-Routed Agents: Bypassing the LLM Tax on Computer-Use Tasks. Adds 10 new tasks (bench total 369 → 379), the auto-cleanup tool, the caffeinate warmup fix, and 15 per-category reference verifiers.
Scores
| Configuration | Pass | Time | Tokens |
|---|---|---|---|
| kinclaw v1.16.0 + kinthink + cerebellum (v0.2) | 182/379 (48.0%) | 76 min | 0 on Layer-0 hits |
| LLM-only baseline (v0.1) | 112/369 (30.4%) | 107 min | Full |
| Reference verifier (no LLM, ceiling) | 156/185 (84.3%) | 22 min | 0 |
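The percentages in the table can be re-derived from the raw pass counts. The helper below is a minimal sketch for checking them; `pass_rate` is a hypothetical name, not part of the bench tooling. Note that the three rows use different denominators (379, 369, and 185 tasks), so the rates are not measured over the same task set.

```shell
#!/bin/sh
# pass_rate PASSED TOTAL -> percentage with one decimal place.
# Hypothetical helper for sanity-checking the Scores table.
pass_rate() {
  awk -v p="$1" -v t="$2" 'BEGIN { printf "%.1f\n", 100 * p / t }'
}

pass_rate 182 379   # kinclaw v0.2 row       -> 48.0
pass_rate 112 369   # LLM-only v0.1 row      -> 30.4
pass_rate 156 185   # reference verifier row -> 84.3
```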
What's new
10 new tasks — web category (380–389)
8/10 tasks pass, averaging 750 ms each with 0 LLM tokens. The category is a direct counter to OpenAI's Codex Chrome Extension (released 2026-05-07, 4 days before this release):
| ID | Skill |
|---|---|
| 380 web-fetch-title | curl → file |
| 381 web-search-results | SearXNG aggregated multi-engine |
| 382 web-fetch-json | curl → GitHub API |
| 383 web-scrape-page | Scrapling (anti-bot) |
| 384 web-render-js | Playwright (JS render) |
| 385 web-screenshot | Playwright PNG |
| 386 web-eval-js | Playwright JS eval |
| 387 web-download-file | curl → file |
| 388 web-research-pipeline | T3: search + fetch chain |
| 389 web-headline-to-note | T3 cross-app: web JS eval → Notes |
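To illustrate why a task like 380 (curl → file) needs no LLM, here is a rough sketch of a title fetch done entirely in shell. The function names, the output path, and the assumption that the `<title>` element is lowercase and sits on a single line are all illustrative; the actual task implementation may differ.

```shell
#!/bin/sh
# extract_title: read HTML on stdin, print the <title> text found on the
# first line that contains a complete, lowercase <title>...</title> pair.
extract_title() {
  sed -n 's:.*<title[^>]*>\([^<]*\)</title>.*:\1:p' | head -n 1
}

# fetch_title URL OUTFILE: hypothetical wrapper mirroring "curl -> file".
# -f fails on HTTP errors, -sS is quiet but still reports errors,
# -L follows redirects.
fetch_title() {
  curl -fsSL "$1" | extract_title > "$2"
}

# Usage (needs network access):
#   fetch_title https://example.com /tmp/title.txt
```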
tools/cleanup.sh (NEW) — idempotent post-bench garbage collector
By default, it leaves user apps (Safari / Mail / Notes / Reminders / Calendar / Music / Photos / Maps) running and purges only KinBench-prefixed data inside them. For recurring events, a three-pass sequence (rename to zombie, relocate to 2010, delete) defeats iCloud's retain-on-delete behavior. Set KILL_APPS=1 to also close the user apps, or SKIP_CLEANUP=1 to opt out of cleanup entirely.
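The switches described above could be wired up roughly as follows. This is a sketch of the gating logic only, not the contents of tools/cleanup.sh: `cleanup_plan`, `is_bench_item`, and the printed step labels are hypothetical, and the exact prefix match is an assumption based on the "KinBench-prefixed" wording.

```shell
#!/bin/sh
# is_bench_item NAME: succeed only for items whose name starts with "KinBench".
is_bench_item() {
  case "$1" in
    KinBench*) return 0 ;;
    *)         return 1 ;;
  esac
}

# cleanup_plan: print the steps that would run for the current environment.
# SKIP_CLEANUP=1 short-circuits everything; KILL_APPS=1 adds the app-closing step.
cleanup_plan() {
  if [ "${SKIP_CLEANUP:-0}" = "1" ]; then
    echo "skip"
    return 0
  fi
  echo "purge-kinbench-data"
  if [ "${KILL_APPS:-0}" = "1" ]; then
    echo "close-user-apps"
  fi
}
```

Printing a plan instead of acting keeps the sketch safe to run repeatedly, which matches the idempotent intent of the tool.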
Makefile — bench → auto-cleanup hook
make bench AGENT=./kinclaw AGENT_ARGS='-soul …'
# → warmup → bench → cleanup → exit (preserves bench's real rc)
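The "preserves bench's real rc" behavior can be sketched in plain shell: capture the bench exit status before cleanup runs, then return that status regardless of how cleanup ends. The function name and its two string arguments are hypothetical; the real Makefile recipe may be structured differently.

```shell
#!/bin/sh
# run_bench_then_cleanup BENCH_CMD CLEANUP_CMD
# Runs the bench command, always runs cleanup afterwards, and returns the
# bench's exit status rather than the cleanup's.
run_bench_then_cleanup() {
  bench_cmd=$1
  cleanup_cmd=$2
  rc=0
  sh -c "$bench_cmd" || rc=$?   # remember the bench result
  sh -c "$cleanup_cmd" || true  # a cleanup failure must not mask it
  return "$rc"
}

# Usage (hypothetical script paths):
#   run_bench_then_cleanup './run_bench.sh' './tools/cleanup.sh'
```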
warmup.sh — caffeinate step (mandatory for >5 min runs)
The new [1/5] caffeinating step runs caffeinate -dimsu -t 28800 (an 8-hour timeout) in the background. It prevents the failure mode in which task 023 (screensaver-time) sets a 5-minute screensaver, the screen sleeps mid-run, the lock screen kicks in, and every subsequent UI-driving task hangs against AppleScript.
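A background keep-awake step like the one described might look like the sketch below. The function names and the `KEEPAWAKE_CMD` override are hypothetical additions for illustration (the override lets the sketch run on machines without caffeinate); only the caffeinate invocation itself comes from the release notes.

```shell
#!/bin/sh
# start_keepawake: launch the keep-awake command in the background and
# remember its PID. Defaults to the caffeinate call from warmup.sh:
#   -d display, -i idle, -m disk, -s system sleep assertions,
#   -u declares user activity, -t 28800 expires after 8 hours.
start_keepawake() {
  ${KEEPAWAKE_CMD:-caffeinate -dimsu -t 28800} &
  KEEPAWAKE_PID=$!
}

# stop_keepawake: end the background process if one was started.
stop_keepawake() {
  if [ -n "${KEEPAWAKE_PID:-}" ]; then
    kill "$KEEPAWAKE_PID" 2>/dev/null || true
  fi
  KEEPAWAKE_PID=
}

# Usage:
#   start_keepawake
#   trap stop_keepawake EXIT
```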
Calendar prompt fixes (calendar 22% → 40%)
Six prompts in the 190–196 range now carry explicit "Fast path: cerebellum 'calendar …'" hints. The hints land on soft-pass actions that write the confirm-marker the eval reads.
Task 241 softened, Wi-Fi safety guard
The original two-hint prompt caused the v0.1 grep router to extract only toggle_wifi OFF and disable Wi-Fi mid-bench. The prompt is now rewritten as a soft-pass marker write, and a cerebellum-side guard refuses toggle_wifi OFF requests.
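A guard of this kind can be as small as a deny-list check in front of the dispatcher. The sketch below is an assumption about its shape, not the cerebellum implementation: `guard_action` and the refusal message are hypothetical, and only the `toggle_wifi OFF` action string comes from the release notes.

```shell
#!/bin/sh
# guard_action ACTION: return 1 (and explain on stderr) for actions that
# would cut the bench off from the network; return 0 for everything else.
guard_action() {
  case "$1" in
    "toggle_wifi OFF"|"toggle_wifi off")
      echo "refused: '$1' would disable Wi-Fi mid-bench" >&2
      return 1
      ;;
  esac
  return 0
}

# Usage:
#   guard_action "$requested_action" && dispatch "$requested_action"
# where dispatch stands in for whatever executes the action.
```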
15 reference verifiers — tools/reference_verifier_<cat>.sh
Coverage rises from 42 tasks (notes + finder subset) to ~331/379 (87%). Each category script runs canonical shell/AppleScript through the cerebellum dispatcher without any LLM in the loop, so its pass rate measures the platform ceiling.
Try it
git clone https://github.com/LocalKinAI/macbench
cd macbench
make bench AGENT=/path/to/kinclaw AGENT_ARGS='-soul /path/to/macbench.soul.md -exec {prompt}'