v0.2.0 — paper #11 companion: web category (10 tasks) + auto-cleanup + 15 reference verifiers

@localkin released this 12 May 03:55
· 1 commit to main since this release

Companion release to paper #11, Grep-Routed Agents: Bypassing the LLM Tax on Computer-Use Tasks. Adds 10 new tasks (bench total 369 → 379), the auto-cleanup tool, the caffeinate warmup fix, and 15 per-category reference verifiers.

Scores

| Configuration | Pass | Time | Tokens |
| --- | --- | --- | --- |
| kinclaw v1.16.0 + kinthink + cerebellum (v0.2) | 182/379 (48.0%) | 76 min | 0 on Layer-0 hits |
| LLM-only baseline (v0.1) | 112/369 (30.4%) | 107 min | Full |
| Reference verifier (no LLM, ceiling) | 156/185 (84.3%) | 22 min | 0 |
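
The percentages in the Pass column follow directly from the pass counts. A minimal sketch of that arithmetic, where `pass_rate` is a hypothetical helper name and not part of the repository:

```shell
# pass_rate PASSED TOTAL -> percentage rounded to one decimal place.
# Hypothetical helper for illustration; awk does the floating-point division.
pass_rate() {
    awk -v p="$1" -v t="$2" 'BEGIN { printf "%.1f\n", 100 * p / t }'
}

pass_rate 182 379   # prints 48.0 (v0.2 configuration)
pass_rate 112 369   # prints 30.4 (LLM-only baseline)
pass_rate 156 185   # prints 84.3 (reference verifier)
```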

What's new

10 new tasks — web category (380–389)

8/10 PASS at 750 ms avg / 0 LLM tokens — the direct counter to OpenAI's Codex Chrome Extension (released 2026-05-07, 4 days before this release):

| ID | Task | Skill |
| --- | --- | --- |
| 380 | web-fetch-title | curl → file |
| 381 | web-search-results | SearXNG aggregated multi-engine |
| 382 | web-fetch-json | curl → GitHub API |
| 383 | web-scrape-page | Scrapling (anti-bot) |
| 384 | web-render-js | Playwright (JS render) |
| 385 | web-screenshot | Playwright PNG |
| 386 | web-eval-js | Playwright JS eval |
| 387 | web-download-file | curl → file |
| 388 | web-research-pipeline | T3: search + fetch chain |
| 389 | web-headline-to-note | T3 cross-app: web JS eval → Notes |
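
To give a feel for why the curl-based tasks need no LLM tokens, here is a rough sketch of what a task like 380 (web-fetch-title) could reduce to. The function name `extract_title`, the URL, and the output file are assumptions for illustration; the actual task definition and evaluator are not shown in these notes.

```shell
# extract_title: read HTML on stdin, print the contents of the first <title> tag.
# Hypothetical helper; assumes the title sits on a single line and the tag is lowercase.
extract_title() {
    sed -n 's/.*<title[^>]*>\(.*\)<\/title>.*/\1/p' | head -n 1
}

# Example use (requires network access; URL and file name are placeholders):
#   curl -fsSL https://example.com | extract_title > title.txt
```

A regex is enough for a well-formed single-line title; pages that build the title in JavaScript would need the Playwright-based tasks instead.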

tools/cleanup.sh (NEW) — idempotent post-bench garbage collector

Default: leaves user apps (Safari / Mail / Notes / Reminders / Calendar / Music / Photos / Maps) running and purges only KinBench-prefixed data inside them. A three-pass sequence (rename to zombie, relocate to 2010, delete) defeats iCloud's retain-on-delete behavior for recurring events. KILL_APPS=1 closes user apps; SKIP_CLEANUP=1 opts out.
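
The two environment switches can be read as a small decision table. The sketch below is not taken from tools/cleanup.sh; `cleanup_mode` is a hypothetical function, and giving SKIP_CLEANUP precedence over KILL_APPS is an assumption.

```shell
# cleanup_mode: print what a cleanup run would do, based on the environment.
# Hypothetical illustration; prints one of: skip, purge+kill, purge.
cleanup_mode() {
    if [ "${SKIP_CLEANUP:-0}" = "1" ]; then
        echo "skip"          # opt out entirely (assumed to win over KILL_APPS)
    elif [ "${KILL_APPS:-0}" = "1" ]; then
        echo "purge+kill"    # purge KinBench-prefixed data, then close user apps
    else
        echo "purge"         # default: purge data, leave apps running
    fi
}
```

For example, `KILL_APPS=1 make bench …` would select the `purge+kill` path under these assumptions.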

Makefile — bench → auto-cleanup hook

make bench AGENT=./kinclaw AGENT_ARGS='-soul …'
# → warmup → bench → cleanup → exit (preserves bench's real rc)
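
Preserving the bench's real exit code means capturing it before cleanup runs and returning it afterwards, whatever cleanup does. A minimal sketch of that pattern in plain shell; `run_then_cleanup` is a hypothetical name, and the actual Makefile recipe may be structured differently.

```shell
# run_then_cleanup BENCH_CMD CLEANUP_CMD
# Runs the bench, always runs cleanup, and returns the bench's exit code.
# Hypothetical helper for illustration only.
run_then_cleanup() {
    bench_cmd=$1
    cleanup_cmd=$2
    rc=0
    sh -c "$bench_cmd" || rc=$?          # remember the bench result first
    sh -c "$cleanup_cmd" || echo "cleanup failed (ignored)" >&2
    return "$rc"                         # a cleanup failure never masks the bench rc
}
```

The point of the ordering is that a CI job still sees a failing bench as failing, even though cleanup ran last and succeeded.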

warmup.sh — caffeinate step (mandatory for >5 min runs)

The new [1/5] caffeinating step runs caffeinate -dimsu -t 28800 in the background. This catches the failure mode where task 023 (screensaver-time) sets a 5-min screensaver, the screen sleeps mid-run, the lock screen kicks in, and every subsequent UI-driving task hangs against AppleScript.
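
For readers unfamiliar with caffeinate: `-d`, `-i`, `-m` and `-s` prevent display, idle, disk and system sleep respectively, `-u` declares the user active, and `-t` is a timeout in seconds (28800 s is 8 hours). The sketch below shows one way such a step could be written; `keep_awake`, `AWAKE_SECONDS` and the EXIT trap are assumptions, not the contents of warmup.sh.

```shell
AWAKE_SECONDS=$((8 * 60 * 60))   # 28800 s, the -t value quoted above

# keep_awake: start caffeinate in the background and stop it when the script exits.
# Hypothetical helper; on systems without caffeinate it warns and carries on.
keep_awake() {
    if ! command -v caffeinate >/dev/null 2>&1; then
        echo "caffeinate not found (not macOS?); continuing without it" >&2
        return 0
    fi
    caffeinate -dimsu -t "$AWAKE_SECONDS" &
    CAFFEINATE_PID=$!
    # Release the sleep assertion when the calling script exits, even on failure.
    trap 'kill "$CAFFEINATE_PID" 2>/dev/null' EXIT
}

# Call keep_awake once, before the first UI-driving task starts.
```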

Calendar prompt fixes (calendar 22% → 40%)

Six prompts (190–196) now carry explicit Fast path: cerebellum 'calendar …' hints. The hints land on soft-pass actions that write the confirm-marker the eval reads.

Task 241 softened, Wi-Fi safety guard

The original two-hint prompt caused the v0.1 grep router to extract only toggle_wifi OFF, which disabled Wi-Fi mid-bench. The prompt is rewritten to a soft-pass marker write, and a cerebellum-side guard now refuses toggle_wifi OFF requests.
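
The guard amounts to a deny rule checked before dispatch. A minimal sketch of the idea, assuming the action name and state arrive as two separate arguments; `guard_action` is a hypothetical name and the real cerebellum interface is not shown in these notes.

```shell
# guard_action ACTION STATE
# Returns 0 if the action may run, 1 if it is refused.
# Hypothetical illustration of a deny rule for toggle_wifi OFF.
guard_action() {
    action=$1
    state=$2
    if [ "$action" = "toggle_wifi" ] && [ "$state" = "OFF" ]; then
        echo "refused: toggle_wifi OFF would cut network access mid-bench" >&2
        return 1
    fi
    return 0
}
```

A dispatcher would call it as `guard_action toggle_wifi OFF || exit 1` and skip the action on refusal.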

15 reference verifiers — tools/reference_verifier_<cat>.sh

Coverage raised from 42 tasks (notes + finder subset) to ~331/379 (87%). Each category script runs canonical shell/AppleScript via the cerebellum dispatcher without any LLM in the loop, so it measures the platform ceiling.
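
Because the verifiers follow one naming pattern, they can be run in a loop. The sketch below assumes each script exits 0 when its category passes; that convention, the use of bash, and the `run_verifiers` name are all assumptions, since the scripts may report per-task results instead.

```shell
# run_verifiers DIR
# Runs every reference_verifier_<cat>.sh in DIR and prints "passed/total".
# Hypothetical helper; assumes exit status 0 means the category passed.
run_verifiers() {
    dir=$1
    ok=0
    total=0
    for script in "$dir"/reference_verifier_*.sh; do
        [ -f "$script" ] || continue        # the glob matched nothing
        total=$((total + 1))
        if bash "$script" >/dev/null 2>&1; then
            ok=$((ok + 1))
        else
            echo "FAILED: $(basename "$script")" >&2
        fi
    done
    echo "$ok/$total"
}

# Example: run_verifiers tools
```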

Try it

git clone https://github.com/LocalKinAI/macbench
cd macbench
make bench AGENT=/path/to/kinclaw AGENT_ARGS='-soul /path/to/macbench.soul.md -exec {prompt}'

Read the paper