Congrats on the 72.7% WebArena score — meka looks like a strong browsing agent.
Since ClawBench is structured as a live-production-site companion to WebArena (same browser-agent format, but 144 real websites instead of sandboxed ones), it might be a meaningful next data point. What we've seen so far: agents that score high on WebArena have been dropping significantly on ClawBench because of cookie banners, auth flows, dynamic JS, and other site-specific friction that doesn't show up on static sandbox sites. There are currently 7 frontier models on the leaderboard, with the top at 33.3%, so there is a lot of headroom for a WebArena-tier agent to differentiate itself.
Benchmark details:
If you want to submit a run, setup instructions are at claw-bench.com and the repo. Would be a cool comparison to your WebArena number.
Disclosure: I am affiliated with the ClawBench project.