Have you considered running meka on ClawBench (live-site browser eval)? #89

Congrats on the 72.7% WebArena score — meka looks like a strong browsing agent.

ClawBench is structured as a live-production-site companion to WebArena: same browser-agent format, but 144 real websites instead of sandboxed ones, so it might be a meaningful next data point. What we've seen so far is that agents that score high on WebArena drop significantly on ClawBench because of cookie banners, auth flows, dynamic JS, and site-specific friction that don't show up on sandboxed sites. There are currently 7 frontier models on the leaderboard, with the top score at 33.3%, so there is a lot of headroom for a WebArena-tier agent to differentiate itself.

Benchmark details:
- 153 everyday web tasks on 144 live production websites across 15 life categories
- Submission-interception layer (Chrome extension + CDP) blocks only the final write request, so agents complete end-to-end flows on live sites without real-world side effects (no real orders, emails, applications)
- Paper: https://arxiv.org/abs/2604.08523
- Repo: https://github.com/reacher-z/ClawBench
- Dataset: https://huggingface.co/datasets/NAIL-Group/ClawBench
- Site: https://claw-bench.com

If you want to submit a run, setup instructions are at claw-bench.com and in the repo. It would be an interesting comparison with your WebArena number.

Disclosure: I am affiliated with the ClawBench project.
