Have you considered running meka on ClawBench (live-site browser eval)? #89

@reacher-z

Congrats on the 72.7% WebArena score — meka looks like a strong browsing agent.

Since ClawBench is structured as a live-production-site companion to WebArena (same browser-agent format, but 144 real websites instead of sandboxed ones), it might be a meaningful next data point. What we've seen so far: agents that score high on WebArena have been dropping significantly on ClawBench due to cookie banners, auth flows, dynamic JS, and other site-specific friction that doesn't show up in static sandbox sites. There are currently 7 frontier models on the leaderboard, with the top at 33.3%, so there is a lot of headroom for a WebArena-tier agent to differentiate itself.

Benchmark details:

If you want to submit a run, setup instructions are at claw-bench.com and in the repo. It would be a cool comparison to your WebArena number.

Disclosure: I am affiliated with the ClawBench project.
