Diagnostic of physics literacy in frontier LLMs. Evaluates induction, formulation, and prediction inside unfamiliar physics frameworks, using dual-judge scoring checked for inter-rater reliability, with human-audit resolution.
CTA-Bench: research benchmark and toolkit for studying how well systems turn problem descriptions and reference code into Lean 4 proof obligations, and how faithful those obligations are to the intended algorithm.