An open taxonomy and scoring framework for evaluating AI agent sandboxes: 7 defense layers, 7 threat categories, 3 evaluation dimensions, 27 "sandboxes" scored.
Updated Jun 10, 2026 - Go
A Streamlit web app that uses a Groq-powered LLM (Llama 3) as an impartial judge to evaluate and compare two model outputs. Supports custom criteria and presets such as creativity and brand tone, and returns structured scores, explanations, and a winner. Built end-to-end with Python, the Groq API, and Streamlit.
Multi-axis scoring framework that scores programmable genome editors on eight orthogonal axes and combines them into a single PenScore to guide experimental design and benchmarking.