Short, practical research memos on agentic safety, safeguards, and evaluation failures.
The goal is not to publish another generic AI safety blog. Each memo is written to be useful to people building or reviewing agentic systems: what fails, why the usual benchmark misses it, and what kind of engineering gate would catch it before release.
Agent safety work often gets split into two weak forms:
- high-level essays that do not tell an engineer what to test
- benchmark reports that do not explain the underlying failure mechanism
These memos sit between the two. They turn a safety argument into a concrete design pressure for the rest of the portfolio: stress tests, regression suites, release gates, incident replay, and agent definition scanners.
- Why Single-Turn Safety Benchmarks Systematically Underestimate Agentic Risk: a memo on why static, one-shot safety checks miss slow-burn failures that emerge across multi-turn agent trajectories.
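As a rough illustration of the slow-burn pattern that memo describes, here is a minimal sketch. It is not taken from the memo: it assumes each agent turn has already been given a hypothetical risk score in [0, 1], and the scores, thresholds, and function names are all illustrative assumptions. Every turn passes when judged in isolation, while a check over the running total flags the same trajectory.

```python
from itertools import accumulate

# Hypothetical per-turn risk scores for one agent trajectory (illustrative values).
TURN_SCORES = [0.10, 0.15, 0.20, 0.25, 0.30]

PER_TURN_LIMIT = 0.5      # assumed threshold for a single-turn check
TRAJECTORY_LIMIT = 0.8    # assumed budget for cumulative risk across turns


def single_turn_check(scores, limit=PER_TURN_LIMIT):
    """Pass if every turn, judged in isolation, stays under the limit."""
    return all(score < limit for score in scores)


def trajectory_check(scores, limit=TRAJECTORY_LIMIT):
    """Pass if the running total of risk never reaches the limit."""
    return all(total < limit for total in accumulate(scores))


def first_violation(scores, limit=TRAJECTORY_LIMIT):
    """Return the 1-based turn where cumulative risk reaches the limit, or None."""
    for turn, total in enumerate(accumulate(scores), start=1):
        if total >= limit:
            return turn
    return None
```

Summing scores is only one possible aggregation, chosen here for simplicity. The point is that the unit under test is the trajectory, not the individual turn.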
Read a memo, then follow the implementation path:
- Use when-rlhf-fails-quietly to name the failure mode.
- Use agentic-misuse-benchmark to turn it into a measurable scenario.
- Use safety-harness to stress-test, pin regressions, and gate releases.
- Use agentguard when the risk lives in agent definitions, tool grants, hooks, or commands.
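The last two steps of that path (pinning regressions and gating releases) can be sketched as follows. This is a hypothetical illustration, not the API of safety-harness or any other project listed here: `ScenarioResult`, `release_gate`, the scenario IDs, and the pass-rate threshold are all assumed names and values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScenarioResult:
    scenario_id: str   # hypothetical identifier for a benchmark scenario
    passed: bool       # whether the agent behaved safely in this run


def release_gate(results, pinned_ids, min_pass_rate=0.95):
    """Return (allowed, reasons) for a candidate release.

    Blocks when any pinned regression scenario fails or was not run,
    or when the overall pass rate is below min_pass_rate.
    """
    reasons = []
    by_id = {r.scenario_id: r for r in results}

    # Pinned scenarios are previously observed failures that must never recur.
    for sid in sorted(pinned_ids):
        result = by_id.get(sid)
        if result is None:
            reasons.append(f"pinned scenario not run: {sid}")
        elif not result.passed:
            reasons.append(f"pinned regression failed: {sid}")

    pass_rate = sum(r.passed for r in results) / len(results) if results else 0.0
    if pass_rate < min_pass_rate:
        reasons.append(f"pass rate {pass_rate:.2f} below {min_pass_rate:.2f}")

    return (not reasons, reasons)


# Example run with made-up scenario IDs.
results = [
    ScenarioResult("slow-burn-exfiltration", passed=False),
    ScenarioResult("benign-file-cleanup", passed=True),
]
allowed, reasons = release_gate(
    results, pinned_ids={"slow-burn-exfiltration"}, min_pass_rate=0.5
)
```

In this sketch a single failed pinned scenario blocks the release even though the aggregate pass rate meets the threshold, which is the behavior a regression pin is meant to guarantee.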
- when-rlhf-fails-quietly — Evaluating silent alignment failures
- agentic-misuse-benchmark — Multi-turn misuse detection benchmark
- safety-harness — Closed-loop runtime safety harness: stress-testing, regression suite, release gate, simulator, and incident lab in one system