Mimic

The pytest for AI agents. Record, replay, assert, and diff agent behavior.

Mimic is an open-source library that lets you record an AI agent's behavior, replay it deterministically, assert properties about it, and diff runs across versions. It's the missing testing layer for the agent era.

from openai import OpenAI

from mimic import Mimic, assert_that, replay
from mimic.integrations.openai import tracked_completion

mimic = Mimic()
client = OpenAI()

@mimic.record("customer-support-agent", model="gpt-4o")
def answer(question: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": question}],
    )
    tracked_completion(resp)  # auto-captures tokens + cost
    return resp.choices[0].message.content

Verified performance

| Scenario | Record mode | Replay mode | Savings |
|---|---|---|---|
| 5-test multi-step agent suite | 360 ms | 50 ms | 7× faster |
| 1000 CI runs of the same suite | ~$2 in LLM cost | $0 | 100% |

Run mimic benchmark --runs 1000 on your own recordings to see your numbers.

Why Mimic?

Every team building AI agents hits the same wall:

"I changed a prompt. Did I break anything?" — You don't know.
"I switched from GPT-4 to Claude. Is it 2x more expensive?" — You don't know.
"Did this agent ever call delete_file in production?" — You don't know.
"Why did the agent fail on Tuesday at 3pm?" — You don't know.

Mimic turns those unknowns into testable, replayable, diffable artifacts. Think Sentry recordings + pytest assertions + git blame, purpose-built for LLM agents.

Features

✅ Record any callable — sync or async, LLM calls, tool use, multi-step agents
✅ Replay runs offline with zero API cost, byte-for-byte deterministic
✅ Assert behavioral properties: cost, latency, tool usage, output content
✅ Diff two runs to see exactly what changed
✅ Auto-track LLM costs for OpenAI, Anthropic, Gemini (zero-config)
✅ Multi-step agents with per-step recording, cost, and metadata
✅ Privacy mode (capture_args=False, capture_return=False)
✅ Storage-agnostic — filesystem by default, pluggable for S3/Postgres
✅ Zero LLM vendor lock-in — works with any model
✅ Beautiful CLI — mimic run / list / show / diff / report / benchmark
✅ CI-ready — GitHub Actions template + pre-commit hook included

Install

pip install mimic-recording

Or with optional integrations:

pip install "mimic-recording[openai]"      # quotes keep shells such as zsh from expanding the brackets
pip install "mimic-recording[anthropic]"

Quick start

mkdir my-agent && cd my-agent
mimic init

This creates a project skeleton:

my-agent/
├── mimic.yaml           # Project config
├── tests/
│   └── test_agent.py    # Your recorded tests
└── .mimic/              # Recorded runs (gitignored by default)

Edit tests/test_agent.py:

from mimic import Mimic, assert_that, replay

mimic = Mimic()

@mimic.record("my-agent", model="gpt-4o")
def answer(question: str) -> str:
    # ... your LLM call here ...
    return "..."

def test_agent():
    answer("hello")
    recorded = replay("my-agent")
    assert_that(recorded).finished_without_errors()
    assert_that(recorded).cost_less_than(usd=0.05)
    assert_that(recorded).did_not_call_tool("delete_database")

Run it:

mimic run tests/                 # records + runs (costs $$)
MIMIC_MODE=replay mimic run tests/  # replays only (free, deterministic)

Multi-step agents

For ReAct, multi-agent, or any agent with multiple LLM/tool calls, record each step. In the example, llm and web_search stand in for your own LLM client and search helper:

@mimic.record("research-agent")
async def research(question: str) -> str:
    # Step 1: plan
    with mimic.step("plan", model="gpt-4o-mini") as s:
        resp = await llm.complete(model="gpt-4o-mini", messages=[...])
        tracked_completion(resp)
        s.metadata["plan_steps"] = 3

    # Step 2: search
    with mimic.step("search") as s:
        results = await web_search(question)
        s.metadata["result_count"] = len(results)

    # Step 3: synthesize
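    with mimic.step("synthesize", model="gpt-4o") as s:
        resp = await llm.complete(model="gpt-4o", messages=[...])
        tracked_completion(resp)
        # Assumes an OpenAI-style response object; adapt to your client.
        summary = resp.choices[0].message.content

    return summary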

Assertions

The full set of assertions (each returns self, so calls can be chained):

assert_that(run).finished_without_errors()
assert_that(run).had_error()                   # inverse
assert_that(run).cost_less_than(usd=0.05)
assert_that(run).completed_under(ms=2000)
assert_that(run).output_contains("substring")
assert_that(run).output_matches(r"regex")
assert_that(run).output_equals(value)
assert_that(run).called_tool("search")
assert_that(run).did_not_call_tool("delete_database")
assert_that(run).called_tools(["search", "synthesize"])
assert_that(run).had_exactly(3)
assert_that(run).had_at_least(2)
assert_that(run).used_model("gpt-4o")
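
The chaining behavior can be sketched in a few lines. This is an illustrative stand-in, not Mimic's actual implementation: the `Run` dataclass, its field names, and the `RunAssertions` class are hypothetical, and only the three method names are taken from the list above.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Run:
    """Hypothetical stand-in for a recorded run; field names are assumptions."""
    cost_usd: float = 0.0
    error: Optional[str] = None
    tools_called: List[str] = field(default_factory=list)


class RunAssertions:
    """Each check raises AssertionError on failure and returns self on success."""

    def __init__(self, run: Run) -> None:
        self.run = run

    def finished_without_errors(self) -> "RunAssertions":
        if self.run.error is not None:
            raise AssertionError(f"run failed: {self.run.error}")
        return self

    def cost_less_than(self, usd: float) -> "RunAssertions":
        if not self.run.cost_usd < usd:
            raise AssertionError(f"cost {self.run.cost_usd} is not below {usd}")
        return self

    def did_not_call_tool(self, name: str) -> "RunAssertions":
        if name in self.run.tools_called:
            raise AssertionError(f"unexpected tool call: {name}")
        return self


def assert_that(run: Run) -> RunAssertions:
    return RunAssertions(run)


run = Run(cost_usd=0.01, tools_called=["search"])

# Because every check returns self, several checks fit in one expression.
checked = (
    assert_that(run)
    .finished_without_errors()
    .cost_less_than(usd=0.05)
    .did_not_call_tool("delete_database")
)
```

Returning `self` from each check is what allows a single `assert_that(...)` call to carry several assertions; the first failing check raises and stops the chain.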

How it works

Mimic sits outside your agent code, watching the inputs and outputs of any function you decorate. The first time the function runs, Mimic records the full execution into a content-addressable store. Subsequent test runs use the stored record instead of calling the LLM, making them fast, free, and deterministic.
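
The record-then-replay idea can be illustrated with a small decorator. This is a minimal sketch under stated assumptions, not Mimic's internals: the in-memory `BLOBS` and `INDEX` dictionaries, the `call_key` helper, and the `mode` argument are all hypothetical, and outputs are assumed to be JSON-serializable.

```python
import functools
import hashlib
import json
from typing import Any, Callable, Dict

# Hypothetical in-memory store. Outputs are addressed by the hash of their
# own content; a separate index maps each call to the output it produced.
BLOBS: Dict[str, str] = {}  # content hash -> serialized output
INDEX: Dict[str, str] = {}  # call key -> content hash


def _sha256(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def call_key(name: str, args: tuple, kwargs: dict) -> str:
    """Derive a stable key from the recording name and the call's inputs."""
    payload = json.dumps(
        {"name": name, "args": args, "kwargs": kwargs},
        sort_keys=True,
        default=repr,
    )
    return _sha256(payload)


def record(name: str, mode: str = "record") -> Callable[[Callable[..., Any]], Callable[..., Any]]:
    """Store outputs in "record" mode; serve stored outputs in "replay" mode."""

    def decorator(fn: Callable[..., Any]) -> Callable[..., Any]:
        @functools.wraps(fn)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            key = call_key(name, args, kwargs)
            if mode == "replay":
                if key not in INDEX:
                    raise KeyError(f"no recording for {name!r} with these inputs")
                # The wrapped function is never called in replay mode.
                return json.loads(BLOBS[INDEX[key]])
            result = fn(*args, **kwargs)
            blob = json.dumps(result, sort_keys=True)
            digest = _sha256(blob)
            BLOBS[digest] = blob
            INDEX[key] = digest
            return result

        return wrapper

    return decorator


calls = {"count": 0}


def fake_llm(question: str) -> str:
    """Stands in for a paid, non-deterministic LLM call."""
    calls["count"] += 1
    return f"answer to {question}"


recorded = record("demo-agent", mode="record")(fake_llm)
replayed = record("demo-agent", mode="replay")(fake_llm)

first = recorded("hello")   # calls fake_llm and stores the result
second = replayed("hello")  # served from the store; fake_llm is not called again
```

In this sketch, identical outputs are stored once because the blob key is the hash of the serialized output, which is one common reading of "content-addressable".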

For multi-step agents, Mimic records each step separately, so you can replay just the broken step without re-running the whole agent.
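
Per-step recording can be pictured as a context manager that appends one entry per step. The sketch below is an assumption about the general shape, not Mimic's code: `StepRecorder`, `Step`, and their fields are hypothetical names chosen to mirror the `mimic.step(...)` usage shown earlier.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Any, Dict, Iterator, List, Optional


@dataclass
class Step:
    """Hypothetical per-step record; field names are assumptions."""
    name: str
    model: Optional[str] = None
    metadata: Dict[str, Any] = field(default_factory=dict)
    duration_ms: float = 0.0
    error: Optional[str] = None


class StepRecorder:
    def __init__(self) -> None:
        self.steps: List[Step] = []

    @contextmanager
    def step(self, name: str, model: Optional[str] = None) -> Iterator[Step]:
        current = Step(name=name, model=model)
        start = time.perf_counter()
        try:
            yield current
        except Exception as exc:
            # Keep the failure on the step, then let it propagate.
            current.error = repr(exc)
            raise
        finally:
            # Runs on success and on failure, so every step is recorded.
            current.duration_ms = (time.perf_counter() - start) * 1000.0
            self.steps.append(current)


recorder = StepRecorder()

with recorder.step("plan", model="gpt-4o-mini") as s:
    s.metadata["plan_steps"] = 3

with recorder.step("search") as s:
    s.metadata["result_count"] = 2

step_names = [step.name for step in recorder.steps]
```

Because each step is stored as its own entry, a failing step can be located by name and inspected on its own, without re-running the steps before it.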

CI integration

Drop the included .github/workflows/test.yml into your repo. It runs your test suite in replay mode (no LLM cost) and validates that no cost was incurred.

Manual re-recording is a separate job, triggered on workflow_dispatch or a schedule.
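
As an extra safeguard, a test suite can refuse to start in CI unless replay mode is set. The helper below is a hypothetical sketch, not part of Mimic: it relies only on the `MIMIC_MODE=replay` variable shown in the quick start and on the `CI=true` variable that GitHub Actions sets. It checks the mode only and does not measure cost.

```python
import os
from typing import Mapping


def require_replay_mode(env: Mapping[str, str] = os.environ) -> None:
    """Raise if running in CI without MIMIC_MODE=replay (hypothetical guard)."""
    in_ci = env.get("CI", "").lower() == "true"
    mode = env.get("MIMIC_MODE", "")
    if in_ci and mode != "replay":
        raise RuntimeError(
            "CI run without MIMIC_MODE=replay; refusing to make paid LLM calls"
        )


# Passes: CI with replay mode set.
require_replay_mode({"CI": "true", "MIMIC_MODE": "replay"})

# Passes: local development, where recording is allowed.
require_replay_mode({})
```

One possible place to call `require_replay_mode()` is at the top of a `conftest.py`, so a misconfigured CI job fails before any test runs.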

The recording format

Mimic recordings are plain JSON conforming to a documented schema — see RECORDING_FORMAT.md. The format is vendor-neutral: you can build readers, web UIs, or analysis tools without depending on the Mimic library.
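
Since recordings are plain JSON, a reader needs nothing beyond the standard library. The sketch below is illustrative only: the `name` and `cost_usd` field names and the one-file-per-run layout are assumptions made for the example, so check RECORDING_FORMAT.md for the real schema before relying on them.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict


def total_cost_by_name(recordings_dir: Path) -> Dict[str, float]:
    """Sum a (hypothetical) cost_usd field per recording name."""
    totals: Dict[str, float] = {}
    for path in sorted(recordings_dir.glob("*.json")):
        with path.open(encoding="utf-8") as fh:
            recording = json.load(fh)
        name = recording.get("name", "unknown")
        totals[name] = totals.get(name, 0.0) + float(recording.get("cost_usd", 0.0))
    return totals


# Demo with two made-up recordings written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "run-1.json").write_text(
        json.dumps({"name": "my-agent", "cost_usd": 0.25}), encoding="utf-8"
    )
    (root / "run-2.json").write_text(
        json.dumps({"name": "my-agent", "cost_usd": 0.5}), encoding="utf-8"
    )
    totals = total_cost_by_name(root)
```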

The $100M thesis

Mimic sits at the intersection of three exploding markets:

AI agent development: 10M+ developers will build agents by 2027.
AI observability: already a $2B+ market, with established vendors such as LangSmith, Helicone, and Langfuse.
AI safety & compliance: every enterprise deploying agents needs guardrails, audit trails, and replay.

The land-and-expand model is proven (Sentry, Supabase, GitLab, Vercel, PostHog): open source core → community growth → enterprise tier with self-hosted, SSO, audit logs, and SOC2.

See BUSINESS_PLAN.md for the full strategy.

Roadmap

v0.1 — Record/replay/assert core
v0.2 — Async + multi-step + OpenAI/Anthropic cost tracking
v0.3 — Web UI for browsing recorded runs
v0.4 — TypeScript SDK
v0.5 — Auto-generated regression tests from production traces
v0.6 — Multi-agent parent/child traces
v1.0 — Enterprise self-hosted edition

Contributing

We love contributions. See CONTRIBUTING.md.

License

MIT — see LICENSE.
