Eamon3949/mimic-recording

Mimic

The pytest for AI agents. Record, replay, assert, and diff agent behavior.

PyPI | License: MIT | Python 3.9+

Mimic is an open-source library that lets you record an AI agent's behavior, replay it deterministically, assert properties about it, and diff runs across versions. It's the missing testing layer for the agent era.

from openai import OpenAI

from mimic import Mimic, assert_that, replay
from mimic.integrations.openai import tracked_completion

mimic = Mimic()
client = OpenAI()

@mimic.record("customer-support-agent", model="gpt-4o")
def answer(question: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": question}],
    )
    tracked_completion(resp)  # auto-captures tokens + cost
    return resp.choices[0].message.content

Verified performance

Scenario                        | Record mode       | Replay mode | Savings
--------------------------------|-------------------|-------------|----------
5-test multi-step agent suite   | 360 ms            | 50 ms       | 7× faster
1000 CI runs of the same suite  | ~$2 in LLM cost   | $0          | 100%

Run mimic benchmark --runs 1000 on your own recordings to see your numbers.

Why Mimic?

Every team building AI agents hits the same wall:

  • "I changed a prompt. Did I break anything?" — You don't know.
  • "I switched from GPT-4 to Claude. Is it 2x more expensive?" — You don't know.
  • "Did this agent ever call delete_file in production?" — You don't know.
  • "Why did the agent fail on Tuesday at 3pm?" — You don't know.

Mimic turns those unknowns into testable, replayable, diffable artifacts. Think Sentry recordings + pytest assertions + git blame, purpose-built for LLM agents.

Features

  • Record any callable — sync or async, LLM calls, tool use, multi-step agents
  • Replay runs offline with zero API cost, byte-for-byte deterministic
  • Assert behavioral properties: cost, latency, tool usage, output content
  • Diff two runs to see exactly what changed
  • Auto-track LLM costs for OpenAI, Anthropic, Gemini (zero-config)
  • Multi-step agents with per-step recording, cost, and metadata
  • Privacy mode (capture_args=False, capture_return=False)
  • Storage-agnostic — filesystem by default, pluggable for S3/Postgres
  • Zero LLM vendor lock-in — works with any model
  • Beautiful CLI: mimic run / list / show / diff / report / benchmark
  • CI-ready — GitHub Actions template + pre-commit hook included

Install

pip install mimic-recording

Or with optional integrations:

pip install mimic-recording[openai]
pip install mimic-recording[anthropic]

Quick start

mkdir my-agent && cd my-agent
mimic init

This creates a project skeleton:

my-agent/
├── mimic.yaml           # Project config
├── tests/
│   └── test_agent.py    # Your recorded tests
└── .mimic/              # Recorded runs (gitignored by default)

Edit tests/test_agent.py:

from mimic import Mimic, assert_that, replay

mimic = Mimic()

@mimic.record("my-agent", model="gpt-4o")
def answer(question: str) -> str:
    # ... your LLM call here ...
    return "..."

def test_agent():
    answer("hello")
    recorded = replay("my-agent")
    assert_that(recorded).finished_without_errors()
    assert_that(recorded).cost_less_than(usd=0.05)
    assert_that(recorded).did_not_call_tool("delete_database")

Run it:

mimic run tests/                 # records + runs (costs $$)
MIMIC_MODE=replay mimic run tests/  # replays only (free, deterministic)

Multi-step agents

For ReAct, multi-agent, or any agent with multiple LLM/tool calls, record each step:

@mimic.record("research-agent")
async def research(question: str) -> str:
    # Step 1: plan
    with mimic.step("plan", model="gpt-4o-mini") as s:
        resp = await llm.complete(model="gpt-4o-mini", messages=[...])
        tracked_completion(resp)
        s.metadata["plan_steps"] = 3

    # Step 2: search
    with mimic.step("search") as s:
        results = await web_search(question)
        s.metadata["result_count"] = len(results)

    # Step 3: synthesize
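    with mimic.step("synthesize", model="gpt-4o") as s:
        resp = await llm.complete(model="gpt-4o", messages=[...])
        tracked_completion(resp)
        # Assumes an OpenAI-style response object; adapt to your client.
        summary = resp.choices[0].message.content

    return summary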

Assertions

The full set of assertions (each returns self, so calls can be chained):

assert_that(run).finished_without_errors()
assert_that(run).had_error()                   # inverse
assert_that(run).cost_less_than(usd=0.05)
assert_that(run).completed_under(ms=2000)
assert_that(run).output_contains("substring")
assert_that(run).output_matches(r"regex")
assert_that(run).output_equals(value)
assert_that(run).called_tool("search")
assert_that(run).did_not_call_tool("delete_database")
assert_that(run).called_tools(["search", "synthesize"])
assert_that(run).had_exactly(3)
assert_that(run).had_at_least(2)
assert_that(run).used_model("gpt-4o")
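
The chaining style can be illustrated with a small, self-contained sketch. This is not Mimic's implementation: the Run dataclass, its fields, and the RunAssertions class are hypothetical stand-ins, and only the method names mirror the list above. Each check raises AssertionError on failure and returns self on success, which is what makes the calls composable.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Run:
    """Hypothetical stand-in for a recorded run (not Mimic's real type)."""
    output: str
    cost_usd: float
    error: Optional[str] = None
    tools_called: List[str] = field(default_factory=list)


class RunAssertions:
    """Each check raises AssertionError on failure and returns self on success."""

    def __init__(self, run: Run) -> None:
        self.run = run

    def finished_without_errors(self) -> "RunAssertions":
        if self.run.error is not None:
            raise AssertionError(f"run failed: {self.run.error}")
        return self

    def cost_less_than(self, usd: float) -> "RunAssertions":
        if not self.run.cost_usd < usd:
            raise AssertionError(f"cost {self.run.cost_usd} is not below {usd}")
        return self

    def did_not_call_tool(self, name: str) -> "RunAssertions":
        if name in self.run.tools_called:
            raise AssertionError(f"unexpected tool call: {name}")
        return self


def assert_that(run: Run) -> RunAssertions:
    return RunAssertions(run)


# One chained expression; the first failing check stops the chain.
run = Run(output="hi", cost_usd=0.01, tools_called=["search"])
checked = (
    assert_that(run)
    .finished_without_errors()
    .cost_less_than(usd=0.05)
    .did_not_call_tool("delete_database")
)
```

Raising explicitly rather than using the assert statement keeps the checks active even when Python runs with -O.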

How it works

Mimic sits outside your agent code, watching the inputs and outputs of any function you decorate. The first time the function runs, Mimic records the full execution into a content-addressable store. Subsequent test runs use the stored record instead of calling the LLM, making them fast, free, and deterministic.
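
As a rough illustration of the record-then-replay idea, here is a minimal sketch of a decorator backed by a hash-keyed store. It is an assumption-laden toy, not Mimic's code: the STORE directory, the _key helper, and the choice to key recordings by a SHA-256 of the function name and inputs are all hypothetical, and how Mimic actually derives its keys is not stated here.

```python
import functools
import hashlib
import json
import tempfile
from pathlib import Path

# Hypothetical store location; a real project would use a stable directory.
STORE = Path(tempfile.mkdtemp(prefix="recordings-"))


def _key(name, args, kwargs):
    """Derive a stable key: SHA-256 over the function name and its inputs."""
    payload = json.dumps([name, args, kwargs], sort_keys=True, default=repr)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def record(name):
    """Record the result on the first call; serve the stored result afterwards."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            path = STORE / f"{_key(name, args, kwargs)}.json"
            if path.exists():
                # Replay: no call to the wrapped function, so no LLM cost.
                return json.loads(path.read_text())["output"]
            output = fn(*args, **kwargs)
            path.write_text(json.dumps({"name": name, "output": output}))
            return output
        return wrapper
    return decorator


calls = []


@record("demo-agent")
def answer(question):
    calls.append(question)  # stands in for a paid LLM call
    return f"echo: {question}"


first = answer("hello")   # runs the function and stores the result
second = answer("hello")  # served from the store
```

After both calls, the wrapped function has run only once, which is the property that makes replayed test runs fast and free. The sketch only handles JSON-serializable return values.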

For multi-step agents, Mimic records each step separately, so you can replay just the broken step without re-running the whole agent.
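
Per-step recording can be pictured as a context manager that captures each step's name, metadata, duration, and error. The sketch below assumes hypothetical Step and StepRecorder types. It does not reproduce mimic.step, but it shows why a separate record per step makes it easy to find the one that broke.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Any, Dict, Iterator, List


@dataclass
class Step:
    """Hypothetical per-step record."""
    name: str
    metadata: Dict[str, Any] = field(default_factory=dict)
    duration_ms: float = 0.0
    error: str = ""


class StepRecorder:
    def __init__(self) -> None:
        self.steps: List[Step] = []

    @contextmanager
    def step(self, name: str) -> Iterator[Step]:
        current = Step(name=name)
        start = time.perf_counter()
        try:
            yield current
        except Exception as exc:
            current.error = repr(exc)  # keep the failure with the step
            raise
        finally:
            current.duration_ms = (time.perf_counter() - start) * 1000.0
            self.steps.append(current)


recorder = StepRecorder()

with recorder.step("plan") as s:
    s.metadata["plan_steps"] = 3

try:
    with recorder.step("search"):
        raise RuntimeError("search backend down")
except RuntimeError:
    pass

# The failing step is identifiable without re-running the earlier ones.
failed = [step.name for step in recorder.steps if step.error]
```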

CI integration

Drop the included .github/workflows/test.yml into your repo. It runs your test suite in replay mode (no LLM cost) and validates that no cost was incurred.

Manual re-recording is a separate job, triggered on workflow_dispatch or a schedule.

The recording format

Mimic recordings are plain JSON conforming to a documented schema — see RECORDING_FORMAT.md. The format is vendor-neutral: you can build readers, web UIs, or analysis tools without depending on the Mimic library.
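
Because recordings are plain JSON, a reader needs nothing beyond the standard library. The sketch below summarizes a directory of recordings. The field names (name, cost_usd, error) and the one-file-per-run layout are assumptions made for illustration; RECORDING_FORMAT.md is the authoritative reference for the real schema.

```python
import json
import tempfile
from pathlib import Path


def summarize(directory: Path) -> dict:
    """Aggregate run count, error count, and total cost over *.json files."""
    runs = [json.loads(p.read_text()) for p in sorted(directory.glob("*.json"))]
    return {
        "runs": len(runs),
        "errors": sum(1 for r in runs if r.get("error")),
        "total_cost_usd": round(sum(r.get("cost_usd", 0.0) for r in runs), 6),
    }


# Sample data using the hypothetical field names described above.
sample_dir = Path(tempfile.mkdtemp(prefix="sample-recordings-"))
(sample_dir / "a.json").write_text(
    json.dumps({"name": "my-agent", "cost_usd": 0.012, "error": None})
)
(sample_dir / "b.json").write_text(
    json.dumps({"name": "my-agent", "cost_usd": 0.003, "error": "timeout"})
)

summary = summarize(sample_dir)
print(summary)
```

The same loop could feed a dashboard or a CI check that fails when the total cost of a replay-only run is not zero.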

The $100M thesis

Mimic sits at the intersection of three exploding markets:

  1. AI agent development: 10M+ developers will build agents by 2027.
  2. AI observability: already a $2B+ market, led by commercial vendors (LangSmith, Helicone, Langfuse).
  3. AI safety & compliance: every enterprise deploying agents needs guardrails, audit trails, and replay.

The land-and-expand model is proven (Sentry, Supabase, GitLab, Vercel, PostHog): open source core → community growth → enterprise tier with self-hosted, SSO, audit logs, and SOC2.

See BUSINESS_PLAN.md for the full strategy.

Roadmap

  • v0.1 — Record/replay/assert core
  • v0.2 — Async + multi-step + OpenAI/Anthropic cost tracking
  • v0.3 — Web UI for browsing recorded runs
  • v0.4 — TypeScript SDK
  • v0.5 — Auto-generated regression tests from production traces
  • v0.6 — Multi-agent parent/child traces
  • v1.0 — Enterprise self-hosted edition

Contributing

We love contributions. See CONTRIBUTING.md.

License

MIT — see LICENSE.
