Ship AI that actually works.
Evaluate 200+ models across 100+ benchmarks, trace agent behavior, build custom judges, and gate CI/CD on eval results.
Install · Quick Start · Compare · Docs · Examples · Discord
Stratix is built differently. It gives you production-grade evaluation infrastructure out of the box: rich public benchmarks, powerful custom judges, full agent trace analysis, playback, bulk evaluation, and CI/CD gates.
What makes it click:
- 200+ models and 100+ benchmarks, ready to query. No scraping leaderboards, no CSV wrangling. `pc.models.get()` and you're looking at real evaluation data.
- Prompt-level comparisons. Not just "Model A scores 82%." You get the exact prompts where Model A passes and Model B fails, with outcome filters to find the interesting divergences.
- A 4-generation eval ladder. Start with heuristic checks, graduate to model-graded scoring, add deliberation panels, then build auto-optimized GEPA judges. One SDK covers the full spectrum.
- Agent trace evaluation. Upload a multi-step agent trace, replay it, and judge every step. Built for the world where agents do real work.
- CI/CD eval gates. `layerlens ci run --threshold 0.8` in your pipeline. Non-zero exit on regression. No custom scripts needed.
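To make the first rung of the ladder (heuristic checks) and the CI gate from the list above concrete, here is a minimal, self-contained sketch in plain Python. It is not the Stratix implementation: the function names, the two heuristic checks, and the use of the mean as the aggregate score are all assumptions made for illustration.

```python
from statistics import mean


def heuristic_score(output: str, required_terms: list[str], max_chars: int = 500) -> float:
    """Score one model output with two cheap heuristic checks (hypothetical).

    Empty or over-long outputs score 0.0; otherwise the score is the
    fraction of required terms that appear in the output.
    """
    if not output.strip() or len(output) > max_chars:
        return 0.0
    if not required_terms:
        return 1.0
    text = output.lower()
    hits = sum(term.lower() in text for term in required_terms)
    return hits / len(required_terms)


def gate_exit_code(scores: list[float], threshold: float = 0.8) -> int:
    """Return 0 when the mean score meets the threshold, 1 otherwise."""
    if not scores:
        return 1  # assumption: an empty result set counts as a failure
    return 0 if mean(scores) >= threshold else 1


if __name__ == "__main__":
    # Hypothetical outputs, not real evaluation data.
    outputs = ["Paris is the capital of France.", "I am not sure."]
    scores = [heuristic_score(o, ["paris", "france"]) for o in outputs]
    print(scores, gate_exit_code(scores))
```

A pipeline step would pass the returned code to `sys.exit`, so that a score below the threshold fails the build.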
| Capability | Stratix | LangSmith | Langfuse | DeepEval | Phoenix (Arize) |
|---|---|---|---|---|---|
| Pre-built benchmarks | 100+ benchmarks, 200+ models | No public benchmarks | No public benchmarks | 50+ metrics | Bring your own |
| Prompt-level comparison | Native head-to-head with outcome filters | Side-by-side runs (manual) | Side-by-side runs + Playground/Experiments (UI Supported) | Manual setup | Not built-in |
| Custom judge builder | Auto-optimized GEPA judges with budget control | LLM-as-judge (manual) | LLM-as-judge (manual) | Basic LLM judges | LLM-as-judge templates |
| Agent trace evaluation | Upload, replay, judge every step | Trace logging + annotation | Trace logging + scoring | Trace logging only | Trace visualization |
| Eval generation ladder | Heuristic > model-graded > deliberation > GEPA | Single generation | Single generation | Single generation | Single generation |
| CI/CD eval gate | `layerlens ci run` with threshold | Custom integration | Custom integration | `deepeval test` | Manual integration |
| Evaluation Spaces | Collaborative eval environments | Hub (paid) | Not available | Not available | Not available |
| Dataset versioning | Pin evals to versions, diff between runs | Dataset management | Not built-in | Basic support | Dataset management |
| OpenTelemetry export | Native OTLP exporter | Not built-in | Native OTLP | Not built-in | Native (OpenInference) |
| Pricing model | Free public data; premium for org features | Per-trace pricing | Per-event pricing | Open source + cloud | Open source + cloud |
Free to start. PublicClient is free with an API key: query 200+ models and 50+ benchmarks, and run head-to-head comparisons. Advanced features (traces, custom judges, scorers, CI gates) require Stratix Premium. Sign up and purchase credits at app.layerlens.ai.
Note: layerlens is hosted on a private index during early access. Use the command below; the plain `pip install layerlens[cli]` will not work yet.

    pip install --extra-index-url https://sdk.layerlens.ai/package "layerlens[cli]"

Note:
Two clients, one SDK. Use PublicClient for models, benchmarks, and comparisons. Use Stratix for traces, custom judges, scorers, and CI gates. Both take the same API key.

    pip install --extra-index-url https://sdk.layerlens.ai/package "layerlens[cli]"

Get a key from app.layerlens.ai → Settings → API Keys.

    export LAYERLENS_STRATIX_API_KEY="your-api-key"

Then, in Python:

    from layerlens import PublicClient

    pc = PublicClient()

    # List available models
    models = pc.models.get(page_size=10)
    print(f"{models.total_count} models available")

    # Compare two models head-to-head on a benchmark
    comparison = pc.comparisons.compare_models(
        benchmark_id="aime2024",
        model_id_1="openai/gpt-4o",
        model_id_2="anthropic/claude-opus-4",
        outcome_filter="comparison_fails",  # prompts where model 2 fails
    )
    print(comparison)

That's it! You're comparing frontier models on real benchmark data. See full results in the dashboard →
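What an outcome filter selects can be pictured without the SDK. The sketch below assumes per-prompt pass/fail results for two models and keeps the prompts where the first passes and the second fails. The data, the function name, and the result shape are hypothetical; nothing here describes how `compare_models` is implemented or what it returns.

```python
def diverging_prompts(results_1: dict[str, bool], results_2: dict[str, bool]) -> list[str]:
    """Return prompt IDs where model 1 passes and model 2 fails.

    Only prompts present in both result sets are compared.
    """
    shared = results_1.keys() & results_2.keys()
    return sorted(p for p in shared if results_1[p] and not results_2[p])


if __name__ == "__main__":
    # Hypothetical per-prompt outcomes, not real benchmark data.
    model_1 = {"q1": True, "q2": True, "q3": False, "q4": True}
    model_2 = {"q1": True, "q2": False, "q3": False, "q5": True}
    print(diverging_prompts(model_1, model_2))  # ['q2']
```

Restricting the comparison to shared prompts avoids counting a prompt as a divergence simply because one model was never run on it.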
- Run a custom evaluation ➡️ score your own model on any benchmark
- Gate CI/CD on eval results ➡️ `layerlens ci run --threshold 0.8` in your pipeline
- Upload and evaluate agent traces ➡️ multi-step trace analysis
The SDK ships with a full CLI for managing evaluations from your terminal or CI pipeline:

    # Set your API key
    export LAYERLENS_STRATIX_API_KEY="your-api-key"

    # List traces
    layerlens trace list

    # Run a judge evaluation
    layerlens judge run --judge-id <id> --trace-id <id>

    # Evaluate in CI mode (exits non-zero on failure)
    layerlens ci run --judge-id <id> --trace-id <id> --threshold 0.8

layerlens/
_client.py # Stratix (premium) client
_public_client.py # PublicClient (open data)
cli/ # Click-based CLI with rich output
commands/ # trace, judge, evaluate, scorer, space, bulk, ci
models/ # Pydantic response models
resources/ # API resource implementations
contrib/
rich_output.py # Rich terminal tables & progress bars
otel.py # OpenTelemetry integration
tracing.py # @stratix.trace decorator
datasets.py # Dataset versioning & diffs
error_suggestions.py # Context-aware error messages
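The CI-mode command from the CLI listing above can also be driven from a Python build script. The following is a rough sketch under stated assumptions: it uses only the flags shown in that listing, while the helper names and the exit code 2 for a missing key are inventions of this example, not part of the CLI.

```python
import os
import subprocess
import sys


def build_ci_command(judge_id: str, trace_id: str, threshold: float = 0.8) -> list[str]:
    """Assemble the CLI invocation from the flags shown in the CLI listing."""
    return [
        "layerlens", "ci", "run",
        "--judge-id", judge_id,
        "--trace-id", trace_id,
        "--threshold", str(threshold),
    ]


def run_gate(judge_id: str, trace_id: str, threshold: float = 0.8) -> int:
    """Run the gate and return its exit code (non-zero means failure)."""
    if not os.environ.get("LAYERLENS_STRATIX_API_KEY"):
        print("LAYERLENS_STRATIX_API_KEY is not set", file=sys.stderr)
        return 2  # assumption: wrapper-specific code for a missing key
    # check=False: the caller decides what to do with a non-zero exit code.
    completed = subprocess.run(build_ci_command(judge_id, trace_id, threshold), check=False)
    return completed.returncode
```

A build step could then end with `sys.exit(run_gate(judge_id, trace_id))`, taking the two IDs from wherever the pipeline stores them.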
The samples/ directory contains 70+ production-ready samples organized by use case. See samples/README.md for the full index.
| Category | Description |
|---|---|
| Core samples | Quickstart, traces, evaluations, judges, async workflows |
| Industry solutions | Healthcare, financial, legal, government, retail, insurance |
| CI/CD integration | Quality gates, pre-commit hooks, GitHub Actions workflow |
| Multi-agent (Cowork) | Generator-Evaluator, Code Review, RAG, Incident Response patterns |
| Content-type evaluations | Text, brand, and document quality scoring |
| LLM provider integrations | OpenAI, Anthropic, LangChain tracing and instrumentation |
| MCP server | Expose LayerLens as tools for Claude, Cursor, and any MCP-compatible assistant |
| CopilotKit CoAgents | Full-stack LangGraph + generative UI components |
| Claude Code skills | Slash commands for managing LayerLens from the Claude Code CLI |
| OpenClaw agent evaluation | Trace, evaluate, and monitor OpenClaw autonomous agents |
| Sample data | Pre-built traces, test datasets, and industry evaluation data |
Stratix powers evaluation workflows at LayerLens and across teams building production AI systems. The public benchmark data is queried thousands of times per week via the SDK and stratix.layerlens.ai.
If your team uses Stratix, open a PR to add your logo here.
The LayerLens Discord is the best place to:
- Get help with the SDK and trace evaluations
- Share your custom judges and agent workflows
- Access free Stratix Premium Credits for active contributors
- Join weekly Eval Office Hours & model comparison discussions
- Influence the roadmap
Full documentation is available at layerlens.gitbook.io/stratix-python-sdk.
To build docs locally:

    pip install "layerlens[docs]"
    mkdocs serve

Contributions are welcome. See CONTRIBUTING.md for guidelines.
To report a vulnerability, see SECURITY.md.
Apache 2.0. See LICENSE.
Get started in under 2 minutes:

    pip install --extra-index-url https://sdk.layerlens.ai/package "layerlens[cli]"
    export LAYERLENS_STRATIX_API_KEY="your-api-key"
    python3 -c "from layerlens import PublicClient; pc = PublicClient(); print(pc.models.get(page_size=5))"

Then explore the Quick Start guide, try a cookbook recipe, or join the Discord to ask questions and share what you're building.
⭐ Star us if you found this useful! ⭐
It helps more developers discover Stratix.
