Ship AI that actually works. Evaluate 200+ models across 100+ benchmarks, trace agent behavior, build custom judges, and gate CI/CD on eval results.
Install · Quick Start · Compare · Docs · Examples · Discord
Stratix is built differently. It gives you production-grade evaluation infrastructure out of the box: rich public benchmarks, powerful custom judges, full agent trace analysis, playback, bulk evaluation, and CI/CD gates.
What makes it click:
- 200+ models and 100+ benchmarks, ready to query. No scraping leaderboards, no CSV wrangling.
pc.models.get()and you're looking at real evaluation data. - Prompt-level comparisons. Not just "Model A scores 82%." You get the exact prompts where Model A passes and Model B fails, with outcome filters to find the interesting divergences.
- A 4-generation eval ladder. Start with heuristic checks, graduate to model-graded scoring, add deliberation panels, then build auto-optimized GEPA judges. One SDK covers the full spectrum.
- Agent trace evaluation. Upload a multi-step agent trace, replay it, and judge every step. Built for the world where agents do real work.
- CI/CD eval gates.
layerlens ci run --threshold 0.8in your pipeline. Non-zero exit on regression. No custom scripts needed.
| Capability | Stratix | LangSmith | Langfuse | DeepEval | Phoenix (Arize) |
|---|---|---|---|---|---|
| Pre-built benchmarks | 100+ benchmarks, 200+ models | No public benchmarks | No public benchmarks | ~14 metrics | Bring your own |
| Prompt-level comparison | Native head-to-head with outcome filters | Side-by-side runs (manual) | Not built-in | Manual setup | Not built-in |
| Custom judge builder | Auto-optimized GEPA judges with budget control | LLM-as-judge (manual) | LLM-as-judge (manual) | Basic LLM judges | LLM-as-judge templates |
| Agent trace evaluation | Upload, replay, judge every step | Trace logging + annotation | Trace logging + scoring | Trace logging only | Trace visualization |
| Eval generation ladder | Heuristic > model-graded > deliberation > GEPA | Single generation | Single generation | Single generation | Single generation |
| CI/CD eval gate | layerlens ci run with threshold |
Custom integration | Custom integration | deepeval test |
Manual integration |
| Evaluation Spaces | Collaborative eval environments | Hub (paid) | Not available | Not available | Not available |
| Dataset versioning | Pin evals to versions, diff between runs | Dataset management | Not built-in | Basic support | Dataset management |
| OpenTelemetry export | Native OTLP exporter | Not built-in | Native OTLP | Not built-in | Native (OpenInference) |
| Pricing model | Free public data; premium for org features | Per-trace pricing | Per-event pricing | Open source + cloud | Open source + cloud |
# Recommended (includes CLI, rich output, and examples)
pip install layerlens[cli]Note: During early access the package is hosted on a private index. Use:
pip install --extra-index-url https://sdk.layerlens.ai/package layerlens[cli]
Easiest way — use the one-command template:
stratix init my-first-eval
cd my-first-eval
python main.pyOr wire it up yourself in Python:
from layerlens import PublicClient, Stratix
# Public data (models, benchmarks, evaluations)
pc = PublicClient(api_key="your-api-key")
models = pc.models.get(page_size=200)
print(f"{models.total_count} models available")
# Compare two models head-to-head at prompt level
comparison = pc.comparisons.compare_models(
benchmark_id="benchmark-id",
model_id_1="model-a",
model_id_2="model-b",
outcome_filter="comparison_fails", # where model B fails
)
# Premium features (traces, judges, scorers)
client = Stratix(api_key="your-api-key")
# Upload and evaluate an agent trace
client.traces.upload("trace.json")
eval_result = client.trace_evaluations.create(
trace_id="trace-id",
judge_id="judge-id",
)The SDK ships with a full CLI for managing evaluations from your terminal or CI pipeline:
# Set your API key
export LAYERLENS_STRATIX_API_KEY="your-api-key"
# List traces
layerlens trace list
# Run a judge evaluation
layerlens judge run --judge-id <id> --trace-id <id>
# Evaluate in CI mode (exits non-zero on failure)
layerlens ci run --judge-id <id> --trace-id <id> --threshold 0.8layerlens/
_client.py # Stratix (premium) client
_public_client.py # PublicClient (open data)
cli/ # Click-based CLI with rich output
commands/ # trace, judge, evaluate, scorer, space, bulk, ci
models/ # Pydantic response models
resources/ # API resource implementations
contrib/
rich_output.py # Rich terminal tables & progress bars
otel.py # OpenTelemetry integration
tracing.py # @stratix.trace decorator
datasets.py # Dataset versioning & diffs
error_suggestions.py # Context-aware error messages
See the examples/ directory for integration patterns:
| Example | Description |
|---|---|
| LangGraph | Trace and evaluate a LangGraph agent |
| CrewAI | Evaluate CrewAI multi-agent workflows |
| AutoGen | Instrument AutoGen conversations |
| CI/CD Gate | Block deploys on eval regression |
| Custom Judge | Build and optimize a domain judge |
| Prompt Playground | Compare prompt variations side-by-side |
Stratix powers evaluation workflows at LayerLens and across teams building production AI systems. The public benchmark data is queried thousands of times per week via the SDK and stratix.layerlens.ai.
If your team uses Stratix, open a PR to add your logo here.
The LayerLens Discord is the best place to:
- Get help with the SDK and trace evaluations
- Share your custom judges and agent workflows
- Access free Stratix Premium Credits for active contributors
- Join weekly Eval Office Hours & model comparison discussions
- Influence the roadmap
Full documentation is available at layerlens.gitbook.io/stratix-python-sdk.
To build docs locally:
pip install layerlens[docs]
mkdocs serveContributions are welcome. See CONTRIBUTING.md for guidelines.
To report a vulnerability, see SECURITY.md.
Apache 2.0. See LICENSE.
Get started in under 2 minutes:
pip install --extra-index-url https://sdk.layerlens.ai/package layerlens[cli]
stratix init my-first-eval
cd my-first-eval && python main.pyThen explore the Quick Start guide, try a cookbook recipe, or join the Discord to ask questions and share what you're building.
⭐ Star us if you found this useful! ⭐
It helps more developers discover Stratix.