Production-quality performance validation and regression testing for LLM inference systems.
Install: pip install llmtest-perf
PyPI: https://pypi.org/project/llmtest-perf/
llmtest-perf is a pytest-like performance validation framework for LLM inference systems. It helps engineering teams answer critical questions before deploying model or infrastructure changes:
- Did latency regress after upgrading to a new model version?
- Did throughput improve or degrade with the new runtime configuration?
- What happened to TTFT (time to first token) and token generation speed?
- How does the system behave under realistic mixed workloads?
- Is this deployment safe to promote based on our SLOs?
This is not a generic benchmark tool. It's a release-gating and regression-testing framework designed for CI/CD pipelines and production deployment validation.
- Workload-aware testing - Define realistic mixed workloads with weighted prompt sets
- CI-friendly - Pass/fail based on SLOs and regression thresholds
- Comparison-first - Built-in baseline vs candidate comparison mode
- Developer-friendly - Declarative YAML configs, rich console output
- Practical metrics - P50/P90/P95/P99 latency, TTFT, throughput, error rates
- Async engine - High-performance httpx-based async workload runner
- Multiple outputs - Console, JSON, and self-contained HTML reports
- Extensible - Clean provider abstraction (OpenAI-compatible first)
llmtest-perf is built for:
- ML Engineers validating model performance before deployment
- Infrastructure Engineers testing LLM serving optimizations
- Platform Teams implementing SLO-based release gates
- DevOps/SRE teams running performance regression tests in CI
# Install from source
cd llmtest-perf
pip install -e .
# Install with dev dependencies
pip install -e ".[dev]"Requirements:
- Python 3.11+
- API access to an OpenAI-compatible endpoint
llmtest-perf init demo

This creates demo.yaml with an example configuration.
Update the endpoints and API keys:
targets:
  baseline:
    base_url: "http://localhost:8000/v1"
    model: "your-model"
    api_key_env: "OPENAI_API_KEY"
  candidate:
    base_url: "http://localhost:8001/v1"
    model: "your-model-v2"
    api_key_env: "OPENAI_API_KEY"

Set your API key and run:

export OPENAI_API_KEY="your-key"
llmtest-perf run demo.yaml --target baseline
llmtest-perf compare demo.yaml

A complete configuration example:

provider: openai_compatible
targets:
  baseline:
    base_url: "http://localhost:8000/v1"
    model: "gpt-3.5-turbo"
    api_key_env: "OPENAI_API_KEY"
  candidate:
    base_url: "http://localhost:8001/v1"
    model: "gpt-4-turbo"
    api_key_env: "OPENAI_API_KEY"

workload:
  duration_seconds: 60
  max_concurrency: 32
  ramp_up_seconds: 10
  stream: true
  prompt_sets:
    - name: short_qa
      weight: 40
      prompts:
        - "What is the capital of France?"
        - "Explain TCP vs UDP briefly."
    - name: long_context
      weight: 30
      prompts:
        - "Summarize this architecture document: ..."
    - name: structured_output
      weight: 30
      prompts:
        - "Return JSON with keys: summary, sentiment for this text."

request:
  max_tokens: 256
  temperature: 0.0
  timeout_seconds: 60

slos:
  p95_latency_ms: 2500
  ttft_ms: 1200
  output_tokens_per_sec: 40
  error_rate_percent: 1.0

comparison:
  fail_on_regression: true
  max_p95_latency_regression_percent: 10
  max_ttft_regression_percent: 10
  max_output_tokens_per_sec_drop_percent: 10
  max_error_rate_increase_percent: 1

reporting:
  json: "artifacts/results.json"
  html: "artifacts/report.html"
  console: true

seed: 42  # For reproducible workloads

Targets - Define baseline and/or candidate deployments
- base_url: OpenAI-compatible endpoint
- model: Model identifier
- api_key_env: Environment variable for the API key
Workload - Configure test execution
- duration_seconds: How long to run the test
- max_concurrency: Maximum concurrent requests
- ramp_up_seconds: Gradual ramp-up period
- stream: Use streaming responses (captures TTFT)
- prompt_sets: Weighted collections of prompts (see the sampling sketch after this reference)
Request - LLM parameters
max_tokens, temperature, timeout_seconds, top_p
SLOs - Absolute thresholds (optional)
p95_latency_ms, ttft_ms, output_tokens_per_sec, error_rate_percent
Comparison - Regression detection (optional)
- fail_on_regression: Exit with an error on regression
- max_*_regression_percent: Allowed regression thresholds
Reporting - Output configuration
json, html, console
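As an illustration of how the weighted prompt sets and the seed noted under Workload interact, here is a minimal sketch of seeded, weight-proportional sampling; the names and structure are illustrative, not llmtest-perf internals:

import random

# Prompt sets as in the example config above; weights sum to 100.
prompt_sets = [
    {"name": "short_qa", "weight": 40,
     "prompts": ["What is the capital of France?", "Explain TCP vs UDP briefly."]},
    {"name": "long_context", "weight": 30,
     "prompts": ["Summarize this architecture document: ..."]},
    {"name": "structured_output", "weight": 30,
     "prompts": ["Return JSON with keys: summary, sentiment for this text."]},
]

rng = random.Random(42)  # seed: 42 -> the same prompt sequence on every run

def next_prompt() -> tuple[str, str]:
    # Pick a set proportionally to its weight, then a prompt within it.
    chosen = rng.choices(prompt_sets, weights=[s["weight"] for s in prompt_sets])[0]
    return chosen["name"], rng.choice(chosen["prompts"])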
# Run all targets
llmtest-perf run config.yaml
# Run specific target
llmtest-perf run config.yaml --target baseline
# Suppress console output
llmtest-perf run config.yaml --quiet

Outputs:
- Console summary (if enabled)
- JSON results (if configured)
- HTML report (if configured)
- Exit code 0 on success, 1 on SLO violation
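Conceptually, the SLO gate maps measured metrics to that exit code. A minimal sketch, assuming the SLO field names from the config reference above (the helper itself is hypothetical, not the package's API):

import sys

def check_slos(measured: dict[str, float], slos: dict[str, float]) -> list[str]:
    # Return a list of violations; an empty list means the gate passes.
    violations = []
    for key in ("p95_latency_ms", "ttft_ms", "error_rate_percent"):
        if key in slos and measured[key] > slos[key]:  # lower is better
            violations.append(f"{key}: {measured[key]:.2f} > {slos[key]}")
    if "output_tokens_per_sec" in slos and measured["output_tokens_per_sec"] < slos["output_tokens_per_sec"]:
        violations.append("output_tokens_per_sec below target")  # higher is better
    return violations

measured = {"p95_latency_ms": 2287.45, "ttft_ms": 1142.22,
            "output_tokens_per_sec": 45.23, "error_rate_percent": 0.27}
slos = {"p95_latency_ms": 2500, "ttft_ms": 1200,
        "output_tokens_per_sec": 40, "error_rate_percent": 1.0}
sys.exit(1 if check_slos(measured, slos) else 0)  # passes with these numbers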
llmtest-perf compare config.yaml

Requires: both baseline and candidate targets in the config
Outputs:
- Side-by-side comparison table
- Regression/improvement detection
- Final verdict (PASS/FAIL)
- Exit code 0 if no regressions, 1 if regressions detected
llmtest-perf validate config.yaml

Validates YAML syntax and schema without running any tests.
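Under the hood, validation amounts to parsing the YAML and checking it against typed models. A simplified sketch with Pydantic, which the config layer is built on (these classes are stand-ins for the models that ship with the package):

import yaml
from pydantic import BaseModel, ValidationError

class Target(BaseModel):  # simplified stand-in for the packaged config models
    base_url: str
    model: str
    api_key_env: str | None = None

class Config(BaseModel):
    provider: str = "openai_compatible"
    targets: dict[str, Target]

try:
    with open("config.yaml") as f:
        Config.model_validate(yaml.safe_load(f))
    print("config OK")
except ValidationError as err:
    print(err)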
llmtest-perf init demo

Creates a demo configuration file to get started.
Results for: baseline
┏━━━━━━━━━━━━━━━━━━┳━━━━━━━━┓
┃ Metric           ┃ Value  ┃
┡━━━━━━━━━━━━━━━━━━╇━━━━━━━━┩
│ Total Requests   │ 1,847  │
│ Successful       │ 1,842  │
│ Failed           │ 5      │
│ Error Rate       │ 0.27%  │
│ Duration         │ 60.12s │
│ Throughput       │ 30.7/s │
│ Token Throughput │ 45.2/s │
└──────────────────┴────────┘

┏━━━━━━━━━━━━┳━━━━━━━━━┓
┃ Percentile ┃ Value   ┃
┡━━━━━━━━━━━━╇━━━━━━━━━┩
│ P50        │ 1,842ms │
│ P90        │ 2,104ms │
│ P95        │ 2,287ms │
│ P99        │ 2,891ms │
└────────────┴─────────┘
Comparison: baseline vs candidate
┏━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━┓
┃ Metric            ┃ Baseline ┃ Candidate ┃ Delta  ┃ Status      ┃
┡━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━┩
│ P95 Latency (ms)  │ 2287.45  │ 2612.38   │ +14.2% │ REGRESSION  │
│ TTFT (ms)         │ 1142.22  │ 1046.11   │ -8.4%  │ IMPROVEMENT │
│ Output Tokens/Sec │ 45.23    │ 39.76     │ -12.1% │ REGRESSION  │
│ Error Rate (%)    │ 0.27     │ 0.32      │ +18.5% │ OK          │
└───────────────────┴──────────┴───────────┴────────┴─────────────┘

╭──────────────────────────────────────────╮
│ FAIL                                     │
│                                          │
│ Performance regression detected (2       │
│ metrics)                                 │
│                                          │
│ Recommendation: DO NOT PROMOTE - Fix     │
│ regressions before deploying             │
╰──────────────────────────────────────────╯
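The deltas above are plain relative percentages against the baseline, and the verdict follows from the comparison thresholds in the config. Note that error rate reads OK despite +18.5% relative because max_error_rate_increase_percent appears to gate the absolute increase (0.32 - 0.27 = 0.05 points, under the 1-point allowance). A sketch of the latency/throughput check (hypothetical helpers, not the package's internals):

def relative_delta_percent(baseline: float, candidate: float) -> float:
    return (candidate - baseline) / baseline * 100

def status(delta: float, max_regression: float, higher_is_better: bool = False) -> str:
    # For throughput-style metrics a drop (negative delta) is the bad direction.
    regression = -delta if higher_is_better else delta
    if regression > max_regression:
        return "REGRESSION"
    return "IMPROVEMENT" if regression < 0 else "OK"

# P95 latency: +14.2% against a 10% allowance -> REGRESSION
print(status(relative_delta_percent(2287.45, 2612.38), max_regression=10))
# Output tokens/sec: -12.1% against a 10% allowance -> REGRESSION
print(status(relative_delta_percent(45.23, 39.76), max_regression=10, higher_is_better=True))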
Compare inference performance between model versions:
targets:
  baseline:
    base_url: "https://api.example.com/v1"
    model: "gpt-3.5-turbo"
  candidate:
    base_url: "https://api.example.com/v1"
    model: "gpt-4-turbo"

Run: llmtest-perf compare config.yaml
Test serving optimizations or hardware changes:
targets:
  baseline:
    base_url: "http://old-cluster.internal/v1"
    model: "llama-2-70b"
  candidate:
    base_url: "http://new-cluster.internal/v1"
    model: "llama-2-70b"

Gate deployments on SLO compliance:
slos:
  p95_latency_ms: 2000
  ttft_ms: 800
  error_rate_percent: 0.5

comparison:
  fail_on_regression: true

Integrated into CI, the job fails if SLOs are violated or regressions are detected.
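Most CI systems fail a job on a nonzero exit code automatically, so the compare command alone works as a gate; if you need custom handling, a small wrapper can inspect the code (the config path here is illustrative):

import subprocess
import sys

# Exit code 0 = SLOs met and no regressions; 1 = gate failed.
result = subprocess.run(["llmtest-perf", "compare", "perf-config.yaml"])
if result.returncode != 0:
    print("Performance gate failed - blocking promotion.")
sys.exit(result.returncode)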
Model production traffic patterns:
prompt_sets:
  - name: quick_questions
    weight: 60
    prompts: [...]
  - name: complex_analysis
    weight: 30
    prompts: [...]
  - name: code_generation
    weight: 10
    prompts: [...]

name: LLM Performance Tests
on:
  pull_request:
    paths:
      - 'models/**'
      - 'infrastructure/**'

jobs:
  perf-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.11'

      - name: Install llmtest-perf
        run: |
          pip install -e .

      - name: Run performance comparison
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: |
          llmtest-perf compare .github/perf-config.yaml

      - name: Upload reports
        if: always()
        uses: actions/upload-artifact@v3
        with:
          name: perf-reports
          path: artifacts/llmtest-perf/
├── config/      # YAML config loading and validation (Pydantic)
├── providers/   # LLM provider abstraction (OpenAI-compatible)
├── engine/      # Async workload runner, metrics, scheduling
├── compare/     # Baseline vs candidate comparison logic
├── reporting/   # Console, JSON, HTML report generation
└── cli.py       # Typer-based CLI
- Async I/O: httpx-based for high concurrency
- Concurrency control: Semaphore-based with configurable limits
- Ramp-up: Linear ramp-up to avoid cold-start bias
- Streaming: Captures TTFT when streaming is enabled
- Reproducibility: Optional random seed for deterministic workloads
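A stripped-down sketch of the pattern these bullets describe - semaphore-bounded concurrency with TTFT taken at the first streamed chunk; illustrative only, not the package's actual runner:

import asyncio
import time
import httpx

async def one_request(client: httpx.AsyncClient, sem: asyncio.Semaphore, prompt: str) -> dict:
    async with sem:  # caps in-flight requests at max_concurrency
        start = time.perf_counter()
        ttft = None
        async with client.stream(
            "POST", "/chat/completions",
            json={"model": "your-model", "stream": True,
                  "messages": [{"role": "user", "content": prompt}]},
        ) as resp:
            async for line in resp.aiter_lines():
                if ttft is None and line.startswith("data:"):
                    ttft = time.perf_counter() - start  # time to first token
        return {"latency_s": time.perf_counter() - start, "ttft_s": ttft}

async def main() -> None:
    sem = asyncio.Semaphore(32)  # workload.max_concurrency
    async with httpx.AsyncClient(base_url="http://localhost:8000/v1") as client:
        results = await asyncio.gather(*(one_request(client, sem, "Hi") for _ in range(100)))
        print(results[0])

asyncio.run(main())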
Per-request:
- End-to-end latency
- Time to first token (TTFT) for streaming
- Input/output token counts
- Error type and message
Aggregated:
- P50/P90/P95/P99 percentiles
- Mean, min, max
- Request throughput (req/sec)
- Token throughput (tok/sec)
- Error rate percentage
- Per-prompt-set breakdown
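The percentile aggregation can be pictured as a nearest-rank lookup over sorted per-request latencies (a sketch; the package may use a different interpolation):

import math

def percentile(sorted_ms: list[float], p: float) -> float:
    # Nearest-rank percentile over an already-sorted list of latencies (ms).
    k = max(0, math.ceil(p / 100 * len(sorted_ms)) - 1)
    return sorted_ms[k]

latencies = sorted([1812.0, 1842.0, 1901.0, 2104.0, 2287.0, 2891.0])  # toy data
for p in (50, 90, 95, 99):
    print(f"P{p}: {percentile(latencies, p):.0f}ms")
print(f"Throughput: {1847 / 60.12:.1f}/s")  # requests / duration, per the example run above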
- OpenAI-compatible only: Currently supports OpenAI Chat Completions API format
- HTTP-based: No gRPC or other protocols yet
- Single-region: No multi-region testing
- Token counting: Relies on provider-reported token counts
- Inline prompts only: very large prompts may need external prompt file support (see Roadmap)
See Roadmap section below.
Future enhancements under consideration:
- Additional providers: Anthropic, Google, AWS Bedrock, Azure OpenAI
- Advanced workloads: Prompt templates, external prompt files, payload generation
- Cost tracking: Track API costs alongside performance
- Historical trending: Track metrics over time
- Warmup phase: Pre-test warmup to eliminate cold starts
- Custom metrics: Plugin system for custom metric collection
- Distributed testing: Multi-node workload generation
- Real-time monitoring: Live dashboard during test runs
# Clone repo
git clone <repo-url>
cd llmtest-perf
# Install in development mode
pip install -e ".[dev]"# Run all tests
pytest
# With coverage
pytest --cov=llmtest_perf --cov-report=html
# Type checking
mypy src/llmtest_perf
# Linting
ruff check src/llmtest_perf/
├── src/llmtest_perf/   # Main package
│   ├── config/         # Configuration models
│   ├── engine/         # Workload execution
│   ├── providers/      # Provider implementations
│   ├── compare/        # Comparison logic
│   ├── reporting/      # Report generation
│   └── cli.py          # CLI interface
├── tests/              # Test suite
├── examples/           # Example configs
├── artifacts/          # Generated reports (gitignored)
└── README.md
Contributions are welcome! Areas of interest:
- Additional provider implementations
- Performance optimizations
- Additional metrics and analysis
- Documentation improvements
- Bug reports and fixes
MIT License - see LICENSE file for details.
llmtest-perf complements llmtest (a correctness/safety testing framework) by focusing exclusively on performance validation. Together, they provide comprehensive testing for LLM applications:
- llmtest: Grounding, safety, prompt injection, behavioral regression
- llmtest-perf: Latency, throughput, TTFT, performance regression
Use both for complete release validation.
Built for engineers who ship LLM systems to production.