Interactive LLM infrastructure calculator for throughput, latency, memory, and cost modeling. - Infrastructure Engineering Manual
Stop guessing your GPU setup. Start reasoning from first principles.
This is a single-page interactive tool that helps you:
- Estimate LLM throughput (prefill + decode)
- Predict latency metrics (TTFT, ITL)
- Calculate VRAM usage (weights + KV cache)
- Understand memory vs compute bottlenecks
- Choose the right GPU configuration
- Model real-world cloud costs
Most LLM infra decisions are made using:
- vendor benchmarks
- random configs
- trial-and-error
This leads to:
- ❌ Over-provisioned GPUs
- ❌ Poor latency despite high cost
- ❌ OOM crashes at scale
- ❌ Misunderstood bottlenecks
LLM performance is governed by physics, not guesswork.
- Why is streaming slow?
- How do I reduce ITL?
- Why does 128K context break?
- How big is my KV cache really?
- What batch size maximizes throughput?
- What is my tokens/sec per GPU?
- Prefill → compute-bound
- Decode → memory-bound
- KV cache size scales with batch × context × layers
- Visualizes:
  - Compute ceiling
  - Memory ceiling
  - Actual bottleneck
- Models real-world:
  - PCIe vs NVLink
  - AllReduce overhead
  - Fragmentation overhead
  - Autoscaling penalties
  - Cold start I/O limits
- System tokens/sec
- TTFT (Time to First Token)
- ITL (Inter-Token Latency)
- Model weights
- KV cache
- Total VRAM required
- Single GPU vs Multi-GPU
- TP scaling impact
- Interconnect constraints
- Monthly GPU cost
- Cloud provider comparison
- Autoscaling buffer impact
- 🎯 Workload-based presets (Chat / RAG / Batch)
- 📈 Roofline visualization
- 🧠 Physics-based throughput model
- 🧮 KV cache calculator
- 🧭 GPU recommendation engine
- 💰 Cloud TCO estimator
- 📋 GPU spec sheets (A100 → B300)
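As a rough illustration of the arithmetic behind a KV cache and VRAM estimate, here is a minimal Python sketch. The `ModelShape` class, the function names, the 10% overhead factor, and the example shape (an 8B-parameter dense model with grouped-query attention) are assumptions made for illustration, not the tool's actual implementation.

```python
from dataclasses import dataclass

GIB = 1024 ** 3


@dataclass
class ModelShape:
    """Hypothetical description of a dense transformer (field names are illustrative)."""
    n_params: float   # total parameter count
    n_layers: int     # number of transformer layers
    n_kv_heads: int   # KV heads (fewer than attention heads under GQA)
    head_dim: int     # dimension of each head


def weight_bytes(m: ModelShape, bytes_per_param: float = 2.0) -> float:
    """Weights in bytes; 2 bytes per parameter assumes FP16/BF16."""
    return m.n_params * bytes_per_param


def kv_cache_bytes(m: ModelShape, batch: int, context_len: int,
                   bytes_per_elem: float = 2.0) -> float:
    """KV cache in bytes: one K and one V tensor per layer, per token, per sequence."""
    per_token = 2 * m.n_layers * m.n_kv_heads * m.head_dim * bytes_per_elem
    return per_token * context_len * batch


def total_vram_gib(m: ModelShape, batch: int, context_len: int,
                   overhead_frac: float = 0.10) -> float:
    """Weights + KV cache plus an assumed flat overhead for activations and fragmentation."""
    raw = weight_bytes(m) + kv_cache_bytes(m, batch, context_len)
    return raw * (1 + overhead_frac) / GIB


# Assumed example shape: 8B parameters, 32 layers, 8 KV heads of dimension 128.
shape = ModelShape(n_params=8e9, n_layers=32, n_kv_heads=8, head_dim=128)

print(f"weights:  {weight_bytes(shape) / GIB:.1f} GiB")
print(f"KV cache: {kv_cache_bytes(shape, batch=1, context_len=131072) / GIB:.1f} GiB")
print(f"total:    {total_vram_gib(shape, batch=1, context_len=131072):.1f} GiB")
```

Under these assumed numbers, a single 128K-token sequence needs 16 GiB of KV cache in FP16, more than the roughly 14.9 GiB taken by the weights. Because the cache grows linearly with both batch size and context length, long-context workloads can run out of memory well before the weights become the limiting factor.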
This tool provides first-order estimates based on:
- Transformer architectures
- Optimized inference engines (vLLM / TRT-LLM)
- Dense model assumptions
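To make "first-order estimate" concrete, the sketch below computes ideal lower bounds for TTFT and ITL on a dense model. It assumes about 2 FLOPs per parameter per token, ignores attention FLOPs, queueing, and kernel overhead, and uses placeholder hardware ceilings (roughly A100-class). The `Gpu` class and function names are hypothetical and not taken from the tool.

```python
from dataclasses import dataclass


@dataclass
class Gpu:
    """Assumed accelerator ceilings; replace with values from a spec sheet."""
    peak_flops: float  # dense FLOP/s at the serving precision
    mem_bw: float      # HBM bandwidth in bytes/s


def ttft_seconds(n_params: float, prompt_tokens: int, gpu: Gpu) -> float:
    """Prefill lower bound (compute-bound): ~2 FLOPs per parameter per prompt token."""
    return 2.0 * n_params * prompt_tokens / gpu.peak_flops


def itl_seconds(weight_bytes: float, kv_bytes: float, gpu: Gpu) -> float:
    """Decode lower bound (memory-bound): each step re-reads the weights and the KV cache."""
    return (weight_bytes + kv_bytes) / gpu.mem_bw


gpu = Gpu(peak_flops=312e12, mem_bw=2.0e12)  # assumed placeholder ceilings
n_params = 8e9                               # assumed 8B-parameter dense model
weights = n_params * 2.0                     # FP16/BF16 weights, in bytes
kv = 0.25 * 1024 ** 3                        # assumed KV cache size for one sequence

ttft = ttft_seconds(n_params, prompt_tokens=2048, gpu=gpu)
itl = itl_seconds(weights, kv, gpu)
print(f"TTFT lower bound: {ttft * 1e3:.1f} ms")
print(f"ITL lower bound:  {itl * 1e3:.2f} ms ({1.0 / itl:.0f} tok/s for one stream)")
```

These are ceilings, not predictions: measured numbers will be worse by whatever fraction of peak FLOPs and bandwidth the inference engine actually sustains.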
Real-world performance depends on:
- kernel efficiency
- scheduler behavior
- request distribution
- system-level overhead
- Decode is memory-bound, not compute-bound
- KV cache can exceed model size
- Batch size controls throughput and latency
- Multi-GPU ≠ free scaling
- Memory bandwidth matters more than TFLOPS (for inference)
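The batch-size and bandwidth takeaways above can be illustrated with a small sweep. This is a sketch under stated assumptions: a dense 8B-parameter model in FP16, 0.5 GiB of KV cache per sequence, and placeholder hardware ceilings of 312 TFLOPS and 2 TB/s. The function name and all default values are hypothetical.

```python
def decode_step_seconds(batch: int,
                        n_params: float = 8e9,            # assumed model size
                        bytes_per_param: float = 2.0,     # FP16/BF16 weights
                        kv_bytes_per_seq: float = 0.5 * 1024 ** 3,  # assumed per-sequence KV cache
                        peak_flops: float = 312e12,       # assumed compute ceiling
                        mem_bw: float = 2.0e12) -> float: # assumed memory bandwidth
    """Ideal time for one decode step across the whole batch (roofline-style max)."""
    # Memory side: weights are read once per step, KV cache once per sequence.
    mem_time = (n_params * bytes_per_param + batch * kv_bytes_per_seq) / mem_bw
    # Compute side: ~2 FLOPs per parameter per generated token.
    compute_time = 2.0 * n_params * batch / peak_flops
    return max(mem_time, compute_time)


for batch in (1, 8, 32, 128):
    step = decode_step_seconds(batch)
    print(f"batch={batch:4d}  ITL={step * 1e3:7.2f} ms  throughput={batch / step:8.0f} tok/s")
```

With these assumptions the memory term dominates at every batch size: larger batches amortize the weight reads and raise system throughput, but each sequence's KV cache still has to be streamed, so ITL grows with the batch and throughput flattens toward `mem_bw / kv_bytes_per_seq`.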
- AI Infra Engineers
- Backend Engineers moving into GenAI
- ML Engineers deploying LLMs
- Platform teams running vLLM / Triton