Kernel Forge automatically generates and optimizes GPU kernels for PyTorch models with no kernel programming expertise required. It profiles your model at the operator level, uses an LLM to write a correct kernel, then searches for performance improvements using Monte Carlo Tree Search until the kernel beats PyTorch's baseline.
- ML engineers running models in production who want lower inference latency on specific hardware without writing CUDA or Triton by hand.
- AI infrastructure teams targeting specific GPU hardware (NVIDIA CUDA or AMD ROCm) who need kernels tuned to that exact device.
- Teams with remote GPU access who run optimization on a separate GPU server while managing projects locally.
- Researchers benchmarking operator-level speedups across different LLM backends or optimization strategies.
- Teams packaging models for deployment who want an inference artifact with model weights and optimized kernels baked in.
- Automated kernel generation via LLM with compile-error feedback loop
- MCTS-driven optimization - explores tiling, loop unrolling, vectorized memory access, and more
- CUDA and Triton backends (NVIDIA and AMD ROCm)
- Remote execution over SSH - no local GPU required
- Multi-LLM support: Anthropic, OpenAI, Google
- Web dashboard with live progress, speed charts, and MCTS tree inspector
- Portable .anvil snapshots and deployment-oriented .cast inference packages
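
To make the MCTS-driven optimization bullet more concrete, here is a minimal sketch of a UCB1-guided tree search over kernel transformations. It is not Kernel Forge's implementation: the `Node` class, the transform names, and the `SPEEDUPS` table are hypothetical stand-ins, and a real run would score each candidate by compiling and timing it on the GPU rather than multiplying made-up factors.

```python
import math
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Hypothetical speedup factors, used only so the sketch runs without a GPU.
SPEEDUPS = {"tile": 1.3, "unroll": 1.1, "vectorize": 1.5}


@dataclass(eq=False)
class Node:
    transform: str
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)
    visits: int = 0
    total_reward: float = 0.0


def ucb1(node: Node, c: float = 1.4) -> float:
    """Score a child: average reward plus an exploration bonus."""
    if node.visits == 0:
        return math.inf  # always try unvisited children first
    exploit = node.total_reward / node.visits
    explore = c * math.sqrt(math.log(node.parent.visits) / node.visits)
    return exploit + explore


def select(node: Node) -> Node:
    """Walk down the tree, following the best UCB1 score, until a leaf."""
    while node.children:
        node = max(node.children, key=ucb1)
    return node


def path_of(node: Node) -> List[str]:
    """Transforms applied from the root down to this node."""
    path = []
    while node.parent is not None:
        path.append(node.transform)
        node = node.parent
    return path[::-1]


def evaluate(path: List[str]) -> float:
    """Stand-in for 'compile and benchmark': multiply the hypothetical factors."""
    speedup = 1.0
    for transform in path:
        speedup *= SPEEDUPS[transform]
    return speedup


def backpropagate(node: Optional[Node], reward: float) -> None:
    """Credit the reward to the node and all of its ancestors."""
    while node is not None:
        node.visits += 1
        node.total_reward += reward
        node = node.parent


def search(iterations: int = 60, max_depth: int = 2) -> Tuple[List[str], float]:
    root = Node("baseline")
    best_path, best_speedup = [], 1.0
    for _ in range(iterations):
        node = select(root)
        used = path_of(node)
        if len(used) < max_depth:
            # Expand with every transform not yet applied on this path.
            node.children = [Node(t, parent=node) for t in SPEEDUPS if t not in used]
            node = node.children[0]
        speedup = evaluate(path_of(node))
        if speedup > best_speedup:
            best_path, best_speedup = path_of(node), speedup
        backpropagate(node, speedup)
    return best_path, best_speedup


best_path, best_speedup = search()
```

In this toy setup the search settles on combining the two strongest transforms. The point of the tree structure is that measured results from earlier candidates steer which combinations get tried next.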
The arXiv evaluation reports opt50 operator-level results on four real PyTorch workloads running on an NVIDIA DGX Spark with GB10 GPU. The most favorable generated-kernel wins are measured against the PyTorch eager path for the same captured operator inputs.
- Best generated-kernel wins: 2.83x on Gemma 4 E2B softmax, 1.70x on Stable Diffusion 3.5 Medium group normalization, 1.54x on Qwen 3.5 35B-A3B softmax, and 1.52x on ResNet-50 adaptive average pooling.
- Across ResNet-50, Stable Diffusion 3.5 Medium, Gemma 4 E2B, and Qwen 3.5 35B-A3B, 14 generated operator candidates outperform PyTorch eager at opt50.
- Kernel Forge uses guarded dispatch: generated kernels are selected only where they improve measured operator latency, while PyTorch eager is preserved for stronger framework or vendor-backed paths.
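
The guarded-dispatch idea can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not Kernel Forge's actual dispatcher: the `GuardedDispatch` class, its `record` method, the key function, and the latency numbers are all hypothetical, and both "kernels" are ordinary Python functions standing in for the PyTorch eager path and a generated kernel.

```python
import math
from typing import Callable, Dict, Hashable, List


class GuardedDispatch:
    """Use the candidate only for input keys where it measured faster."""

    def __init__(self, baseline: Callable, candidate: Callable,
                 key_fn: Callable[..., Hashable]):
        self.baseline = baseline
        self.candidate = candidate
        self.key_fn = key_fn
        self._use_candidate: Dict[Hashable, bool] = {}
        self.last_path = None  # records which path served the last call

    def record(self, key: Hashable, baseline_ms: float, candidate_ms: float) -> None:
        """Store the benchmark outcome for one input key (e.g. a shape)."""
        self._use_candidate[key] = candidate_ms < baseline_ms

    def __call__(self, *args):
        key = self.key_fn(*args)
        # Unmeasured keys fall back to the baseline, which is the safe default.
        if self._use_candidate.get(key, False):
            self.last_path = "candidate"
            return self.candidate(*args)
        self.last_path = "baseline"
        return self.baseline(*args)


def eager_softmax(xs: List[float]) -> List[float]:
    """Stand-in for the framework's reference implementation."""
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def generated_softmax(xs: List[float]) -> List[float]:
    """Stand-in for a generated kernel; same math as the reference."""
    return eager_softmax(xs)


softmax = GuardedDispatch(eager_softmax, generated_softmax, key_fn=len)
softmax.record(key=4, baseline_ms=0.20, candidate_ms=0.10)  # hypothetical: candidate wins
softmax.record(key=8, baseline_ms=0.10, candidate_ms=0.15)  # hypothetical: baseline wins
```

Keying the decision on the captured input (here, just the length) means a kernel that wins on one shape cannot silently slow down another; anything that was never benchmarked stays on the baseline.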
See system requirements before installing.
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
cd frontend
jac install

Configure your LLM key in the settings panel after starting, or set ANTHROPIC_API_KEY, OPENAI_API_KEY, or GOOGLE_API_KEY before launch.

jac start main.jac

Open http://localhost:8000. Create a project, upload your model weights, and click Start Forge.
For headless or scripted runs, see docs/cli.md.

