Kernel Forge

Drop-in GPU kernel optimizer for PyTorch models.

Kernel Forge automatically generates and optimizes GPU kernels for PyTorch models with no kernel programming expertise required. It profiles your model at the operator level, uses an LLM to write a correct kernel, then searches for performance improvements using Monte Carlo Tree Search until the kernel beats PyTorch's baseline.

Who is this for?

ML engineers running models in production who want lower inference latency on specific hardware without writing CUDA or Triton by hand.
AI infrastructure teams targeting specific GPU hardware (NVIDIA CUDA or AMD ROCm) who need kernels tuned to that exact device.
Teams with remote GPU access who run optimization on a separate GPU server while managing projects locally.
Researchers benchmarking operator-level speedups across different LLM backends or optimization strategies.
Teams packaging models for deployment who want an inference artifact with model weights and optimized kernels baked in.

Features

Automated kernel generation via LLM with compile-error feedback loop
MCTS-driven optimization - explores tiling, loop unrolling, vectorized memory access, and more
CUDA and Triton backends (NVIDIA and AMD ROCm)
Remote execution over SSH - no local GPU required
Multi-LLM support: Anthropic, OpenAI, Google
Web dashboard with live progress, speed charts, and MCTS tree inspector
Portable .anvil snapshots and deployment-oriented .cast inference packages
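
As a rough illustration of the MCTS-driven search listed above, here is a minimal sketch of the selection and backpropagation steps of a tree search over kernel transformations. It is not Kernel Forge's implementation: `Node`, `ucb1`, `select_child`, and `backpropagate` are hypothetical names, and both the UCB1 rule with `c = 1.4` and the use of measured speedup as the reward are assumptions made for the example.

```python
from __future__ import annotations

import math
from dataclasses import dataclass, field


@dataclass
class Node:
    """One candidate kernel variant in the search tree (hypothetical structure)."""
    transform: str                      # e.g. "tiling", "unroll", "vectorize"
    visits: int = 0
    total_reward: float = 0.0           # assumed reward: sum of measured speedups
    children: list[Node] = field(default_factory=list)


def ucb1(child: Node, parent_visits: int, c: float = 1.4) -> float:
    """Score a child: mean reward plus an exploration bonus for rarely tried nodes."""
    if child.visits == 0:
        return math.inf                 # always try an unvisited variant first
    mean = child.total_reward / child.visits
    bonus = c * math.sqrt(math.log(max(parent_visits, 1)) / child.visits)
    return mean + bonus


def select_child(parent: Node) -> Node:
    """Pick the child with the highest UCB1 score (first one wins on ties)."""
    return max(parent.children, key=lambda ch: ucb1(ch, parent.visits))


def backpropagate(path: list[Node], reward: float) -> None:
    """Record one measured reward on every node from the root to the leaf."""
    for node in path:
        node.visits += 1
        node.total_reward += reward
```

In this sketch, each iteration would call `select_child` down the tree, benchmark the chosen variant, and pass the measured speedup to `backpropagate`, so that promising transformations are revisited more often.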

Full feature details

Benchmark Snapshot

The arXiv evaluation reports opt50 operator-level results for four real PyTorch workloads running on an NVIDIA DGX Spark with a GB10 GPU. The most favorable generated-kernel wins are measured against the PyTorch eager path on the same captured operator inputs.

Best generated-kernel wins: 2.83x on Gemma 4 E2B softmax, 1.70x on Stable Diffusion 3.5 Medium group normalization, 1.54x on Qwen 3.5 35B-A3B softmax, and 1.52x on ResNet-50 adaptive average pooling.
Across ResNet-50, Stable Diffusion 3.5 Medium, Gemma 4 E2B, and Qwen 3.5 35B-A3B, 14 generated operator candidates outperform PyTorch eager at opt50.
Kernel Forge uses guarded dispatch: generated kernels are selected only where they improve measured operator latency, while PyTorch eager is preserved for stronger framework or vendor-backed paths.
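
The guarded-dispatch idea above can be sketched in a few lines: for each operator, compare measured latencies and fall back to PyTorch eager unless the generated kernel is faster. This is an illustrative sketch only. `choose_kernel`, `build_dispatch_table`, the use of the median, and the `min_speedup` threshold are assumptions, not Kernel Forge's actual API or decision rule, and the timings in the usage example are made up.

```python
from statistics import median
from typing import Mapping, Sequence


def choose_kernel(
    baseline_ms: Sequence[float],
    candidate_ms: Sequence[float],
    min_speedup: float = 1.0,
) -> str:
    """Return "generated" only if the candidate is measurably faster than eager."""
    speedup = median(baseline_ms) / median(candidate_ms)
    return "generated" if speedup > min_speedup else "eager"


def build_dispatch_table(
    timings: Mapping[str, tuple[Sequence[float], Sequence[float]]],
    min_speedup: float = 1.0,
) -> dict[str, str]:
    """Map each operator name to the implementation that should serve it.

    `timings` maps operator name -> (eager latencies, generated-kernel latencies),
    both measured on the same captured inputs.
    """
    return {
        op: choose_kernel(eager, generated, min_speedup)
        for op, (eager, generated) in timings.items()
    }


if __name__ == "__main__":
    # Made-up latencies in milliseconds, for illustration only.
    table = build_dispatch_table({
        "softmax": ([2.0, 2.1, 1.9], [1.0, 1.1, 0.9]),
        "matmul": ([1.0, 1.0, 1.0], [1.5, 1.4, 1.6]),
    })
    print(table)
```

A threshold slightly above 1.0 would make the guard more conservative by requiring a clear margin before a generated kernel replaces the eager path.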

Quick start

See system requirements before installing.

python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

cd frontend
jac install

Configure your LLM key in the settings panel after starting, or set ANTHROPIC_API_KEY, OPENAI_API_KEY, or GOOGLE_API_KEY before launch.
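
If you prefer environment variables, a quick pre-launch check can confirm that at least one key is visible to the process. The helper below is a hypothetical convenience script, not part of Kernel Forge; only the three variable names come from this README.

```python
import os
from typing import Mapping, Optional

# Environment variable names as listed in this README.
PROVIDER_KEYS = {
    "Anthropic": "ANTHROPIC_API_KEY",
    "OpenAI": "OPENAI_API_KEY",
    "Google": "GOOGLE_API_KEY",
}


def configured_providers(env: Optional[Mapping[str, str]] = None) -> list[str]:
    """Return the providers whose API key variable is set and non-empty."""
    env = os.environ if env is None else env
    return [name for name, var in PROVIDER_KEYS.items() if env.get(var)]


if __name__ == "__main__":
    found = configured_providers()
    if found:
        print("LLM keys found for:", ", ".join(found))
    else:
        print("No LLM key set; configure one in the settings panel after starting.")
```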

jac start main.jac

Open http://localhost:8000. Create a project, upload your model weights, and click Start Forge.

CLI

For headless or scripted runs, see docs/cli.md.
