170+ science and language models — one endpoint, one key, from protein folds to LLMs.
One point. Every model. Infinite unity.
Deployed on the University of Alberta / AMII Vulcan environment for multi-model GPU inference
Maintained by: Rahim Khoja (khoja1@ualberta.ca) and Karim Ali (kali2@ualberta.ca)
Aleph is a complete multi-model inference platform — a self-deploying RKE2 Kubernetes cluster, HAMi GPU virtualization, KServe/Knative serving, 170+ model definitions, and a FastAPI gateway that unifies them all behind one OpenAI- and Anthropic-compatible endpoint. Protein structure, genomics, materials, climate, astronomy, medical imaging, and LLMs — one key, one URL, scale-to-zero.
Not a chatbot stack. CERN runs a similar KServe-based platform for physics inference at ml.cern.ch; Aleph is that idea applied across all research science — AlphaFold alongside Gemma, MACE alongside Qwen, NeuralGCM alongside DeepSeek. The gateway discovers models from the Kubernetes API — a labeled details.yaml ConfigMap plus live InferenceService state — and routes by model ID to predictor pods through the Knative local gateway. No model names are hardcoded; add a model by applying YAML, and it appears in the catalog. Shaped by the DOE's Genesis Mission push for autonomous science, DeepMind's agent work, Chinese AI labs' open releases, and CERN's production platform.
Models are callable like any other HTTP API — from a Slurm batch job, an agentic pipeline, or a notebook — using a standard OpenAI SDK and a single key. ESMFold to fold a protein, MACE for energy minimization, Aurora for a weather forecast, an LLM to synthesize the results, all through the same endpoint. Idle models drop to zero GPU and wake on first request; science embeddings (SciBERT, ESM2, DNABERT, AstroCLIP) make domain-literature RAG a native use case.
- One endpoint, every model — OpenAI (
/v1/chat/completions,/v1/embeddings,/v1/rerank) and Anthropic (/v1/messages) APIs, plus custom science/vision routes (/v1/science/*,/v1/vision/*,/v1/dock,/v1/forecast, etc.) - Kubernetes-native discovery — the gateway watches the K8s API for
details.yamlConfigMaps (labeledmodel-details=true) and merges liveInferenceServicestate; apply YAML and the model appears in the catalog, no restart, nothing hardcoded - Science models first — proteins, DNA, RNA, molecules, materials, weather, astronomy, medical imaging, time-series, and audio alongside general-purpose LLMs
- Any KServe runtime — KServe is the orchestration layer, not the engine, so a model card can back onto whatever serves it best: vLLM (most LLMs), Hugging Face/TEI (embeddings, rerank), ONNX Runtime (vision), JAX/Lightning/TensorFlow or a custom FastAPI server.py (science), and NVIDIA NIM (boltz-2, openfold-3). Triton, TorchServe, and TensorFlow Serving are equally deployable when a model calls for them.
- Fractional GPU scheduling — HAMi slices each L40S into virtual GPUs (
nvidia.com/gpumem); many models share one physical card - Scale-to-zero + cold-start aware — idle models drop to zero pods; first request gets a
503 + retry-afterwhile the pod wakes; agent loops handle this natively - Catch-all auth — one key, sent however your SDK likes:
Authorization: Bearer(OpenAI/Cohere),x-api-key(Anthropic),api-key(Azure OpenAI),x-goog-api-key(Google), or query string?api_key=/?api-key=/?key=; Tyk normalizes them all intoAuthorization: Bearerbefore auth - Usage accounting — per-request JSON-lines log (identity, tokens, GPU SKU, node, gpu-seconds) + Prometheus metrics on
/metrics; rate-limiting via Tyk - NFS-backed weights — model weights on shared NFS PVCs; survive pod and node churn without re-download
Bake the Warewulf overlays, boot the nodes, and the cluster self-deploys; then issue a key and apply model YAML. Full walkthrough — cluster bring-up, secrets, deploying and testing models, and adding a new one — is in QUICKSTART.md.
The models/ directory contains 170+ models across scientific and language domains:
| Domain | Examples |
|---|---|
| Protein / Structural biology | AlphaFold2, Boltz-2, ESMFold, ESM2, ESM-C 300M, ProstT5, LigandMPNN, DiffDock, SaProt |
| Genomics / DNA / RNA | Nucleotide Transformer, DNABERT-2, GENA-LM, Borzoi, Enformer, Caduceus, RNAbert |
| Materials / Chemistry | MACE-MH-1, MACE-MP, CHGNet, ChemBERTa, MatterSim, CrystalLLM, ChemGPT |
| Weather / Climate | Aurora, GraphCast, FourCastNet3, Pangu-Weather, NeuralGCM, ClimaX, FengWu |
| Astronomy | AstroCLIP, AstroPT, AstroSage, Zoobot |
| Medical / Imaging | MedGemma, BiomedCLIP, TotalSegmentator, MedSAM, ClinicalBERT |
| Vision / 3D | FLUX.1, Kandinsky 3, DUSt3R, MASt3R, YOLOv8, Mask R-CNN, Depth Anything |
| Time-series / Audio | Chronos-Bolt, TimesFM, TTM, XTTS-v2, BirdNET, CLAP |
| Language models | Gemma 3/4, Qwen 3/3.5/3.6, GLM-4/Z1, GPT OSS 20B/120B, DeepSeek R1, Command-R |
| Science NLP | SciBERT, BioGPT, SciNCL, SpecTer2, OceanGPT, GeoGalactica, OpenBioLLM |
Each model in models/<name>/ includes a details.yaml card, inferenceservice.yaml, pvc.yaml, and a test.py battery. Adding one is a few files plus kubectl apply — see QUICKSTART.md.
HPC job / SDK / curl
│ HTTPS (OpenAI or Anthropic dialect)
▼
┌─────────────────┐
│ MetalLB │ public VIP (L2) advertised out the head node's public NIC;
│ │ hands traffic to the Traefik LoadBalancer Service
└────────┬────────┘
▼
┌─────────────────┐ ◄── cert-manager + Let's Encrypt (ACME HTTP-01) issues the
│ Traefik (RKE2) │ public TLS cert; Traefik terminates HTTPS here, redirects
│ (TLS terminate) │ :80 → :443, and routes by Host header to the Tyk Service
└────────┬────────┘
▼
┌─────────────────┐
│ Tyk OSS │ catch-all auth (Bearer / x-api-key / api-key / x-goog-api-key / ?api_key),
│ │ rate-limit, JSVM middleware stamps X-Aleph-* identity headers
└────────┬────────┘
▼
┌─────────────────┐
│ model-gateway │ FastAPI: OAI⇄Anthropic translation, card-based routing,
│ (FastAPI) │ cold-start guard (503 + Retry-After), usage accounting
└────────┬────────┘
▼
┌─────────────────┐
│ Istio mesh │ service mesh Knative programs; knative-local-gateway routes
│ │ by Host header to the live revision — or to the Knative
│ │ activator, which holds the request while a cold pod boots
└────────┬────────┘
▼
┌─────────────────┐
│ KServe ISVC pod │ vLLM / TEI / ONNX / JAX / NIM / custom FastAPI on a
│ │ HAMi vGPU slice; weights on NFS PVC
└─────────────────┘
The gateway image is published to Docker Hub (rkhoja/aleph) on every push to main touching gateway/** — tagged latest (moving) and gateway-<sha> (immutable). Roll out a new build:
kubectl rollout restart deploy/model-gateway -n modelsWhat boots, in order. Control-plane nodes carry the full RKE2 auto-deploy manifest set, applied on first boot:
| Manifest | Does |
|---|---|
00–01 cert-manager + ClusterIssuer |
ACME/TLS for the public endpoint |
10 HAMi |
vGPU device plugin + scheduler (DaemonSet, gpu=on nodes only) |
11 node-labeler |
DaemonSet detects each worker's GPU/CPU/RAM and stamps aleph.* node labels — every usage record carries real hardware provenance |
30 NFS |
nfs-models StorageClass — the default; model weights live here |
40–41 MetalLB |
L2 load-balancer + VIP IPAddressPool |
50–54 Tyk |
Redis, OSS gateway, Traefik IngressRoute + TLS exposure (RKE2-bundled Traefik fronts it), API definitions, JSVM middleware |
60 Istio |
Service mesh + scaffolding the serving stack needs |
61–62 Knative + KServe |
Scale-to-zero autoscaling and the InferenceService CRD |
63 model-gateway |
The FastAPI router (runs on control-plane nodes only) |
70 RDMA device plugin |
Exposes the RoCE NIC as rdma/roce so NCCL runs collectives over RDMA — required for multi-GPU tensor-parallel models |
- University of Alberta Research Computing
- Alberta Machine Intelligence Institute (AMII)
- Digital Research Alliance of Canada
- HAMi — Heterogeneous AI Computing Virtualization Middleware
- WareWulf RKE2 + Hami Node Image
Many Bothans died to bring us this information. This project is provided as-is, but reasonable questions may be answered based on my coffee intake or mood. ;)
Feel free to open an issue or email khoja1@ualberta.ca or kali2@ualberta.ca for U of A related deployments.
This project is released under the MIT License — use it, modify it, distribute it, include it in proprietary software. Keep the copyright notice. That's it.
Full license text: MIT License
The Research Computing Group supports high-performance computing, data-intensive research, and advanced infrastructure for researchers at the University of Alberta and across Canada.
We help design and operate compute environments that power innovation — from AI training clusters to national research infrastructure.
