Zaher Khateeb zahere

Zaher Khateeb | AI/ML Engineer

Multi-agent systems • LLM infrastructure • Formal methods

Founder of AgentiCraft — an 8-layer service mesh for production multi-agent AI systems. Solo-built end-to-end across Rust, Python, Go, and TypeScript.

I work at the intersection of formal methods and production engineering — building systems that are provably correct, not just empirically okay.

What I've Built

AgentiCraft — 8-layer production platform for multi-agent AI:

Unified inference across 17 LLM providers with Thompson Sampling routing and per-destination circuit breakers — 32% cost reduction in production
MCP/A2A protocol interoperability layer with native codec handlers — 45% reduction in redundant LLM calls through lossless bidirectional translation
Full-stack platform — FastAPI control plane, Next.js dashboard on Vercel, Python and TypeScript SDKs (Apache 2.0)
Formal verification foundation — CSP process algebra, multiparty session types, CTL model checking, probabilistic verification (open source on PyPI, Apache 2.0)
Shipped multi-agent Telegram bot — 6 domain agents, persistent memory, human-in-the-loop approvals, deployed to production via Docker Compose with CI/CD

Architecture spans Layers 0–7: formal verification, NATS transport, Rust data plane, Python control plane, Kubernetes operator, developer SDK, app framework, end-user products.
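
For readers unfamiliar with Thompson Sampling routing, here is a minimal sketch of the idea. It is not AgentiCraft's code. It assumes a Beta-Bernoulli model in which each call is scored as a plain success or failure; the production reward signal (cost, latency, quality) is not described here, and the provider names, `ThompsonRouter`, and `simulate` are hypothetical. Circuit breaking is left out.

```python
import random
from dataclasses import dataclass


@dataclass
class ProviderArm:
    """Beta posterior over one provider's success probability (assumed model)."""
    successes: int = 1  # Beta(1, 1) prior, i.e. uniform
    failures: int = 1

    def sample(self, rng: random.Random) -> float:
        # Draw a plausible success rate from the current posterior.
        return rng.betavariate(self.successes, self.failures)

    def update(self, success: bool) -> None:
        if success:
            self.successes += 1
        else:
            self.failures += 1


class ThompsonRouter:
    """Hypothetical router: one posterior per provider name."""

    def __init__(self, providers, seed=None):
        self.arms = {name: ProviderArm() for name in providers}
        self.rng = random.Random(seed)

    def choose(self) -> str:
        # Sample once from every posterior and route to the highest draw.
        return max(self.arms, key=lambda name: self.arms[name].sample(self.rng))

    def record(self, provider: str, success: bool) -> None:
        self.arms[provider].update(success)


def simulate(true_rates, rounds=2000, seed=0):
    """Route `rounds` calls against made-up success rates; return call counts."""
    router = ThompsonRouter(true_rates, seed=seed)
    env = random.Random(seed + 1)
    counts = {name: 0 for name in true_rates}
    for _ in range(rounds):
        name = router.choose()
        counts[name] += 1
        router.record(name, env.random() < true_rates[name])
    return counts


if __name__ == "__main__":
    print(simulate({"provider_a": 0.9, "provider_b": 0.5, "provider_c": 0.3}))
```

Because each choice is a draw from the posterior rather than its mean, weaker providers still receive occasional traffic, so the router can notice if one of them improves.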

Current Research

"When Does Topology Matter? Reliability Polynomials for Stochastic Service Meshes" (in progress)

A necessary-and-sufficient characterization of when network topology affects multi-agent resilience. Core result: under crash-stop faults all mesh topologies are equivalent (a mathematical identity), while Byzantine faults break that equivalence in ways determined by the coordination protocol, not the graph structure.

Validated across ~34,000 LLM experiments spanning 13 coordination topologies, two fault regimes, two task domains, and two model generations.

Technical Focus

Multi-Agent Systems — mesh coordination, fault-dependent topology selection, Byzantine fault tolerance for LLM systems, MCP/A2A protocol integration

Formal Methods — session type theory for deadlock-freedom guarantees, runtime property verification, CSP process algebra, refinement checking

LLM Infrastructure — provider-agnostic inference abstraction, statistical circuit breakers with CUSUM-optimal change detection, quality-weighted reliability theory

Distributed Systems — consensus protocols, fault injection, observability, Kubernetes-native deployment
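
To make the CUSUM-based circuit breaker idea concrete, here is a minimal sketch of the standard one-sided CUSUM recursion applied to a stream of call outcomes. It is an illustration under stated assumptions, not the AgentiCraft implementation: `CusumBreaker`, its parameter names, and the example values are all hypothetical, and the healthy failure rate is assumed to be known in advance.

```python
from dataclasses import dataclass


@dataclass
class CusumBreaker:
    """One-sided CUSUM over a failure indicator (hypothetical sketch)."""
    baseline: float   # assumed failure rate when the destination is healthy
    slack: float      # allowance k: drift smaller than this is ignored
    threshold: float  # alarm level h: the breaker opens when the statistic exceeds it
    stat: float = 0.0
    is_open: bool = False

    def observe(self, failed: bool) -> bool:
        """Feed one call outcome; return True once the breaker is open."""
        x = 1.0 if failed else 0.0
        # Accumulate evidence of an upward shift, clamped at zero so that
        # long healthy stretches do not build up negative credit.
        self.stat = max(0.0, self.stat + x - self.baseline - self.slack)
        if self.stat > self.threshold:
            self.is_open = True
        return self.is_open

    def reset(self) -> None:
        """Close the breaker and discard accumulated evidence."""
        self.stat = 0.0
        self.is_open = False


if __name__ == "__main__":
    breaker = CusumBreaker(baseline=0.05, slack=0.05, threshold=3.0)
    for outcome in [True] * 5:  # a burst of consecutive failures
        print(breaker.observe(outcome))
```

Unlike a fixed "N failures in a window" rule, the statistic decays during healthy traffic and grows under sustained degradation, so isolated failures at the baseline rate do not trip the breaker.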

Tech Stack

Languages: Python, Rust, Go, TypeScript, SQL, Bash

AI/ML: PyTorch, Transformers, vLLM, LangChain, RAG, Fine-tuning (LoRA/QLoRA), Inference Optimization, Agentic Workflows

Infrastructure: FastAPI, Pydantic, Next.js, Kubernetes, Docker, CI/CD, gRPC/Protobuf, OpenTelemetry, Prometheus, PostgreSQL, Redis, Qdrant

Cloud: AWS, GCP, Vercel, Nebius Cloud

Background

Founder & Lead Architect, AgentiCraft (Oct 2024 – Present)
AI & Infrastructure Engineer, Visual Arena, Gothenburg (Nov 2023 – Oct 2024)
AI Performance Engineering, Nebius Academy, Tel Aviv University (Mar 2026 – Present)
Advanced Data Science & AI (Y-DATA), Nebius Academy, Tel Aviv University (Nov 2023 – Aug 2024)
B.Sc. Industrial Engineering & Management (Data Science concentration), Tel Aviv University (2017 – 2022)

Let's Connect

Languages

English (Fluent) · Hebrew (Fluent) · Arabic (Native)

Last updated: April 2026
