zahere/README.md

Zaher Khateeb | AI/ML Engineer

Multi-agent systems • LLM infrastructure • Formal methods

Founder of AgentiCraft — an 8-layer service mesh for production multi-agent AI systems. Solo-built end-to-end across Rust, Python, Go, and TypeScript.

I work at the intersection of formal methods and production engineering — building systems that are provably correct, not just empirically okay.

Open to Senior AI/ML + Platform Engineering roles · Website · LinkedIn


What I've Built

AgentiCraft — 8-layer production platform for multi-agent AI:

  • Unified inference across 17 LLM providers with Thompson Sampling routing and per-destination circuit breakers — 32% cost reduction in production
  • MCP/A2A protocol interoperability layer with native codec handlers — 45% reduction in redundant LLM calls through lossless bidirectional translation
  • Full-stack platform — FastAPI control plane, Next.js dashboard on Vercel, Python and TypeScript SDKs (Apache 2.0)
  • Formal verification foundation — CSP process algebra, multiparty session types, CTL model checking, probabilistic verification (open source on PyPI, Apache 2.0)
  • Shipped multi-agent Telegram bot — 6 domain agents, persistent memory, human-in-the-loop approvals, deployed to production via Docker Compose with CI/CD
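As a rough illustration of the Thompson Sampling routing mentioned in the first bullet, here is a minimal Beta-Bernoulli sketch. It is not AgentiCraft's implementation: the `ThompsonRouter` class, its method names, and the binary success/failure reward are all assumptions made for this example, and the per-destination circuit breakers are left out.

```python
import random


class ThompsonRouter:
    """Hypothetical router: one Beta(alpha, beta) posterior per provider."""

    def __init__(self, providers, seed=None):
        self._rng = random.Random(seed)
        # Start every provider at a uniform Beta(1, 1) prior.
        self.stats = {name: [1.0, 1.0] for name in providers}

    def choose(self):
        """Sample a success rate from each posterior and pick the highest."""
        samples = {
            name: self._rng.betavariate(alpha, beta)
            for name, (alpha, beta) in self.stats.items()
        }
        return max(samples, key=samples.get)

    def record(self, provider, success):
        """Update the chosen provider's posterior with the observed outcome."""
        if success:
            self.stats[provider][0] += 1.0
        else:
            self.stats[provider][1] += 1.0
```

Each request calls `choose()`, sends the call to the returned provider, and reports the outcome through `record()`. Traffic drifts toward providers that succeed more often, while weaker ones are still sampled occasionally.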

Architecture spans Layers 0–7: formal verification, NATS transport, Rust data plane, Python control plane, Kubernetes operator, developer SDK, app framework, end-user products.


Current Research

"When Does Topology Matter? Reliability Polynomials for Stochastic Service Meshes" (in progress)

An if-and-only-if characterization of when network topology affects multi-agent resilience. Core result: under crash-stop faults all mesh topologies are equivalent (a mathematical identity), while Byzantine faults break that equivalence in ways determined by the coordination protocol, not the graph structure.

Validated across ~34,000 LLM experiments spanning 13 coordination topologies, two fault regimes, two task domains, and two model generations.


Technical Focus

Multi-Agent Systems — mesh coordination, fault-dependent topology selection, Byzantine fault tolerance for LLM systems, MCP/A2A protocol integration

Formal Methods — session type theory for deadlock-freedom guarantees, runtime property verification, CSP process algebra, refinement checking

LLM Infrastructure — provider-agnostic inference abstraction, statistical circuit breakers with CUSUM-optimal change detection, quality-weighted reliability theory

Distributed Systems — consensus protocols, fault injection, observability, Kubernetes-native deployment
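To make the CUSUM change detection mentioned under LLM Infrastructure concrete, here is a minimal one-sided CUSUM sketch for Bernoulli failures. It is an illustration only, not the API of stochastic-circuit-breaker: `CusumDetector`, the nominal and degraded failure rates `p0` and `p1`, and the threshold value are assumptions chosen for this example, and no breaker state machine is modelled.

```python
import math


class CusumDetector:
    """Hypothetical one-sided CUSUM over a stream of success/failure outcomes."""

    def __init__(self, p0=0.05, p1=0.5, threshold=5.0):
        if not 0.0 < p0 < p1 < 1.0:
            raise ValueError("require 0 < p0 < p1 < 1")
        # Log-likelihood ratio of 'degraded' vs 'nominal' for each outcome.
        self.llr_fail = math.log(p1 / p0)
        self.llr_ok = math.log((1.0 - p1) / (1.0 - p0))
        self.threshold = threshold
        self.stat = 0.0

    def update(self, failed):
        """Fold in one outcome; return True once the statistic crosses the threshold."""
        step = self.llr_fail if failed else self.llr_ok
        # The statistic is clamped at zero so long healthy runs do not mask a later change.
        self.stat = max(0.0, self.stat + step)
        return self.stat >= self.threshold
```

With these defaults each failure adds about 2.30 and each success subtracts about 0.64, so three consecutive failures from a clean state cross the threshold while isolated failures decay away.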


Tech Stack

Languages: Python, Rust, Go, TypeScript, SQL, Bash

AI/ML: PyTorch, Transformers, vLLM, LangChain, RAG, Fine-tuning (LoRA/QLoRA), Inference Optimization, Agentic Workflows

Infrastructure: FastAPI, Pydantic, Next.js, Kubernetes, Docker, CI/CD, gRPC/Protobuf, OpenTelemetry, Prometheus, PostgreSQL, Redis, Qdrant

Cloud: AWS, GCP, Vercel, Nebius Cloud


Background

  • Founder & Lead Architect, AgentiCraft (Oct 2024 – Present)
  • AI & Infrastructure Engineer, Visual Arena, Gothenburg (Nov 2023 – Oct 2024)
  • AI Performance Engineering, Nebius Academy, Tel Aviv University (Mar 2026 – Present)
  • Advanced Data Science & AI (Y-DATA), Nebius Academy, Tel Aviv University (Nov 2023 – Aug 2024)
  • B.Sc. Industrial Engineering & Management (Data Science concentration), Tel Aviv University (2017 – 2022)

Let's Connect

Email · LinkedIn · AgentiCraft

Languages

English (Fluent) · Hebrew (Fluent) · Arabic (Native)

Last updated: April 2026

Pinned

  1. stochastic-circuit-breaker (Public)

    Statistically optimal circuit breaker for stochastic systems. 4-state CUSUM-based FSM with provably minimax detection delay (Moustakides 1986). Zero dependencies.

    Python

  2. reliability-polynomials (Public)

    Generalized reliability polynomials for quality-weighted network analysis. Every reliability library assumes binary survival — this one doesn't. Zero dependencies.

    Python
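For context on the second pinned project, the sketch below computes the classical all-terminal reliability polynomial, i.e. the binary-survival baseline that the description contrasts itself with. It is not the reliability-polynomials API and does not cover the quality-weighted generalization. The function names are hypothetical, edges are assumed to survive independently with probability `p`, and the brute-force enumeration is only practical for small graphs.

```python
from itertools import combinations


def is_connected(nodes, edges):
    """Return True if the undirected graph (nodes, edges) is connected."""
    nodes = list(nodes)
    if not nodes:
        return True
    adjacency = {n: set() for n in nodes}
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    seen = {nodes[0]}
    stack = [nodes[0]]
    while stack:
        u = stack.pop()
        for v in adjacency[u]:
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return len(seen) == len(nodes)


def connected_subgraph_counts(nodes, edges):
    """counts[k] = number of k-edge subsets that keep all nodes connected."""
    m = len(edges)
    return [
        sum(1 for subset in combinations(edges, k) if is_connected(nodes, subset))
        for k in range(m + 1)
    ]


def reliability(nodes, edges, p):
    """All-terminal reliability: sum over k of counts[k] * p^k * (1 - p)^(m - k)."""
    m = len(edges)
    counts = connected_subgraph_counts(nodes, edges)
    return sum(n * p**k * (1.0 - p) ** (m - k) for k, n in enumerate(counts))
```

For a triangle, the counts are `[0, 0, 3, 1]`, giving R(p) = 3p²(1 − p) + p³, which evaluates to 0.5 at p = 0.5.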