I'm an AI Engineer working at the intersection of LLM inference systems, distributed backends, and cloud-native infrastructure. I focus on taking models from research to production -- building the serving layers, orchestration pipelines, and GPU infrastructure that make LLM systems work at scale.
Currently working on:
- Low-latency, fault-tolerant backends for LLM and GenAI workloads
- LLM serving infrastructure -- vLLM, TensorRT-LLM, KV cache optimization
- GPU workload orchestration on Kubernetes with autoscaling and workload-aware scheduling
- Agentic AI backends with multi-agent orchestration, tool use, and planning
- MCP (Model Context Protocol) servers and integrations for tool-augmented LLMs
- Observability for AI systems -- latency, drift, GPU utilization
Core Competencies — Systems, LLMs, Infrastructure
Distributed Systems & Backend
- System Design -- microservices, event-driven architecture, CQRS, domain-driven design
- Scalability -- horizontal sharding, consistent hashing, load balancing, distributed caching (Redis, Memcached)
- Performance -- gRPC and protobuf for low-latency communication, async I/O, connection pooling, concurrency patterns
- Resilience -- circuit breakers, bulkhead isolation, rate limiting, exponential backoff, graceful degradation
- Observability -- distributed tracing (OpenTelemetry, Jaeger), metrics (Prometheus, Grafana), structured logging
LLMs & GenAI
- RAG Systems -- hybrid retrieval (dense + sparse), cross-encoder reranking, query decomposition, evaluation with RAGAS
- Fine-tuning -- LoRA, QLoRA, RLHF, DPO, dataset curation, evaluation pipelines, experiment tracking (MLflow, W&B)
- LLM Agents -- multi-agent orchestration, ReAct and plan-and-execute patterns, tool use, sandboxed code execution
- MCP -- Model Context Protocol server development, tool registries, context injection, integration with Claude and GPT
- Inference Optimization -- vLLM, TGI, TensorRT-LLM, KV cache tuning, continuous batching, speculative decoding, quantization (GPTQ, AWQ)
Technical Stack — Languages, Frameworks, Infrastructure
AI / ML
- Languages: Python (primary), some C++ for performance-critical paths
- Frameworks: PyTorch, Transformers, LlamaIndex, LangChain, LangGraph
- Agents: MCP servers, CrewAI, AutoGen, Claude Agent SDK, PydanticAI
- Inference: vLLM (PagedAttention), TGI, ONNX Runtime, Triton Inference Server, TensorRT-LLM
- Training: PEFT (LoRA/QLoRA), DeepSpeed ZeRO, bitsandbytes quantization, Unsloth
- Evaluation: RAGAS, LangSmith, custom eval harnesses
Backend & Distributed Systems
- Languages: Go (high-throughput services), Python (FastAPI, async services)
- Messaging: Apache Kafka (event streaming), Redis Streams, RabbitMQ
- Protocols: gRPC with protobuf, REST/OpenAPI, WebSockets for real-time
- Patterns: Microservices, event-driven, CQRS, saga pattern for distributed transactions
- Observability: OpenTelemetry (traces + metrics), Prometheus, Grafana, Jaeger, structured JSON logging
Cloud & Infrastructure
- Orchestration: Kubernetes (EKS, GKE), Helm charts, Kustomize
- IaC: Terraform for multi-cloud provisioning, Pulumi
- Cloud: AWS (primary -- EC2, EKS, S3, SageMaker), GCP (GKE, Vertex AI)
- Model Serving: KServe, Ray Serve, vLLM Operator, Triton
- GPU Ops: NVIDIA GPU Operator, DCGM exporter, MIG partitioning, node affinity for GPU types (T4, A10G, A100)
- GitOps: ArgoCD for continuous deployment, GitHub Actions for CI
Data & Storage
- Vector DBs: Weaviate (primary), Qdrant, Pinecone, pgvector -- used for RAG, semantic search, similarity retrieval
- Databases: PostgreSQL (OLTP), MongoDB (document store), Redis (caching + session store)
- Processing: Kafka Streams for real-time ETL, Apache Airflow for batch orchestration
- Search: Elasticsearch for hybrid retrieval (BM25 + dense vectors)
Featured Projects
Enterprise RAG System — Production-grade retrieval-augmented generation platform
Production RAG platform designed for accuracy and cost efficiency at scale.
- Hybrid retrieval pipeline combining BM25 sparse search with dense vector embeddings, followed by cross-encoder reranking for precision (sketched below)
- Multi-LLM router supporting OpenAI, Anthropic, and local models (vLLM) with automatic fallback and cost-aware routing
- Advanced chunking strategies -- recursive splitting, semantic chunking, and parent-child document linking for context preservation
- Evaluation framework using RAGAS (faithfulness, answer relevancy, context precision) with automated regression testing
- Semantic caching layer with Redis to deduplicate similar queries and reduce LLM API costs by ~40%
- Guardrails for hallucination detection, PII filtering, and output validation
Tech: Python, FastAPI, LangChain, Weaviate, vLLM, Redis, RAGAS
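
A rough sketch of the retrieve-then-rerank flow, using `rank_bm25` and `sentence-transformers` as stand-ins; the model names, fusion weight, and toy corpus are illustrative, not the platform's actual configuration:

```python
from rank_bm25 import BM25Okapi
from sentence_transformers import CrossEncoder, SentenceTransformer
import numpy as np

corpus = [
    "vLLM uses PagedAttention to manage the KV cache efficiently.",
    "Cross-encoder rerankers score query-document pairs jointly.",
    "Redis can serve as a semantic cache in front of LLM APIs.",
]

bm25 = BM25Okapi([doc.lower().split() for doc in corpus])        # sparse leg
encoder = SentenceTransformer("all-MiniLM-L6-v2")                # dense leg (bi-encoder)
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # precision stage
doc_vecs = encoder.encode(corpus, normalize_embeddings=True)

def minmax(x: np.ndarray) -> np.ndarray:
    return (x - x.min()) / (x.max() - x.min() + 1e-9)

def hybrid_retrieve(query: str, k: int = 3, alpha: float = 0.5) -> list[str]:
    """Blend normalized BM25 and dense scores, then rerank the fused top-k."""
    sparse = np.asarray(bm25.get_scores(query.lower().split()), dtype=float)
    dense = doc_vecs @ encoder.encode(query, normalize_embeddings=True)
    fused = alpha * minmax(sparse) + (1 - alpha) * minmax(dense)
    candidates = [corpus[i] for i in np.argsort(fused)[::-1][:k]]
    scores = reranker.predict([(query, c) for c in candidates])  # joint (query, doc) scoring
    return [c for _, c in sorted(zip(scores, candidates), reverse=True)]

print(hybrid_retrieve("how does vLLM manage KV cache memory?"))
```

Min-max fusion is just one choice here; reciprocal rank fusion is a common drop-in alternative when the two score distributions are hard to calibrate.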
Autonomous LLM Agents — Multi-agent orchestration with MCP integration
Multi-agent system implementing production agentic AI patterns with tool use and planning.
- Agent patterns -- ReAct (reason + act), plan-and-execute with dynamic replanning, reflection loops for self-correction
- MCP server integration enabling agents to access external tools (file systems, databases, APIs) through Model Context Protocol
- Dynamic tool registry allowing runtime tool discovery, schema validation, and permission-scoped execution (sketched below)
- Multi-tier memory -- short-term (conversation buffer), episodic (vector store), and long-term (knowledge graph) for context management
- Sandboxed execution environment with Docker containers for safe code execution, resource limits, and state snapshotting
- Multi-agent communication using a supervisor pattern with specialized sub-agents for research, coding, and review tasks
Tech: Python, LangGraph, MCP, GPT-4, Claude, Weaviate, Docker
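
A minimal, self-contained sketch of the permission-scoped tool registry idea; the real system exposes tools over MCP, and the class names, scope strings, and schema derivation here are illustrative assumptions:

```python
import inspect
from dataclasses import dataclass, field
from typing import Any, Callable

@dataclass
class Tool:
    name: str
    fn: Callable[..., Any]
    description: str
    required_scope: str
    schema: dict = field(default_factory=dict)

class ToolRegistry:
    def __init__(self) -> None:
        self._tools: dict[str, Tool] = {}

    def register(self, description: str, required_scope: str):
        """Decorator: derive a simple parameter schema from the function signature."""
        def wrap(fn: Callable[..., Any]) -> Callable[..., Any]:
            schema = {name: str(p.annotation)
                      for name, p in inspect.signature(fn).parameters.items()}
            self._tools[fn.__name__] = Tool(fn.__name__, fn, description, required_scope, schema)
            return fn
        return wrap

    def list_tools(self) -> list[dict]:
        """Runtime discovery: what an agent sees when it asks for available tools."""
        return [{"name": t.name, "description": t.description, "schema": t.schema}
                for t in self._tools.values()]

    def call(self, name: str, caller_scopes: set[str], **kwargs: Any) -> Any:
        tool = self._tools[name]
        if tool.required_scope not in caller_scopes:
            raise PermissionError(f"missing scope {tool.required_scope!r} for {name!r}")
        return tool.fn(**kwargs)

registry = ToolRegistry()

@registry.register(description="Read a UTF-8 text file", required_scope="fs:read")
def read_file(path: str) -> str:
    with open(path, encoding="utf-8") as f:
        return f.read()

print(registry.list_tools())
# registry.call("read_file", caller_scopes={"fs:read"}, path="README.md")
```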
LLM Fine-tuning Platform — End-to-end fine-tuning with experiment tracking and deployment
Platform for fine-tuning LLMs from dataset preparation to production deployment.
- Parameter-efficient fine-tuning with LoRA and QLoRA, supporting 4-bit quantization for training on consumer GPUs (see the sketch below)
- Dataset pipeline -- ingestion, deduplication, quality filtering, instruction formatting (Alpaca, ShareGPT, ChatML)
- Comprehensive evaluation suite -- perplexity, BLEU, ROUGE, human-eval, task-specific benchmarks with statistical significance testing
- Experiment tracking via MLflow and Weights & Biases -- hyperparameter sweeps, loss curves, GPU utilization monitoring
- Auto-deployment pipeline converting fine-tuned checkpoints to vLLM-compatible format with automated A/B testing
- Multi-GPU training support through DeepSpeed ZeRO Stage 2/3 for models that don't fit on a single GPU
Tech: PyTorch, Transformers, PEFT, DeepSpeed, MLflow, Weights & Biases, vLLM
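
A hedged sketch of the QLoRA setup: a 4-bit NF4 base model with LoRA adapters attached via PEFT. The model id, ranks, and target modules are example values rather than the platform's actual config, and running it requires a GPU plus access to the base weights:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

base = "meta-llama/Llama-3.1-8B"  # placeholder model id

# Load the base model in 4-bit (NF4) so it fits on a single consumer GPU
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(base, quantization_config=bnb, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(base)

# Attach low-rank adapters; only the adapter weights are trained
model = prepare_model_for_kbit_training(model)
lora = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # typically well under 1% of total parameters
```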
AI Gateway Microservices — High-performance Go backend for LLM routing and observability
Go-based API gateway designed for routing, caching, and observing LLM traffic at scale.
- Multi-provider routing with weighted load balancing across OpenAI, Anthropic, and self-hosted models, with automatic failover on errors or latency spikes
- Semantic caching using embedding similarity to serve cached responses for semantically equivalent queries, reducing costs
- Token-based rate limiting with per-user and per-organization quotas, sliding window counters, and burst allowances (illustrated below)
- Full OpenTelemetry instrumentation -- request tracing across services, latency histograms, token usage metrics, cost attribution per tenant
- gRPC inter-service communication with protobuf schemas, connection pooling, and circuit breakers for downstream service protection
- Admin dashboard with real-time metrics, usage analytics, and cost breakdowns per model and per tenant
Tech: Go, gRPC, Redis, PostgreSQL, OpenTelemetry, Prometheus, Docker
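
The gateway itself is Go; the sliding-window, token-based limiter below is a compact Python illustration of the idea, with the quota numbers and key format made up for the example:

```python
import time
from collections import defaultdict, deque

class TokenRateLimiter:
    """Per-key sliding window over (timestamp, tokens_used) events."""

    def __init__(self, tokens_per_window: int, window_seconds: float, burst: int = 0):
        self.limit = tokens_per_window
        self.window = window_seconds
        self.burst = burst                      # extra headroom above the steady quota
        self.events: dict[str, deque] = defaultdict(deque)

    def allow(self, key: str, tokens: int) -> bool:
        now = time.monotonic()
        q = self.events[key]
        # Evict events that have slid out of the window
        while q and now - q[0][0] > self.window:
            q.popleft()
        used = sum(t for _, t in q)
        if used + tokens > self.limit + self.burst:
            return False
        q.append((now, tokens))
        return True

limiter = TokenRateLimiter(tokens_per_window=100_000, window_seconds=60, burst=10_000)
if limiter.allow(key="org:acme:user:42", tokens=1_850):
    pass  # forward the request to the routed provider
else:
    pass  # return 429 with a Retry-After hint
```

In the actual service the counters would live in shared storage (e.g. Redis) rather than process memory so every gateway replica enforces the same quota.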
GPU Kubernetes Platform — Production K8s infrastructure for GPU workloads and model serving
Terraform-provisioned Kubernetes platform optimized for GPU-intensive ML workloads.
- Multi-tier GPU node pools -- T4 (dev/inference), A10G (fine-tuning), A100 (large-scale training) with workload-aware scheduling (see the scheduling sketch below)
- Model serving stack -- KServe for autoscaling inference endpoints, vLLM Operator for LLM-specific serving, Triton for multi-model serving
- GPU monitoring via DCGM exporter -- utilization, memory, temperature, power metrics flowing to Prometheus/Grafana dashboards
- MIG (Multi-Instance GPU) partitioning on A100s to run multiple smaller models on a single GPU for cost efficiency
- GitOps workflow with ArgoCD for declarative cluster state, Helm charts for service packaging, automated rollbacks on health check failures
- Cost optimization -- spot instance integration, cluster autoscaler with GPU-aware binpacking, idle GPU detection and scale-down
Tech: Terraform, Kubernetes, Helm, ArgoCD, KServe, vLLM Operator, NVIDIA GPU Operator, Prometheus
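
A sketch of how a serving pod gets pinned to a specific GPU pool, written with the Kubernetes Python client; the `gpu-type` label, toleration, and resource numbers are assumptions about how the node pools might be labeled, not the platform's actual manifests:

```python
from kubernetes import client

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="vllm-inference", labels={"app": "vllm"}),
    spec=client.V1PodSpec(
        # Schedule onto the A10G pool (the mid-tier in the layout above)
        node_selector={"gpu-type": "a10g"},
        tolerations=[client.V1Toleration(
            key="nvidia.com/gpu", operator="Exists", effect="NoSchedule")],
        containers=[client.V1Container(
            name="vllm",
            image="vllm/vllm-openai:latest",
            resources=client.V1ResourceRequirements(
                limits={"nvidia.com/gpu": "1", "memory": "24Gi"},
                requests={"cpu": "4", "memory": "16Gi"},
            ),
        )],
    ),
)

# Render the manifest a GitOps repo would hold (ArgoCD applies it declaratively)
print(client.ApiClient().sanitize_for_serialization(pod))
```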
LLM Engineering Fundamentals — Complete transformer from scratch, tokenization to generation
Educational implementation building a transformer from the ground up in pure NumPy.
- BPE tokenizer implemented from scratch -- byte-pair encoding with vocabulary training, merge rules, and special token handling
- Positional encodings -- RoPE (Rotary Position Embeddings) and ALiBi (Attention with Linear Biases) with mathematical derivations
- Multi-head attention with causal masking, scaled dot-product, KV caching for efficient autoregressive generation
- Decoding strategies -- greedy, beam search (with length normalization), top-k sampling, nucleus (top-p) sampling, temperature scaling (top-p sketched below)
- Full training loop with AdamW optimizer, learning rate warmup + cosine decay, gradient clipping, and checkpointing
- 148+ passing tests across 10 project modules covering tokenization, embeddings, attention, FFN, training, and generation
Tech: Python, NumPy (zero external ML dependencies)
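
In the spirit of the repo, a NumPy-only sketch of temperature scaling plus nucleus (top-p) sampling; shapes and thresholds are illustrative and the repo's actual implementation may differ in detail:

```python
import numpy as np

def sample_top_p(logits: np.ndarray, temperature: float = 0.8, top_p: float = 0.9,
                 rng: np.random.Generator | None = None) -> int:
    """Sample one token id from raw logits using temperature + nucleus filtering."""
    rng = rng or np.random.default_rng()
    scaled = logits / max(temperature, 1e-6)
    # Softmax, shifted for numerical stability
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    # Keep the smallest set of tokens whose cumulative probability reaches top_p
    order = np.argsort(probs)[::-1]
    cutoff = np.searchsorted(np.cumsum(probs[order]), top_p) + 1
    nucleus = order[:cutoff]
    nucleus_probs = probs[nucleus] / probs[nucleus].sum()
    return int(rng.choice(nucleus, p=nucleus_probs))

vocab_logits = np.random.default_rng(0).normal(size=32)  # fake logits over a toy vocabulary
print(sample_top_p(vocab_logits))
```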
Research — Exploring attention variants in a minimal GPT
Systematic comparison of 6 attention mechanisms in a zero-dependency GPT implementation (~12K parameters) across 3 rounds of experiments.
Variants tested:
- Standard multi-head attention (baseline)
- Gated value attention -- learnable gates on value projections (sketched below)
- Relational scoring -- explicit pairwise token relationship modeling
- Salience weighting -- dynamic importance scoring before softmax
- Context-augmented attention -- global context vector injected into attention computation
- Difference attention -- attention based on token difference vectors
Key findings:
- The gated-context variant consistently outperformed standard attention on character-level generation (lower loss, faster convergence)
- Salience weighting showed the most improvement on longer sequences but added ~15% overhead
- Difference attention underperformed on small models but may scale better (untested at larger scales)
- All experiments run with identical hyperparameters, seeds, and training data for fair comparison
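
One plausible NumPy reading of the gated-value idea (a learned sigmoid gate applied to the value projection); the experiments' actual formulation may differ, so treat this as an illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 8, 16                                   # sequence length, model width
x = rng.normal(size=(T, d))
Wq, Wk, Wv, Wg = (rng.normal(scale=d ** -0.5, size=(d, d)) for _ in range(4))

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

q, k, v = x @ Wq, x @ Wk, x @ Wv
gate = 1.0 / (1.0 + np.exp(-(x @ Wg)))         # sigmoid gate, one value per channel
v = gate * v                                    # gated value projection

scores = (q @ k.T) / np.sqrt(d)
scores = np.where(np.tril(np.ones((T, T), dtype=bool)), scores, -np.inf)  # causal mask
out = softmax(scores) @ v
print(out.shape)                                # (T, d)
```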
- Twitter/X: @Ny8Sanket
- LinkedIn: ny8sanket
- Portfolio: sanketny8.github.io
- Daily Digest: ai-daily-digest -- auto-curated trending AI papers, repos & blogs
- AI Bookmarks: sanketny8.github.io/ai-bookmarks -- curated AI resources synced from browser
Open to: System design discussions, backend architecture, open source collaboration, code reviews