
Twitter LinkedIn Portfolio

AI Daily Digest

AI Bookmarks


About Me

I'm an AI Engineer working at the intersection of LLM inference systems, distributed backends, and cloud-native infrastructure. I focus on taking models from research to production -- building the serving layers, orchestration pipelines, and GPU infrastructure that make LLM systems work at scale.

Currently working on:

  • Low-latency, fault-tolerant backends for LLM and GenAI workloads
  • LLM serving infrastructure -- vLLM, TensorRT-LLM, KV cache optimization
  • GPU workload orchestration on Kubernetes with autoscaling and workload-aware scheduling
  • Agentic AI backends with multi-agent orchestration, tool use, and planning
  • MCP (Model Context Protocol) servers and integrations for tool-augmented LLMs
  • Observability for AI systems -- latency, drift, GPU utilization

Core Competencies — Systems, LLMs, Infrastructure

Distributed Systems & Backend

  • System Design -- microservices, event-driven architecture, CQRS, domain-driven design
  • Scalability -- horizontal sharding, consistent hashing, load balancing, distributed caching (Redis, Memcached)
  • Performance -- gRPC and protobuf for low-latency communication, async I/O, connection pooling, concurrency patterns
  • Resilience -- circuit breakers, bulkhead isolation, rate limiting, exponential backoff, graceful degradation
  • Observability -- distributed tracing (OpenTelemetry, Jaeger), metrics (Prometheus, Grafana), structured logging

LLMs & GenAI

  • RAG Systems -- hybrid retrieval (dense + sparse), cross-encoder reranking, query decomposition, evaluation with RAGAS
  • Fine-tuning -- LoRA, QLoRA, RLHF, DPO, dataset curation, evaluation pipelines, experiment tracking (MLflow, W&B)
  • LLM Agents -- multi-agent orchestration, ReAct and plan-and-execute patterns, tool use, sandboxed code execution
  • MCP -- Model Context Protocol server development, tool registries, context injection, integration with Claude and GPT
  • Inference Optimization -- vLLM, TGI, TensorRT-LLM, KV cache tuning, continuous batching, speculative decoding, quantization (GPTQ, AWQ)
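
To ground the inference-optimization bullet above, here is a minimal offline-generation sketch with vLLM; the model id and sampling values are illustrative placeholders, and the engine handles PagedAttention KV caching and continuous batching internally.

```python
# Minimal vLLM offline-inference sketch; the model id and sampling values are
# placeholders, not a recommendation.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")   # assumed model id
params = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=256)

for out in llm.generate(["Explain KV caching in one paragraph."], params):
    print(out.outputs[0].text)
```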

Technical Stack — Languages, Frameworks, Infrastructure

AI / ML

  • Languages: Python (primary), some C++ for performance-critical paths
  • Frameworks: PyTorch, Transformers, LlamaIndex, LangChain, LangGraph
  • Agents: MCP servers, CrewAI, AutoGen, Claude Agent SDK, PydanticAI
  • Inference: vLLM (PagedAttention), TGI, ONNX Runtime, Triton Inference Server, TensorRT-LLM
  • Training: PEFT (LoRA/QLoRA), DeepSpeed ZeRO, bitsandbytes quantization, Unsloth
  • Evaluation: RAGAS, LangSmith, custom eval harnesses

Backend & Distributed Systems

  • Languages: Go (high-throughput services), Python (FastAPI, async services)
  • Messaging: Apache Kafka (event streaming), Redis Streams, RabbitMQ
  • Protocols: gRPC with protobuf, REST/OpenAPI, WebSockets for real-time
  • Patterns: Microservices, event-driven, CQRS, saga pattern for distributed transactions
  • Observability: OpenTelemetry (traces + metrics), Prometheus, Grafana, Jaeger, structured JSON logging

Cloud & Infrastructure

  • Orchestration: Kubernetes (EKS, GKE), Helm charts, Kustomize
  • IaC: Terraform for multi-cloud provisioning, Pulumi
  • Cloud: AWS (primary -- EC2, EKS, S3, SageMaker), GCP (GKE, Vertex AI)
  • Model Serving: KServe, Ray Serve, vLLM Operator, Triton
  • GPU Ops: NVIDIA GPU Operator, DCGM exporter, MIG partitioning, node affinity for GPU types (T4, A10G, A100)
  • GitOps: ArgoCD for continuous deployment, GitHub Actions for CI

Data & Storage

  • Vector DBs: Weaviate (primary), Qdrant, Pinecone, pgvector -- used for RAG, semantic search, similarity retrieval
  • Databases: PostgreSQL (OLTP), MongoDB (document store), Redis (caching + session store)
  • Processing: Kafka Streams for real-time ETL, Apache Airflow for batch orchestration
  • Search: Elasticsearch for hybrid retrieval (BM25 + dense vectors)

Featured Projects
Enterprise RAG System — Production-grade retrieval-augmented generation platform

Production RAG platform designed for accuracy and cost efficiency at scale.

  • Hybrid retrieval pipeline combining BM25 sparse search with dense vector embeddings, followed by cross-encoder reranking for precision (fusion step sketched after this list)
  • Multi-LLM router supporting OpenAI, Anthropic, and local models (vLLM) with automatic fallback and cost-aware routing
  • Advanced chunking strategies -- recursive splitting, semantic chunking, and parent-child document linking for context preservation
  • Evaluation framework using RAGAS (faithfulness, answer relevancy, context precision) with automated regression testing
  • Semantic caching layer with Redis to deduplicate similar queries and reduce LLM API costs by ~40%
  • Guardrails for hallucination detection, PII filtering, and output validation
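
A minimal sketch of the fusion step from the hybrid retrieval bullet above, assuming reciprocal rank fusion (RRF); the doc ids below are hypothetical, and the actual pipeline may combine sparse and dense scores differently.

```python
# Hedged sketch: fuse a BM25 ranking and a dense-embedding ranking with
# reciprocal rank fusion before handing candidates to the cross-encoder.
from collections import defaultdict

def rrf_fuse(ranked_lists, k=60):
    """Combine ranked lists of doc ids into a single fused ranking."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] += 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# Toy example: the fused candidates would then go to the cross-encoder reranker.
bm25_hits  = ["doc_3", "doc_1", "doc_7"]     # sparse ranking (hypothetical ids)
dense_hits = ["doc_1", "doc_9", "doc_3"]     # dense ranking (hypothetical ids)
print(rrf_fuse([bm25_hits, dense_hits]))     # ['doc_1', 'doc_3', 'doc_9', 'doc_7']
```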

Tech: Python, FastAPI, LangChain, Weaviate, vLLM, Redis, RAGAS

View Repository →

Autonomous LLM Agents — Multi-agent orchestration with MCP integration

Multi-agent system implementing production agentic AI patterns with tool use and planning.

  • Agent patterns -- ReAct (reason + act), plan-and-execute with dynamic replanning, reflection loops for self-correction (see the sketch after this list)
  • MCP server integration enabling agents to access external tools (file systems, databases, APIs) through Model Context Protocol
  • Dynamic tool registry allowing runtime tool discovery, schema validation, and permission-scoped execution
  • Multi-tier memory -- short-term (conversation buffer), episodic (vector store), and long-term (knowledge graph) for context management
  • Sandboxed execution environment with Docker containers for safe code execution, resource limits, and state snapshotting
  • Multi-agent communication using a supervisor pattern with specialized sub-agents for research, coding, and review tasks
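
A stripped-down sketch of the ReAct loop described above, with a stubbed call_llm and a two-entry tool registry; the real system uses LangGraph and MCP tool discovery, so treat this as conceptual rather than the project's actual code.

```python
# Conceptual ReAct loop: the model either picks a tool (act) or returns a final
# answer; observations are fed back for the next reasoning step. call_llm and
# both tools are stubs standing in for real LLM calls and MCP-exposed tools.
import json

TOOLS = {
    "search_docs": lambda q: f"stub results for {q!r}",   # hypothetical tool
    "run_sql": lambda q: "stub rows",                      # hypothetical tool
}

def call_llm(messages):
    """Stub for a chat-completion call (OpenAI, Anthropic, vLLM, ...)."""
    return json.dumps({"final_answer": "stub answer"})

def react_agent(task, max_steps=5):
    history = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        reply = json.loads(call_llm(history))              # reason: tool call or answer
        if "final_answer" in reply:
            return reply["final_answer"]
        observation = TOOLS[reply["tool"]](reply["input"]) # act
        history.append({"role": "assistant", "content": json.dumps(reply)})
        history.append({"role": "user", "content": f"Observation: {observation}"})
    return "stopped: step budget exhausted"

print(react_agent("Summarize last week's incidents"))
```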

Tech: Python, LangGraph, MCP, GPT-4, Claude, Weaviate, Docker

View Repository →

LLM Fine-tuning Platform — End-to-end fine-tuning with experiment tracking and deployment

Platform for fine-tuning LLMs from dataset preparation to production deployment.

  • Parameter-efficient fine-tuning with LoRA and QLoRA, supporting 4-bit quantization for training on consumer GPUs (setup sketched after this list)
  • Dataset pipeline -- ingestion, deduplication, quality filtering, instruction formatting (Alpaca, ShareGPT, ChatML)
  • Comprehensive evaluation suite -- perplexity, BLEU, ROUGE, human-eval, task-specific benchmarks with statistical significance testing
  • Experiment tracking via MLflow and Weights & Biases -- hyperparameter sweeps, loss curves, GPU utilization monitoring
  • Auto-deployment pipeline converting fine-tuned checkpoints to vLLM-compatible format with automated A/B testing
  • Multi-GPU training support through DeepSpeed ZeRO Stage 2/3 for models that don't fit on a single GPU
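
A hedged sketch of the LoRA/QLoRA setup from the first bullet: 4-bit base weights via bitsandbytes plus LoRA adapters via PEFT. The base model id, rank, and target modules are illustrative defaults, not the platform's actual configuration.

```python
# QLoRA-style setup sketch: quantize the base model to 4-bit and attach LoRA
# adapters so only the adapter weights are trained.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",              # assumed base model
    quantization_config=bnb_config,
    device_map="auto",
)
lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,  # illustrative hyperparameters
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()           # sanity check: only adapters train
```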

Tech: PyTorch, Transformers, PEFT, DeepSpeed, MLflow, Weights & Biases, vLLM

View Repository →

AI Gateway Microservices — High-performance Go backend for LLM routing and observability

Go-based API gateway designed for routing, caching, and observing LLM traffic at scale.

  • Multi-provider routing with weighted load balancing across OpenAI, Anthropic, and self-hosted models, with automatic failover on errors or latency spikes
  • Semantic caching using embedding similarity to serve cached responses for semantically equivalent queries, reducing costs (idea sketched after this list)
  • Token-based rate limiting with per-user and per-organization quotas, sliding window counters, and burst allowances
  • Full OpenTelemetry instrumentation -- request tracing across services, latency histograms, token usage metrics, cost attribution per tenant
  • gRPC inter-service communication with protobuf schemas, connection pooling, and circuit breakers for downstream service protection
  • Admin dashboard with real-time metrics, usage analytics, and cost breakdowns per model and per tenant
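
The gateway itself is written in Go; the Python sketch below only illustrates the semantic-caching idea from the second bullet: reuse a cached response when a new query's embedding is close enough (by cosine similarity) to one already answered. The threshold and the embedding source are assumptions.

```python
# Semantic cache sketch: linear scan over cached (embedding, response) pairs.
# A production version would use an ANN index and a tuned similarity threshold.
import numpy as np

class SemanticCache:
    def __init__(self, threshold=0.92):
        self.threshold = threshold          # assumed cutoff for a "hit"
        self.entries = []                   # list of (embedding, response) pairs

    def lookup(self, query_vec):
        query_vec = np.asarray(query_vec, dtype=np.float32)
        for vec, response in self.entries:
            sim = float(np.dot(query_vec, vec) /
                        (np.linalg.norm(query_vec) * np.linalg.norm(vec)))
            if sim >= self.threshold:
                return response             # cache hit: skip the upstream LLM call
        return None

    def store(self, query_vec, response):
        self.entries.append((np.asarray(query_vec, dtype=np.float32), response))
```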

Tech: Go, gRPC, Redis, PostgreSQL, OpenTelemetry, Prometheus, Docker

View Repository →

GPU Kubernetes Platform — Production K8s infrastructure for GPU workloads and model serving

Terraform-provisioned Kubernetes platform optimized for GPU-intensive ML workloads.

  • Multi-tier GPU node pools -- T4 (dev/inference), A10G (fine-tuning), A100 (large-scale training) with workload-aware scheduling
  • Model serving stack -- KServe for autoscaling inference endpoints, vLLM Operator for LLM-specific serving, Triton for multi-model serving
  • GPU monitoring via DCGM exporter -- utilization, memory, temperature, power metrics flowing to Prometheus/Grafana dashboards
  • MIG (Multi-Instance GPU) partitioning on A100s to run multiple smaller models on a single GPU for cost efficiency
  • GitOps workflow with ArgoCD for declarative cluster state, Helm charts for service packaging, automated rollbacks on health check failures
  • Cost optimization -- spot instance integration, cluster autoscaler with GPU-aware binpacking, idle GPU detection and scale-down
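
A hedged sketch of the idle-GPU detection mentioned in the last bullet: query Prometheus for the DCGM utilization metric and flag GPUs under a threshold. The Prometheus address, the DCGM_FI_DEV_GPU_UTIL metric name, the label names, and the threshold are assumptions, not the platform's exact configuration.

```python
# Query Prometheus for GPUs whose average utilization over a window is below a
# threshold; a controller could use the result to cordon or scale down nodes.
import requests

PROM_URL = "http://prometheus.monitoring:9090"   # assumed in-cluster address

def idle_gpus(threshold_pct=5, window="30m"):
    query = f"avg_over_time(DCGM_FI_DEV_GPU_UTIL[{window}]) < {threshold_pct}"
    resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": query}, timeout=10)
    resp.raise_for_status()
    results = resp.json()["data"]["result"]
    # Label names below (Hostname, gpu) follow the DCGM exporter's conventions
    # and are assumptions here.
    return [(r["metric"].get("Hostname", "?"), r["metric"].get("gpu", "?"))
            for r in results]
```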

Tech: Terraform, Kubernetes, Helm, ArgoCD, KServe, vLLM Operator, NVIDIA GPU Operator, Prometheus

View Repository →

LLM Engineering Fundamentals — Complete transformer from scratch, tokenization to generation

Educational implementation building a transformer from the ground up in pure NumPy.

  • BPE tokenizer implemented from scratch -- byte-pair encoding with vocabulary training, merge rules, and special token handling
  • Positional encodings -- RoPE (Rotary Position Embeddings) and ALiBi (Attention with Linear Biases) with mathematical derivations
  • Multi-head attention with causal masking, scaled dot-product, KV caching for efficient autoregressive generation (see the sketch after this list)
  • Decoding strategies -- greedy, beam search (with length normalization), top-k sampling, nucleus (top-p) sampling, temperature scaling
  • Full training loop with AdamW optimizer, learning rate warmup + cosine decay, gradient clipping, and checkpointing
  • 148+ passing tests across 10 project modules covering tokenization, embeddings, attention, FFN, training, and generation
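
A condensed NumPy sketch of the causal scaled dot-product attention the project implements (single head, learned projections and KV caching omitted); the repository's version is the authoritative one.

```python
import numpy as np

def causal_attention(q, k, v):
    """q, k, v: (seq_len, d_head) arrays for a single head."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                          # scaled dot-product
    mask = np.triu(np.ones_like(scores, dtype=bool), 1)    # strictly upper = future tokens
    scores = np.where(mask, -1e9, scores)                  # causal mask: no peeking ahead
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)         # row-wise softmax
    return weights @ v
```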

Tech: Python, NumPy (zero external ML dependencies)

View Repository →


Research — Exploring attention variants in minimal GPT

Attention Variants in Minimal GPT

Systematic comparison of 6 attention mechanisms in a zero-dependency GPT implementation (~12K parameters) across 3 rounds of experiments.

Variants tested:

  • Standard multi-head attention (baseline)
  • Gated value attention -- learnable gates on value projections (see the sketch at the end of this section)
  • Relational scoring -- explicit pairwise token relationship modeling
  • Salience weighting -- dynamic importance scoring before softmax
  • Context-augmented attention -- global context vector injected into attention computation
  • Difference attention -- attention based on token difference vectors

Key findings:

  • The gated-context variant consistently outperformed standard attention on character-level generation (lower loss, faster convergence)
  • Salience weighting showed the most improvement on longer sequences but added ~15% overhead
  • Difference attention underperformed on small models but may scale better (untested at larger scales)
  • All experiments run with identical hyperparameters, seeds, and training data for fair comparison
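
As one plausible reading of the "gated value attention" variant (learnable gates on value projections), the sketch below applies a sigmoid gate elementwise to V before the attention-weighted sum; the repository defines the exact formulation, so treat this only as an illustration.

```python
# Illustrative (not necessarily the repo's exact) gated value attention:
# a learnable sigmoid gate on the values, applied before the weighted sum.
# Causal masking is omitted for brevity.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_value_attention(q, k, v, w_gate, b_gate):
    """q, k, v: (seq, d); w_gate: (d, d) and b_gate: (d,) are learnable gate params."""
    gate = sigmoid(v @ w_gate + b_gate)                    # per-token, per-dim gate
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)         # softmax over keys
    return weights @ (gate * v)                            # attend over gated values
```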

Code & Results →


Let's Connect

Open to: System design discussions, backend architecture, open source collaboration, code reviews

Pinned Loading

  1. llm-engineering-fundamentals llm-engineering-fundamentals Public

    Production-ready transformer implementation from scratch: 10 complete projects covering tokenization, positional embeddings, attention, transformers, and more. 148 passing tests, modern techniques …

    Python

  2. mini-inference-engine mini-inference-engine Public

    From-scratch LLM inference engine: paged KV cache, continuous batching, speculative decoding, OpenAI-compatible API. ~1500 lines of Python.

    Python

  3. ai-daily-digest ai-daily-digest Public

    Auto-curated daily roundup of trending AI papers, blog posts, and discussions.

    1