I build machine learning infrastructure from first principles, focusing on distributed training systems, transformer inference, observability, and performance engineering.
My work explores how communication overhead, memory scaling, synchronization cost, and inference latency shape real-world ML system behavior.
| Project | Focus |
|---|---|
| Atlas AI | Distributed AI infrastructure platform for transformer systems, inference optimization, observability, and performance engineering |
| Distributed Training Profiler | Systems profiler for communication overhead, scaling efficiency, memory bottlenecks, and ZeRO optimization analysis |
| Benchmark Guardian | Automated benchmark regression detection platform with GitHub App integration and performance intelligence workflows |
| Distributed Training Simulator | Data-parallel scaling simulation with all-reduce communication analysis |
| Autograd Engine | Reverse-mode autodiff engine with dynamic computation graphs and scaling analysis |
| ML Reproducibility Auditor | Systems-oriented auditor for reproducibility, engineering quality, and ML infrastructure signals |
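The all-reduce communication analysis mentioned for the simulator can be approximated with the standard alpha-beta cost model for ring all-reduce, where each of `p` workers moves roughly `2(p-1)/p` of the gradient bytes. A minimal sketch (the function name, signature, and default latency are illustrative, not taken from the projects above):

```python
# Hypothetical ring all-reduce cost model: 2*(p-1) communication steps
# (reduce-scatter plus all-gather), with each worker transferring
# 2*(p-1)/p of the gradient bytes in total over its link.
def allreduce_time(grad_bytes: float, workers: int,
                   bandwidth_bytes_per_s: float,
                   latency_s: float = 5e-6) -> float:
    """Estimate seconds for one ring all-reduce of grad_bytes."""
    steps = 2 * (workers - 1)                          # per-step latency term
    payload = 2 * (workers - 1) / workers * grad_bytes # total bytes per worker
    return steps * latency_s + payload / bandwidth_bytes_per_s
```

Under this model the bandwidth term saturates near twice the gradient size as worker count grows, while the latency term keeps growing linearly, which is one reason scaling efficiency degrades at high worker counts.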
- Measure before optimizing
- Treat memory as a first-class constraint
- Make trade-offs explicit
- Design reproducible and observable systems
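"Measure before optimizing" and "treat memory as a first-class constraint" can be made concrete with a tiny profiling wrapper built on the standard library; this is a generic sketch, not code from the projects above, and the helper name `profile` is illustrative:

```python
import time
import tracemalloc

# Sketch of "measure before optimizing": capture wall-clock time and
# peak heap allocation of a call before changing any code.
def profile(fn, *args):
    tracemalloc.start()
    t0 = time.perf_counter()
    result = fn(*args)
    elapsed = time.perf_counter() - t0
    _, peak = tracemalloc.get_traced_memory()  # peak bytes since start()
    tracemalloc.stop()
    return result, elapsed, peak

result, seconds, peak_bytes = profile(sum, range(1_000_000))
```

Measuring time and peak memory together keeps the memory constraint visible in the same report as latency, rather than treating it as an afterthought.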
- Transformer inference systems
- Distributed runtime behavior
- Communication and synchronization overhead
- Memory-aware ML infrastructure
- Benchmark automation and regression analysis
- Observability for ML systems
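The benchmark automation and regression analysis focus can be sketched with a simple statistical gate: flag a candidate run when its median exceeds the baseline median by a relative threshold. The function name and the 5% threshold are illustrative assumptions, not the actual Benchmark Guardian logic:

```python
from statistics import median

# Hypothetical regression gate: compare median latencies so a single
# noisy sample cannot flip the verdict, and flag the candidate when it
# is more than `threshold` (relative) slower than the baseline.
def is_regression(baseline: list[float], candidate: list[float],
                  threshold: float = 0.05) -> bool:
    return median(candidate) > median(baseline) * (1 + threshold)

is_regression([10.0, 10.2, 9.9], [10.1, 10.3, 10.0])  # within noise
is_regression([10.0, 10.2, 9.9], [11.5, 11.7, 11.6])  # flagged
```

Using medians rather than means is a common choice here because benchmark timings are heavy-tailed and a single outlier should not trigger a false regression.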
| Area | Technologies |
|---|---|
| Languages | Python · C++ |
| Infrastructure | FastAPI · GitHub Apps · SQLite · CI/CD |
| ML Systems | Distributed Training · Autograd · Transformers |
| Performance | Profiling · Benchmarking · Memory Analysis |
| Systems | Multiprocessing · Synchronization · Communication |