🇷🇺 Russian Version / На русском
ML/NLP/LLM Engineer building production-grade machine learning systems and intelligent language applications. I combine a linguistics background with strong software engineering to bridge the gap between research notebooks and real-world deployment.
I focus on building robust, production-ready ML pipelines, from text preprocessing and model training to explainability (SHAP), containerization, CI/CD, and deployment. Below are projects I architected and implemented: from local LLM-powered analytics bots and semantic search toolkits to end-to-end classification pipelines with >90% test coverage.
My path into ML wasn't traditional, and that's my edge. With a background in linguistics and Eastern languages, I spent 12+ years working with language data at an international corporation: managing NMT workflows, CAT systems, terminology databases, and translation memory. I saw firsthand how language data flows, where linguistic bottlenecks hide, and how automation saves thousands of hours.
This deep understanding of how language works, from morphology to semantics, naturally pulled me into Machine Learning. Today, I combine this linguistic intuition with modern ML engineering to build systems that truly understand human language.
| Category | Technologies |
|---|---|
| ML, NLP & LLMs | |
| Backend & MLOps | |
| Databases | |
| DevOps, Testing & Tools | |
| API & Security | |
- 🇷🇺 Russian: Native
- 🇬🇧 English: C2 (Proficient)
- 🇪🇸 Spanish: C1 (Advanced)
- 🇫🇷 French: B2 (Upper-Intermediate)
- 🇩🇪 German: B1 (Intermediate)
- 🇸🇦 Arabic: B1 (Intermediate)
Multilingual proficiency gives me a practical edge in NLP: I understand morphology, syntax, and semantics across language families firsthand.
I'm looking for an ML/NLP/LLM Engineer role in a product-driven team where I can apply my unique blend of linguistic intuition, ML expertise, and production engineering to build intelligent systems that solve real-world problems.
📍 Location: Saint Petersburg, Russia (open to on-site, remote, and hybrid work)
📧 Contact: ilia.a.shaposhnikov@gmail.com
📱 Telegram: @iliashaposhnikov
💼 LinkedIn: @iliashaposhnikov
| Category | Project | Key Technologies | Core Concept & Challenges |
|---|---|---|---|
| AI & Intelligent Systems | 🤖 Video Analytics Bot | Aiogram, Ollama (LLM), PostgreSQL, asyncpg | NLP-powered bot: transforms natural language queries into SQL analytics using a local LLM (Mistral 7B) with prompt engineering and few-shot examples. |
| ML/NLP Pipeline | | FastAPI, Streamlit, scikit-learn, SHAP, pytest | End-to-end sentiment classification with confidence-weighted training, explainable predictions (SHAP), production REST API, and CI with >90% test coverage. |
| ML & NLP | 🔬 Embedding Visualizer | Gensim, scikit-learn, Matplotlib | Interactive toolkit for semantic analysis of word embeddings with vector-arrow analogy visualization, PCA/t-SNE cluster projection, and Google Analogy Test Set evaluation. |
| ML & NLP | 🎬 Movie Recommendation System | scikit-learn, pandas, TF-IDF, Aiogram | End-to-end recommendation engine based on textual features (genres, cast) with dual interfaces: Telegram bot and console app with visualization. |
| ML & NLP | 🚫 SMS Spam Detector | scikit-learn, pandas, CLI | Modular pipeline for binary SMS classification (Naive Bayes/Logistic Regression) with CLI interface, structured logging, and interpretability via confusion matrices. |
| NLP Research | 🔬 CountVectorizer Comparison | scikit-learn, NLTK, Matplotlib, Seaborn | Comparative analysis of 5 text preprocessing methods (lemmatization, stemming, stop-word removal, etc.) for news classification, evaluating accuracy, speed, and vocabulary size. |
| NLP Research | Text Keyword Extractor | scikit-learn, pandas, NLTK | In-depth TF-IDF analysis: from-scratch implementation compared against scikit-learn's version for keyword extraction, with analytical CLI. |
| ML Research | 🔮 Pumpkin Price & Color Forecast | scikit-learn, pandas, matplotlib, EDA | Dual predictive models: regression for price forecasting (R² = 0.969) and classification for color prediction (F1 = 0.94) with interpretable outputs. |
| Production Backend | 💰 Wallet REST API | FastAPI, PostgreSQL, Docker, async | High-load asynchronous API for financial operations with guaranteed data consistency under concurrent requests (transactions, row-level locks). Proof of production engineering skills. |
Full Project List
Intelligent Telegram bot converting natural language queries into analytical SQL queries for a video statistics database. Uses a local LLM (Ollama + Mistral 7B) for prompt engineering and code generation.
✨ Key Features:
- NLP Interface: Users ask questions in natural language ("How many videos have >100K views?"), the bot returns a precise numerical answer.
- Local LLM: Mistral 7B model via Ollama ensures complete data privacy, offline operation, and no limits/fees.
- Prompt Engineering: Detailed system prompt with database schema description, strict rules, and few-shot examples for stable SQL query generation.
- Production Architecture: Asynchronous bot on Aiogram 3.7+, optimized PostgreSQL with indexes, connection pooling via asyncpg.
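
The prompt-engineering idea can be sketched as a small prompt builder. This is an illustration only, not the bot's actual prompt: the schema string, the rules text, the SQL in the few-shot pair, and the names `SCHEMA`, `FEW_SHOT`, and `build_prompt` are all assumptions made for the example.

```python
# Hypothetical schema and few-shot pair, invented for illustration.
SCHEMA = "videos(id, title, views, published_at)"
FEW_SHOT = [
    (
        "How many videos have >100K views?",
        "SELECT COUNT(*) FROM videos WHERE views > 100000;",
    ),
]


def build_prompt(question: str) -> str:
    """Combine schema, rules, and few-shot pairs into one prompt string."""
    examples = "\n".join(f"Q: {q}\nSQL: {sql}" for q, sql in FEW_SHOT)
    return (
        "You translate questions into a single PostgreSQL SELECT statement.\n"
        f"Schema: {SCHEMA}\n"
        "Rules: return SQL only; never modify data.\n\n"
        f"{examples}\n\n"
        f"Q: {question}\nSQL:"
    )


if __name__ == "__main__":
    print(build_prompt("What is the average number of views?"))
```

Ending the prompt with an open `SQL:` line nudges the model to continue with a query rather than an explanation; the resulting string would then be sent to the local model.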
Production-ready end-to-end ML pipeline for airline tweet sentiment classification with confidence-aware training, explainable AI, async REST API, and interactive dashboard.
✨ Key Features:
- Confidence-Aware Training: Sample weighting based on annotation confidence scores for more robust model learning.
- Explainable Predictions: Per-prediction insights via top contributing words and optional SHAP value visualizations.
- Production REST API: Async-safe FastAPI service with Pydantic v2 validation, thread-safe model serving, CORS support, and timeout handling.
- Interactive Dashboard: Streamlit UI for single/batch predictions with CSV/JSON export, real-time health checks, and session persistence.
- Comprehensive Testing: >90% coverage with unit/integration tests, GitHub Actions CI, security auditing via pip-audit.
- Modular Architecture: Clean separation of data loading, preprocessing, modeling, API, and dashboard layers with centralized config management.
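
To show what confidence-aware training means in practice, here is a minimal sketch of a confidence-weighted log loss. It is not the pipeline's code; the function name and the sample values are assumptions. In scikit-learn the same effect is typically obtained by passing the confidence scores as `sample_weight` to an estimator's `fit` method.

```python
import math


def weighted_log_loss(y_true, p_pred, weights):
    """Binary log loss where each sample counts in proportion to its weight."""
    total = sum(weights)
    loss = 0.0
    for y, p, w in zip(y_true, p_pred, weights):
        loss += w * -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return loss / total


if __name__ == "__main__":
    labels = [1, 0, 1]
    probs = [0.9, 0.2, 0.3]        # the third sample is predicted poorly
    confidence = [1.0, 1.0, 0.4]   # ...but its annotation is also uncertain
    print(weighted_log_loss(labels, probs, [1.0, 1.0, 1.0]))
    print(weighted_log_loss(labels, probs, confidence))
```

With the lower weight on the uncertain label, that sample contributes less to the objective, so the model is penalized less for disagreeing with a doubtful annotation.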
Interactive research toolkit for deep semantic analysis of word embeddings with vector-arrow visualizations of semantic relationships. Built with a robust, modular architecture for enhanced stability and maintainability. Enables intuitive exploration of semantic structure in Word2Vec and GloVe models through analogy visualization with directional arrows, semantic cluster projection, and quality evaluation.
✨ Key Features:
- Semantic Cluster Projection: Automatic 2D mapping of seed words and their nearest neighbors with color-coded clusters using PCA (global structure) and t-SNE (local neighborhoods).
- Vector-Arrow Analogy Visualization: Unique 2D plots showing semantic relationships as directional arrows (`w2 - w1` and `result - w3`), visually demonstrating parallelism in vector arithmetic (`king - man + woman = queen`).
- Smart Model Management: Automatic download with integrity checks, mirror fallback, and binary caching for instant subsequent loads.
- Lazy Model Loading: Models load on demand via Model Manager, improving startup speed and memory efficiency.
- Robust Modular Architecture: Separated concerns across `services`, `presentation`, `visualization`, and `data` layers for clean, testable code. Centralized configuration and enhanced logging.
- Quality Evaluation: Testing on Google Analogy Test Set (19,544 questions) with accuracy breakdown by semantic/syntactic categories and vocabulary coverage analysis.
- Zero-Code Exploration: Intuitive command-line interface with contextual help, demo mode, and persistent session. No programming is required for deep semantic analysis.
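
The analogy arithmetic behind the arrow plots can be sketched with toy vectors. The 2-D values below are hand-picked so the offsets are parallel; they are not real Word2Vec or GloVe embeddings, and `VECTORS`, `cosine`, and `analogy` are names invented for this example.

```python
import math

# Toy 2-D vectors, hand-picked for illustration (not real embeddings).
VECTORS = {
    "man": (1.0, 1.0),
    "woman": (1.0, 3.0),
    "king": (3.0, 1.0),
    "queen": (3.0, 3.0),
    "apple": (-2.0, 0.5),
}


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def analogy(w1, w2, w3):
    """Return the word closest to w2 - w1 + w3, excluding the three inputs."""
    target = tuple(
        b - a + c for a, b, c in zip(VECTORS[w1], VECTORS[w2], VECTORS[w3])
    )
    candidates = [w for w in VECTORS if w not in {w1, w2, w3}]
    return max(candidates, key=lambda w: cosine(VECTORS[w], target))


if __name__ == "__main__":
    print(analogy("man", "king", "woman"))
```

Here `king - man` and `queen - woman` are the same offset, which is the parallelism the arrows are meant to make visible.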
Film recommendation engine with dual UI: Telegram bot and console application. The system analyzes descriptions and cast using NLP and ML techniques.
✨ Key Features:
- Two Algorithms: Recommendations based on genres/keywords and weighted cast analysis.
- Two Interfaces: Convenient Telegram bot and a visual console interface with charts.
- End-to-End Pipeline: From data preprocessing (TF-IDF) to an interactive web application.
Modular pipeline for binary SMS classification using scikit-learn vectorizers and probabilistic models, designed with production-ready patterns and an intuitive command-line interface.
✨ Key Features:
- Flexible Model Selection: Support for Naive Bayes (fast baseline) and Logistic Regression (higher accuracy) via `--model` CLI argument.
- Adaptive Vectorization: Switch between CountVectorizer and TfidfVectorizer with configurable n-grams, max features, and stop-word handling.
- Comprehensive Evaluation: Automated calculation of accuracy, F1, precision, recall, and ROC-AUC with JSON export for experiment tracking.
- Interpretability Tools: Confusion matrix visualization via sklearn's `ConfusionMatrixDisplay`, word clouds for spam/ham analysis, and misclassification inspection with probability scores.
- CLI-Driven Workflow: Full pipeline execution via `python scripts/train.py` with argparse validation, reproducibility via `--random-state`, and optional plot generation.
- Production Patterns: Structured logging to console + file, graceful error handling with exit codes, and artifact persistence (models, metrics, plots) with timestamps.
- Modular Architecture: Clean separation of concerns across `data`, `features`, `models`, `evaluation`, and `visualization` modules for maintainable, testable code.
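
A CLI of this shape could be declared with `argparse` roughly as follows. Only the `--model` and `--random-state` flags come from the description above; the choice values `nb` and `logreg`, the defaults, and the function name `build_parser` are assumptions for the sketch.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the training CLI; choice values and defaults are hypothetical."""
    parser = argparse.ArgumentParser(description="Train an SMS spam classifier.")
    parser.add_argument(
        "--model",
        choices=["nb", "logreg"],  # assumed spellings for the two models
        default="nb",
        help="nb = Naive Bayes baseline, logreg = Logistic Regression",
    )
    parser.add_argument(
        "--random-state",
        type=int,
        default=42,
        help="seed for reproducible splits and model fitting",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.model, args.random_state)
```

Restricting `--model` with `choices` lets argparse reject unknown values before any data is loaded, which is one way to get the validation mentioned above.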
Research project comparing the effectiveness of 5 text preprocessing methods (basic, stop-word removal, lemmatization, stemming, simple tokenization) when vectorizing with CountVectorizer on the BBC News dataset. Built with a modular architecture for enhanced readability and maintainability.
✨ Key Features:
- Comparative Analysis: Direct comparison of 5 text processing approaches by accuracy, vocabulary size, execution time, and matrix density.
- Deep NLP Focus: Implementation and evaluation of linguistic methods (lemmatization, stemming).
- Clear Conclusions: Identification of the optimal method that achieved a balance of accuracy and speed.
- Full Visualization & Reporting: Auto-generated comparative charts, tables, and detailed CSV reports.
- Modular Design: Code structured into dedicated modules (`methods`, `utils`) following DRY principles.
Research project conducting in-depth analysis of the TF-IDF algorithm: from-scratch implementation with detailed comparison against scikit-learn's version for extracting keywords from text documents (BBC News dataset).
✨ Key Features:
- TF-IDF from Scratch: Clean, documented pure-Python implementation without using third-party libraries for vector representation.
- Algorithm Comparison: Step-by-step comparison of manual TF-IDF (`idf = log(N / df)`) vs. scikit-learn's optimized TF-IDF (`idf = log((1 + N) / (1 + df)) + 1` with L2 normalization).
- Analytical CLI: Interactive console interface for comprehensive exploration: document search by terms, top-N keyword extraction with weights, TF-IDF comparison, and random document analysis.
- Industry Practices: `NLTK` for stop-word processing, result pagination, modular architecture, and real-world dataset usage.
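
The two IDF formulas above differ most for very common terms, which a few lines of Python make concrete. This is a sketch written for this README, not the project's implementation; the function names are assumptions.

```python
import math


def idf_plain(n_docs: int, df: int) -> float:
    """Textbook IDF: log(N / df)."""
    return math.log(n_docs / df)


def idf_smoothed(n_docs: int, df: int) -> float:
    """Smoothed IDF as quoted above: log((1 + N) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + df)) + 1


def l2_normalize(weights):
    """Scale a weight vector to unit Euclidean length."""
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights] if norm else list(weights)


if __name__ == "__main__":
    # A term that appears in every one of 4 documents:
    print(idf_plain(4, 4), idf_smoothed(4, 4))
    # A term that appears in only one of them:
    print(idf_plain(4, 1), idf_smoothed(4, 1))
```

With the plain formula a term present in every document gets weight 0 and disappears; the smoothed variant keeps it at 1, so such terms are down-weighted rather than dropped.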
Dual predictive modeling project for US agricultural market data: regression for pumpkin price forecasting and classification for color prediction, with emphasis on interpretability and production-ready code.
✨ Key Features:
- Interpretable Regression Models: From simple linear (`y = kx + b`) to multivariate polynomial models achieving R² = 0.969, with clear formulas and performance metrics (MSE, RMSE).
- High-Accuracy Classification: Logistic regression classifier predicting pumpkin color with F1 = 0.94 and AUC = 0.975, including threshold optimization and confusion matrix analysis.
- Comprehensive EDA Pipeline: Automated exploratory analysis with visualizations of seasonality, correlations, and feature distributions saved to structured output directories.
- Production-Ready Architecture: Modular code structure (`src/`, `scripts/`, `utils/`), reusable components, and demo script for end-to-end pipeline execution.
- Practical Agricultural Economics Focus: Real-world dataset (US pumpkin market) with actionable insights on price drivers (variety, location, packaging) and color predictors.
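
For readers less familiar with the regression metrics quoted above, here is a small sketch of how R² and RMSE are computed. The numbers are made up for the example and are unrelated to the pumpkin dataset; the function names are assumptions.

```python
import math


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mse)


if __name__ == "__main__":
    actual = [1.0, 2.0, 3.0, 4.0]       # illustrative values only
    predicted = [1.1, 1.9, 3.2, 3.8]
    print(r_squared(actual, predicted), rmse(actual, predicted))
```

R² measures the share of variance explained, while RMSE reports the typical error in price units, which is why the two are usually read together.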
High-load asynchronous REST API for managing financial balances. The service ensures data consistency during concurrent deposit/withdrawal operations, implementing an e-wallet pattern. This project demonstrates my ability to write production-grade, concurrent backend code, which is essential for deploying ML models at scale.
✨ Key Features:
- Concurrency Safety: Guarantees data integrity through `READ COMMITTED` transactions and `SELECT ... FOR UPDATE` row-level locks.
- Production-Grade Stack: Full cycle from asynchronous backend to containerization.
- Deployment Ready: Fully configured for Docker with orchestration (`docker-compose`).
- Comprehensive Documentation: Auto-generated interactive OpenAPI (Swagger) documentation at `/docs`.
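
The locking pattern can be sketched as a single coroutine. This is an illustration under assumptions, not the service's code: it expects an asyncpg-style connection (`transaction()`, `fetchrow()`, `execute()`), and the `wallets` table, its `id` and `balance` columns, and the function name `withdraw` are hypothetical.

```python
async def withdraw(conn, wallet_id: int, amount: int) -> int:
    """Debit a wallet inside one transaction and return the new balance.

    `conn` is assumed to be an asyncpg-style connection. PostgreSQL's
    default isolation level is READ COMMITTED, so none is set here.
    """
    async with conn.transaction():
        # FOR UPDATE locks the row until the transaction ends, so concurrent
        # withdrawals on the same wallet are serialized.
        row = await conn.fetchrow(
            "SELECT balance FROM wallets WHERE id = $1 FOR UPDATE",
            wallet_id,
        )
        if row is None:
            raise LookupError("wallet not found")
        if row["balance"] < amount:
            raise ValueError("insufficient funds")
        new_balance = row["balance"] - amount
        await conn.execute(
            "UPDATE wallets SET balance = $1 WHERE id = $2",
            new_balance,
            wallet_id,
        )
        return new_balance
```

Reading the balance and writing it back under the same row lock is what prevents two concurrent requests from both seeing the old balance and overdrawing the wallet.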
