Ilya Shaposhnikov IlyaShaposhnikov

👋 Hi, I'm Ilya Shaposhnikov

ML/NLP/LLM Engineer building production-grade machine learning systems and intelligent language applications. I combine a linguistics background with strong software engineering to bridge the gap between research notebooks and real-world deployment.

I focus on building robust, production-ready ML pipelines — from text preprocessing and model training to explainability (SHAP), containerization, CI/CD, and deployment. Below are projects I architected and implemented: from local LLM-powered analytics bots and semantic search toolkits to end-to-end classification pipelines with >90% test coverage.

Why NLP & ML?

My path into ML wasn't traditional — and that's my edge. With a background in linguistics and Eastern languages, I spent 12+ years working with language data at an international corporation: managing NMT workflows, CAT systems, terminology databases, and translation memory. I saw firsthand how language data flows, where linguistic bottlenecks hide, and how automation saves thousands of hours.

This deep understanding of how language works — from morphology to semantics — naturally pulled me into Machine Learning. Today, I combine this linguistic intuition with modern ML engineering to build systems that truly understand human language.

🛠 Technology Stack

Category	Technologies
ML, NLP & LLMs
Backend & MLOps
Databases
DevOps, Testing & Tools
API & Security

🗣️ [Natural] Languages

🇷🇺 Russian: Native
🇬🇧 English: C2 (Proficient)
🇪🇸 Spanish: C1 (Advanced)
🇫🇷 French: B2 (Upper-Intermediate)
🇩🇪 German: B1 (Intermediate)
🇸🇦 Arabic: B1 (Intermediate)

Multilingual proficiency gives me a practical edge in NLP — I understand morphology, syntax, and semantics across language families firsthand.

🎯 Open to Opportunities

I'm looking for an ML/NLP/LLM Engineer role in a product-driven team where I can apply my unique blend of linguistic intuition, ML expertise, and production engineering to build intelligent systems that solve real-world problems.

📍 Location: Saint Petersburg, Russia (open to on-site, remote, and hybrid work)

📧 Contact: ilia.a.shaposhnikov@gmail.com

📱 Telegram: @iliashaposhnikov

💼 LinkedIn: @iliashaposhnikov

🚀 Key Projects

Category	Project	Key Technologies	Core Concept & Challenges
AI & Intelligent Systems	🤖 Video Analytics Bot	Aiogram, Ollama (LLM), PostgreSQL, asyncpg	NLP-powered bot: transforms natural language queries into SQL analytics using a local LLM (Mistral 7B) with prompt engineering and few-shot examples.
ML/NLP Pipeline	✈️ Airline Sentiment Analysis	FastAPI, Streamlit, scikit-learn, SHAP, pytest	End-to-end sentiment classification with confidence-weighted training, explainable predictions (SHAP), production REST API, and CI with >90% test coverage.
ML & NLP	🔬 Embedding Visualizer	Gensim, scikit-learn, Matplotlib	Interactive toolkit for semantic analysis of word embeddings with vector-arrow analogy visualization, PCA/t-SNE cluster projection, and Google Analogy Test Set evaluation.
ML & NLP	🎬 Movie Recommendation System	scikit-learn, pandas, TF-IDF, Aiogram	End-to-end recommendation engine based on textual features (genres, cast) with dual interfaces: Telegram bot and console app with visualization.
ML & NLP	🚫 SMS Spam Detector	scikit-learn, pandas, CLI	Modular pipeline for binary SMS classification (Naive Bayes/Logistic Regression) with CLI interface, structured logging, and interpretability via confusion matrices.
NLP Research	🔬 CountVectorizer Comparison	scikit-learn, NLTK, Matplotlib, Seaborn	Comparative analysis of 5 text preprocessing methods (lemmatization, stemming, stop-word removal, etc.) for news classification — evaluating accuracy, speed, and vocabulary size.
NLP Research	🔑 Text Keyword Extractor	scikit-learn, pandas, NLTK	In-depth TF-IDF analysis: from-scratch implementation compared against scikit-learn's version for keyword extraction, with analytical CLI.
ML Research	🔮 Pumpkin Price & Color Forecast	scikit-learn, pandas, matplotlib, EDA	Dual predictive models: regression for price forecasting (R² = 0.969) and classification for color prediction (F1 = 0.94) with interpretable outputs.
Production Backend	💰 Wallet REST API	FastAPI, PostgreSQL, Docker, async	High-load asynchronous API for financial operations with guaranteed data consistency under concurrent requests (transactions, row-level locks). Proof of production engineering skills.

📚 Full Project List

🤖 Video Analytics Bot [AI, LLM, PostgreSQL]

Intelligent Telegram bot that converts natural language questions into analytical SQL queries for a video statistics database. SQL generation runs on a local LLM (Mistral 7B served through Ollama) steered by prompt engineering.

✨ Key Features:

NLP Interface: Users ask questions in natural language ("How many videos have >100K views?"), the bot returns a precise numerical answer.
Local LLM: Mistral 7B model via Ollama keeps all data on the local machine, works offline, and avoids API rate limits and usage fees.
Prompt Engineering: Detailed system prompt with database schema description, strict rules, and few-shot examples for stable SQL query generation.
Production Architecture: Asynchronous bot on Aiogram 3.7+, optimized PostgreSQL with indexes, connection pooling via asyncpg.
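
A minimal sketch of how the few-shot prompt and a read-only guard for the generated SQL could be assembled. The schema, the example pairs, and the helper names (`build_prompt`, `is_safe_select`) are assumptions made for illustration; the bot's actual prompt and validation rules are not shown in this README.

```python
import re

# Hypothetical schema and few-shot pairs, for illustration only.
SCHEMA = "videos(id, title, views_count, likes_count, created_at)"

FEW_SHOT = [
    ("How many videos are there?", "SELECT COUNT(*) FROM videos;"),
    ("How many videos have more than 100000 views?",
     "SELECT COUNT(*) FROM videos WHERE views_count > 100000;"),
]


def build_prompt(question: str) -> str:
    """Assemble rules, schema, few-shot pairs, and the user question."""
    lines = [
        "You translate questions into a single PostgreSQL SELECT statement.",
        "Return only SQL, with no explanations.",
        f"Schema: {SCHEMA}",
        "",
    ]
    for example_question, example_sql in FEW_SHOT:
        lines += [f"Question: {example_question}", f"SQL: {example_sql}", ""]
    lines += [f"Question: {question}", "SQL:"]
    return "\n".join(lines)


def is_safe_select(sql: str) -> bool:
    """Accept one read-only SELECT; reject anything that could modify data."""
    stripped = sql.strip().rstrip(";")
    if ";" in stripped:  # more than one statement
        return False
    if not re.match(r"(?i)^select\b", stripped):
        return False
    forbidden = r"(?i)\b(insert|update|delete|drop|alter|truncate|create|grant)\b"
    return re.search(forbidden, stripped) is None
```

The prompt string would be sent to the local model, and the reply would be executed only if `is_safe_select` accepts it. A keyword filter like this is a coarse safety net, so a read-only database role remains the stronger protection.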

🛠 Tech Stack:

📂 Project Repository

✈️ Airline Sentiment Analysis Pipeline [ML, NLP, FastAPI, Streamlit]

Production-ready end-to-end ML pipeline for airline tweet sentiment classification with confidence-aware training, explainable AI, async REST API, and interactive dashboard.

✨ Key Features:

Confidence-Aware Training: Sample weighting based on annotation confidence scores for more robust model learning.
Explainable Predictions: Per-prediction insights via top contributing words and optional SHAP value visualizations.
Production REST API: Async-safe FastAPI service with Pydantic v2 validation, thread-safe model serving, CORS support, and timeout handling.
Interactive Dashboard: Streamlit UI for single/batch predictions with CSV/JSON export, real-time health checks, and session persistence.
Comprehensive Testing: >90% coverage with unit/integration tests, GitHub Actions CI, security auditing via pip-audit.
Modular Architecture: Clean separation of data loading, preprocessing, modeling, API, and dashboard layers with centralized config management.
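
To illustrate the idea behind confidence-aware training, here is a small dependency-free sketch in which each labeled tweet contributes to the model in proportion to its annotation confidence. It is a pure-Python Naive Bayes stand-in written for this example, not the project's code; with scikit-learn the usual route is to pass the confidences as `sample_weight` to an estimator's `fit` method. The sample tweets and function names are invented.

```python
import math
from collections import defaultdict


def train_weighted_nb(samples):
    """Train from (text, label, confidence) triples; confidence scales every count."""
    class_weight = defaultdict(float)
    word_weight = defaultdict(lambda: defaultdict(float))
    vocab = set()
    for text, label, confidence in samples:
        class_weight[label] += confidence
        for token in text.lower().split():
            word_weight[label][token] += confidence
            vocab.add(token)
    return class_weight, word_weight, vocab


def predict(model, text):
    """Pick the label with the highest log-probability (add-one smoothing)."""
    class_weight, word_weight, vocab = model
    total = sum(class_weight.values())
    best_label, best_score = None, -math.inf
    for label, weight in class_weight.items():
        denom = sum(word_weight[label].values()) + len(vocab)
        score = math.log(weight / total)
        for token in text.lower().split():
            if token in vocab:
                score += math.log((word_weight[label].get(token, 0.0) + 1.0) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented examples: the last row is a low-confidence, probably mislabeled tweet.
SAMPLES = [
    ("great flight friendly crew", "positive", 1.0),
    ("lost my luggage terrible", "negative", 1.0),
    ("delayed again terrible service", "negative", 0.9),
    ("great crew", "negative", 0.3),
]

weighted_model = train_weighted_nb(SAMPLES)
unweighted_model = train_weighted_nb([(t, y, 1.0) for t, y, _ in SAMPLES])
```

On this toy data the low-confidence row pulls the unweighted model toward "negative" for the text "great crew", while the weighted model discounts it.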

🛠 Tech Stack:

📂 Project Repository

🔬 Embedding Visualizer [ML, NLP, Visualization]

Interactive research toolkit for deep semantic analysis of word embeddings with vector-arrow visualizations of semantic relationships. Built with a robust, modular architecture for enhanced stability and maintainability. Enables intuitive exploration of semantic structure in Word2Vec and GloVe models through analogy visualization with directional arrows, semantic cluster projection, and quality evaluation.

✨ Key Features:

Semantic Cluster Projection: Automatic 2D mapping of seed words and their nearest neighbors with color-coded clusters using PCA (global structure) and t-SNE (local neighborhoods).
Vector-Arrow Analogy Visualization: Unique 2D plots showing semantic relationships as directional arrows (w2 → w1 and result → w3), visually demonstrating parallelism in vector arithmetic (king - man + woman ≈ queen).
Smart Model Management: Automatic download with integrity checks, mirror fallback, and binary caching for instant subsequent loads.
Lazy Model Loading: Models load on demand via Model Manager, improving startup speed and memory efficiency.
Robust Modular Architecture: Separated concerns across services, presentation, visualization, and data layers for clean, testable code. Centralized configuration and enhanced logging.
Quality Evaluation: Testing on Google Analogy Test Set (19,544 questions) with accuracy breakdown by semantic/syntactic categories and vocabulary coverage analysis.
Zero-Code Exploration: Intuitive command-line interface with contextual help, demo mode, and persistent session — no programming required for deep semantic analysis.
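
The analogy arithmetic behind the arrow plots can be sketched with toy vectors. The 2D values below are invented so the example runs without downloading a model; real Word2Vec or GloVe vectors have hundreds of dimensions, and the toolkit's own lookup code may differ.

```python
import math

# Toy 2-D vectors invented for illustration only.
VECTORS = {
    "king": (0.9, 0.8),
    "queen": (0.9, 0.2),
    "man": (0.3, 0.8),
    "woman": (0.3, 0.2),
    "apple": (-0.7, 0.1),
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def offset(start, end):
    """Arrow from one word to another, as drawn in the plots."""
    return tuple(e - s for s, e in zip(VECTORS[start], VECTORS[end]))


def analogy(w1, w2, w3):
    """Return the word closest to w1 - w2 + w3, excluding the three inputs."""
    target = tuple(
        a - b + c for a, b, c in zip(VECTORS[w1], VECTORS[w2], VECTORS[w3])
    )
    candidates = [w for w in VECTORS if w not in {w1, w2, w3}]
    return max(candidates, key=lambda w: cosine(target, VECTORS[w]))
```

Here `analogy("king", "man", "woman")` selects "queen", and the arrows `offset("man", "king")` and `offset("woman", "queen")` point in the same direction, which is the parallelism the plots make visible.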

🛠 Tech Stack:

📂 Project Repository

🎬 Movie Recommendation System [NLP, TF-IDF]

Film recommendation engine with dual UI: Telegram bot and console application. The system analyzes descriptions and cast using NLP and ML techniques.

✨ Key Features:

Two Algorithms: Recommendations based on genres/keywords and weighted cast analysis.
Two Interfaces: Convenient Telegram bot and a visual console interface with charts.
End-to-End Pipeline: From data preprocessing (TF-IDF) to interactive Telegram bot and console interfaces.
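
A rough sketch of content-based recommendation over textual features. To stay dependency-free it compares raw token counts with cosine similarity instead of TF-IDF weights, and the mini-catalog and the `recommend` helper are hypothetical.

```python
import math
from collections import Counter

# Hypothetical mini-catalog: genre tags followed by cast tokens.
MOVIES = {
    "Space Saga": "sci-fi adventure alice_ray bob_stone",
    "Star Drift": "sci-fi drama alice_ray",
    "Love in Paris": "romance comedy carol_west",
}


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse token-count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recommend(title: str, top_n: int = 2):
    """Rank the other movies by similarity to the given title."""
    query = Counter(MOVIES[title].split())
    scores = [
        (other, cosine(query, Counter(features.split())))
        for other, features in MOVIES.items()
        if other != title
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_n]
```

Swapping the counts for TF-IDF weights would reduce the influence of very common genre tags while keeping the same ranking logic.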

🛠 Tech Stack:

📂 Project Repository

🚫 SMS Spam Detector [ML, NLP, CLI]

Modular pipeline for binary SMS classification using scikit-learn vectorizers and probabilistic models, designed with production-ready patterns and an intuitive command-line interface.

✨ Key Features:

Flexible Model Selection: Support for Naive Bayes (fast baseline) and Logistic Regression (higher accuracy) via --model CLI argument.
Adaptive Vectorization: Switch between CountVectorizer and TfidfVectorizer with configurable n-grams, max features, and stop-word handling.
Comprehensive Evaluation: Automated calculation of accuracy, F1, precision, recall, and ROC-AUC with JSON export for experiment tracking.
Interpretability Tools: Confusion matrix visualization via sklearn's ConfusionMatrixDisplay, word clouds for spam/ham analysis, and misclassification inspection with probability scores.
CLI-Driven Workflow: Full pipeline execution via python scripts/train.py with argparse validation, reproducibility via --random-state, and optional plot generation.
Production Patterns: Structured logging to console + file, graceful error handling with exit codes, and artifact persistence (models, metrics, plots) with timestamps.
Modular Architecture: Clean separation of concerns across data, features, models, evaluation, and visualization modules for maintainable, testable code.
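
A sketch of what the argparse front end might look like. Only `--model` and `--random-state` are named above; the choice strings, defaults, and the `--vectorizer` and `--plots` flags are assumptions for illustration, not the actual interface of scripts/train.py.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build the CLI; argparse rejects invalid choices with exit code 2."""
    parser = argparse.ArgumentParser(description="Train an SMS spam classifier.")
    # Flag values below are hypothetical except for the two flag names
    # mentioned in the README (--model, --random-state).
    parser.add_argument("--model", choices=["nb", "logreg"], default="nb",
                        help="nb = Naive Bayes, logreg = Logistic Regression")
    parser.add_argument("--vectorizer", choices=["count", "tfidf"], default="tfidf")
    parser.add_argument("--random-state", type=int, default=42,
                        help="seed for reproducible splits")
    parser.add_argument("--plots", action="store_true",
                        help="also save evaluation plots")
    return parser


def main(argv=None) -> int:
    """Parse arguments and return a process exit code."""
    args = build_parser().parse_args(argv)
    print(f"model={args.model} vectorizer={args.vectorizer} seed={args.random_state}")
    # Training and evaluation would run here.
    return 0
```

Returning an integer from `main` lets the script end with `sys.exit(main())`, so failures surface as non-zero exit codes in shell pipelines and CI.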

🛠 Tech Stack:

📂 Project Repository

🔬 CountVectorizer Comparison Project [NLP]

Research project comparing the effectiveness of 5 text preprocessing methods (basic, stop-word removal, lemmatization, stemming, simple tokenization) when vectorizing with CountVectorizer on the BBC News dataset. Built with a modular architecture for enhanced readability and maintainability.

✨ Key Features:

Comparative Analysis: Direct comparison of 5 text processing approaches by accuracy, vocabulary size, execution time, and matrix density.
Deep NLP Focus: Implementation and evaluation of linguistic methods (lemmatization, stemming).
Clear Conclusions: Identifies the method that offers the best balance of accuracy and speed.
Full Visualization & Reporting: Auto-generated comparative charts, tables, and detailed CSV reports.
Modular Design: Code structured into dedicated modules (methods, utils) following DRY principles.
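
To show why preprocessing changes vocabulary size, here is a toy comparison harness. The stop-word list and the suffix-stripping stemmer are deliberately naive stand-ins for the NLTK components used in the project, and the two sample sentences are invented.

```python
import re

STOP_WORDS = {"the", "a", "is", "are", "and", "of", "in"}  # tiny illustrative set


def tokenize(text):
    """Lowercase and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]


def crude_stem(tokens):
    """Naive suffix stripping; a stand-in for a real stemmer."""
    stems = []
    for token in tokens:
        for suffix in ("ing", "ed", "s"):
            if token.endswith(suffix) and len(token) - len(suffix) >= 3:
                token = token[: -len(suffix)]
                break
        stems.append(token)
    return stems


PIPELINES = {
    "basic": lambda tokens: tokens,
    "stop_words": remove_stop_words,
    "stemming": lambda tokens: crude_stem(remove_stop_words(tokens)),
}


def vocabulary_sizes(docs):
    """Return the number of distinct terms each pipeline produces."""
    sizes = {}
    for name, pipeline in PIPELINES.items():
        vocab = set()
        for doc in docs:
            vocab.update(pipeline(tokenize(doc)))
        sizes[name] = len(vocab)
    return sizes


DOCS = [
    "The markets are rising and the market is booming",
    "Players played in the final and a player scored",
]
```

Calling `vocabulary_sizes(DOCS)` shows the vocabulary shrinking as stop words are dropped and inflected forms are merged, which is the trade-off the project measures against accuracy and runtime.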

🛠 Tech Stack:

📂 Project Repository

🔑 Text Keyword Extractor [TF-IDF, NLP]

Research project conducting in-depth analysis of the TF-IDF algorithm: from-scratch implementation with detailed comparison against scikit-learn's version for extracting keywords from text documents (BBC News dataset).

✨ Key Features:

TF-IDF from Scratch: Clean, documented pure-Python implementation that builds the vector representation without third-party libraries.
Algorithm Comparison: Step-by-step comparison of manual TF-IDF (idf = log(N / df)) vs. scikit-learn's default smoothed TF-IDF (idf = log((1 + N) / (1 + df)) + 1, followed by L2 normalization).
Analytical CLI: Interactive console interface for comprehensive exploration: document search by terms, top-N keyword extraction with weights, TF-IDF comparison, and random document analysis.
Industry Practices: NLTK for stop-word processing, result pagination, modular architecture, and real-world dataset usage.
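
The two IDF formulas above can be compared directly in a few lines. This is a simplified sketch with an invented three-document corpus, not the project's implementation; it uses raw term counts for TF and assumes the document being scored is part of the corpus, so df is never zero.

```python
import math


def idf_plain(n_docs: int, df: int) -> float:
    """Textbook variant: idf = log(N / df)."""
    return math.log(n_docs / df)


def idf_smooth(n_docs: int, df: int) -> float:
    """Smoothed variant: idf = log((1 + N) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + df)) + 1


def tfidf(doc_tokens, corpus, idf_fn, normalize=False):
    """Weight each term of one document; optionally L2-normalize the result."""
    n_docs = len(corpus)
    weights = {}
    for term in set(doc_tokens):
        tf = doc_tokens.count(term)  # raw term count
        df = sum(1 for doc in corpus if term in doc)
        weights[term] = tf * idf_fn(n_docs, df)
    if normalize:
        norm = math.sqrt(sum(w * w for w in weights.values()))
        if norm:
            weights = {term: w / norm for term, w in weights.items()}
    return weights


# Invented toy corpus of pre-tokenized documents.
CORPUS = [
    ["stocks", "rise", "today"],
    ["stocks", "fall", "today"],
    ["football", "final", "today"],
]

plain = tfidf(CORPUS[0], CORPUS, idf_plain)
smooth = tfidf(CORPUS[0], CORPUS, idf_smooth, normalize=True)
```

A term that occurs in every document ("today") gets weight 0 under the plain formula but keeps a small positive weight under the smoothed one, while both variants rank "rise" as the top keyword for the first document.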

🛠 Tech Stack:

📂 Project Repository

🔮 Pumpkin Price & Color Forecast [scikit-learn, pandas, EDA]

Dual predictive modeling project for US agricultural market data: regression for pumpkin price forecasting and classification for color prediction, with emphasis on interpretability and production-ready code.

✨ Key Features:

Interpretable Regression Models: From simple linear (y = kx + b) to multivariate polynomial models achieving R² = 0.969, with clear formulas and performance metrics (MSE, RMSE).
High-Accuracy Classification: Logistic regression classifier predicting pumpkin color with F1 = 0.94 and AUC = 0.975, including threshold optimization and confusion matrix analysis.
Comprehensive EDA Pipeline: Automated exploratory analysis with visualizations of seasonality, correlations, and feature distributions saved to structured output directories.
Production-Ready Architecture: Modular code structure (src/, scripts/, utils/), reusable components, and demo script for end-to-end pipeline execution.
Practical Agricultural Economics Focus: Real-world dataset (US pumpkin market) with actionable insights on price drivers (variety, location, packaging) and color predictors.
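
For the simplest model in that range, y = kx + b, the least-squares fit and R² can be computed by hand. The sketch below uses made-up numbers rather than the pumpkin dataset and covers only the single-feature case, not the multivariate polynomial models behind the reported score.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = k * x + b with one feature."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    k = sxy / sxx
    b = mean_y - k * mean_x
    return k, b


def r_squared(xs, ys, k, b):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (k * x + b)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Made-up points lying exactly on y = 2x + 1.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
k, b = fit_line(xs, ys)
```

On real data the points scatter around the line, R² falls below 1, and comparing it across model families shows how much variance each one explains.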

🛠 Tech Stack:

📂 Project Repository

💰 Wallet REST API [FastAPI, PostgreSQL]

High-load asynchronous REST API for managing financial balances. The service ensures data consistency during concurrent deposit/withdrawal operations, implementing an e-wallet pattern. This project demonstrates my ability to write production-grade, concurrent backend code — essential for deploying ML models at scale.

✨ Key Features:

Concurrency Safety: Guarantees data integrity through READ COMMITTED transactions and SELECT ... FOR UPDATE row-level locks.
Production-Grade Stack: Full cycle from asynchronous backend to containerization.
Deployment Ready: Fully configured for Docker with orchestration (docker-compose).
Comprehensive Documentation: Auto-generated interactive OpenAPI (Swagger) documentation at /docs.
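
The locking pattern from the first bullet can be sketched as a single withdrawal function. It assumes `conn` is an asyncpg connection and that balances live in a `wallets` table with `id` and `balance` columns; the table, the column names, and the exception class are hypothetical, and the service's real data-access layer may look different.

```python
from decimal import Decimal


class InsufficientFunds(Exception):
    """Raised when a withdrawal exceeds the current balance."""


async def withdraw(conn, wallet_id: str, amount: Decimal) -> Decimal:
    """Debit a wallet atomically and return the new balance."""
    # One transaction wraps the read and the write.
    async with conn.transaction(isolation="read_committed"):
        # FOR UPDATE locks the row, so a concurrent withdrawal on the same
        # wallet waits here until this transaction commits or rolls back.
        row = await conn.fetchrow(
            "SELECT balance FROM wallets WHERE id = $1 FOR UPDATE", wallet_id
        )
        if row is None:
            raise LookupError(f"wallet {wallet_id} not found")
        if row["balance"] < amount:
            raise InsufficientFunds(wallet_id)  # exiting the block rolls back
        new_balance = row["balance"] - amount
        await conn.execute(
            "UPDATE wallets SET balance = $1 WHERE id = $2", new_balance, wallet_id
        )
        return new_balance
```

Without the row lock, two concurrent withdrawals could both read the same balance under READ COMMITTED and each write back its own result, silently losing one of the debits.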

🛠 Tech Stack:

📂 Project Repository

🇷🇺 Russian Version / На русском
