An AI-powered analytics platform that discovers actual process flows, detects bottlenecks, predicts delays, and generates optimization recommendations from enterprise event logs.
- Overview
- Business Problem
- Solution
- Live Demo
- Architecture
- Project Structure
- Core Modules
- Tech Stack
- Dataset
- Getting Started
- ETL Pipeline
- Dashboard
- API
- Roadmap
- Future Enhancements
Traditional organizations define ideal workflows on paper, but actual execution frequently differs due to delays, rework loops, bottlenecks, and manual interventions. This platform bridges that gap.
The Business Process Mining & Optimization Platform ingests raw event logs from enterprise systems, reconstructs the actual process flow using the PM4Py library, computes KPIs, identifies deviations from the ideal path, and surfaces bottlenecks — all through an interactive Streamlit dashboard and a FastAPI backend.
Built in three phases:
| Phase | Focus | Status |
|---|---|---|
| Phase 1 | ETL · Process Discovery · KPIs · Variants · Dashboard · API | ✅ Complete |
| Phase 2 | ML Predictions · AI Recommendations · React Frontend | 🔄 In Progress |
| Phase 3 | GenAI Copilot · RAG Assistant · Real-time Monitoring | 🗓 Planned |
Organizations face recurring operational challenges that are difficult to diagnose without data:
- Delayed approvals — cases sitting idle between process steps
- Inefficient workflows — redundant steps adding no value
- SLA violations — tickets breaching resolution time thresholds
- Process bottlenecks — specific activities causing systemic slowdowns
- High operational costs — rework loops consuming team capacity
- Lack of visibility — business leaders relying on assumptions, not data
Standard BI tools show aggregated metrics but cannot reveal how the process actually executes case by case.
This platform applies process mining, an analytical discipline whose XES event-log format is standardised as IEEE 1849, to reconstruct and analyse the true process from event log data.
What it delivers:
- Discovers the actual process flow (not the assumed one) as a visual Sankey diagram
- Computes cycle time, wait time, SLA breach rate, escalation rate, throughput
- Identifies the top process variants and how often each occurs
- Compares the ideal (happy) path against all deviating cases
- Ranks bottleneck activities by average wait time
- Detects rework loops — cases that visit the same step more than once
- (Phase 2) Predicts which cases are at risk of SLA breach before it happens
- (Phase 3) Answers natural language questions: "Why are tickets being escalated?"
# Clone and run locally — see Getting Started below
streamlit run dashboard/app.py

Screenshots below show the Phase 1 dashboard on the helpdesk dataset (1M tickets).
┌─────────────────────────────────────────────────────────────┐
│ Data Sources │
│ CSV / Excel · XES files · ERP · CRM · HRMS · Helpdesk │
└────────────────────────┬────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ ETL Pipeline (etl/) │
│ Extract → Validate → Transform → Load → PostgreSQL │
└────────────────────────┬────────────────────────────────────┘
│
▼
┌────────────────────────────────────────────────────────────┐
│ Process Mining Engine │
│ process_mining/discovery.py · process_mining/variants │
│ PM4Py DFG · Heuristic Miner · Variant Analysis │
└────────────────────────┬───────────────────────────────────┘
│
▼
┌────────────────────────────────────────────────────────────┐
│ Analytics Layer │
│ analytics/kpis.py · analytics/bottlenecks.py │
│ Cycle Time · SLA · Escalation · Throughput · Rework │
└──────────────┬─────────────────────────┬───────────────────┘
│ │
▼ ▼
┌──────────────────────┐ ┌────────────────────────────────┐
│ Streamlit Dashboard │ │ FastAPI Backend │
│ dashboard/app.py │ │ api/main.py │
│ Sankey · KPI Cards │ │ /api/kpis · /api/variants │
│ Heatmap · Variants │ │ /api/graph · /api/cases │
└──────────────────────┘ └────────────────────────────────┘
BPMO/
│
├── .env # DB credentials (never committed)
├── .streamlit/
│ └── config.toml # Streamlit light theme config
├── config.yaml # All platform configuration
├── config_loader.py # Single import point for config + env
├── requirements.txt
├── README.md
│
├── data/
│ └── raw/ # Raw dataset files
│ └── helpdesk_tickets.csv
│
├── database/
│ └── schema.sql # PostgreSQL table definitions
│
├── etl/ # Phase 1 — Step 1
│ ├── extractor.py # Read CSV / XES / Excel / JSON
│ ├── validator.py # Column checks · null checks · row count
│ ├── transformer.py # Clean · enrich · compute features
│ ├── loader.py # Write to PostgreSQL in chunks
│ └── pipeline.py # Orchestrator: Extract→Validate→Transform→Load
│
├── process_mining/ # Phase 1 — Steps 2 & 4
│ ├── discovery.py # DFG · Heuristic Miner · PM4Py
│ └── variants.py # Variant frequency · ideal vs actual
│
├── analytics/ # Phase 1 — Step 3
│ ├── kpis.py # Cycle time · SLA · escalation · throughput
│ └── bottlenecks.py # Wait times · slow transitions · rework
│
├── dashboard/ # Phase 1 — Step 5
│ ├── app.py # Streamlit entry point
│ └── components.py # Reusable Plotly chart components
│
├── api/ # Phase 1 — Step 6
│ ├── main.py # FastAPI application
│ └── routes.py # All REST endpoints
│
└── tests/
    └── __init__.py
Converts raw ticket data into a standardised event log stored in PostgreSQL.
The helpdesk CSV has one row per ticket. Process mining requires one row per event. The transformer explodes each ticket into 2–3 events:
Ticket #1001 → ticket opened (@ Date_Created)
→ ticket escalated (@ Date_Created + 1hr) [if escalated]
→ ticket resolved (@ Date_Resolved) [if resolved]
→ status: open (@ Date_Created + 2hrs) [if still open]
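
The explosion rule above can be sketched in plain Python. This is a minimal illustration using only the standard library, not the project's `transformer.py`: the function name `explode_ticket` and the dict-based row format are assumptions, and sorting events by timestamp is an assumed convention rather than something the source states.

```python
from datetime import datetime, timedelta

def explode_ticket(ticket):
    """Turn one ticket row (a dict) into (case_id, activity, timestamp) events."""
    case_id = ticket["Ticket_ID"]
    created = ticket["Date_Created"]
    events = [(case_id, "ticket opened", created)]
    if ticket["Escalated"]:
        # Synthetic timestamp: one hour after creation, as in the rule above.
        events.append((case_id, "ticket escalated", created + timedelta(hours=1)))
    if ticket["Date_Resolved"] is not None:
        events.append((case_id, "ticket resolved", ticket["Date_Resolved"]))
    else:
        # Still-open tickets get a synthetic status event two hours after creation.
        events.append((case_id, "status: open", created + timedelta(hours=2)))
    # Assumption: events are ordered by timestamp within a case.
    return sorted(events, key=lambda event: event[2])

resolved_ticket = {
    "Ticket_ID": "1001",
    "Date_Created": datetime(2024, 1, 1, 9, 0),
    "Date_Resolved": datetime(2024, 1, 2, 9, 0),
    "Escalated": True,
}
open_ticket = {
    "Ticket_ID": "1002",
    "Date_Created": datetime(2024, 1, 1, 9, 0),
    "Date_Resolved": None,
    "Escalated": False,
}

for event in explode_ticket(resolved_ticket) + explode_ticket(open_ticket):
    print(event)
```

An escalated, resolved ticket yields three events and an open, non-escalated ticket yields two, which matches the 2–3 events per ticket described above.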
Output tables:
| Table | Rows | Description |
|---|---|---|
| event_log | ~2.5M | One row per event with case_id, activity, timestamp, wait_time_mins, cycle_time_mins |
| cases | 1M | One summary row per ticket: start/end time, cycle time, escalation flag |
| etl_runs | N | Audit trail of every pipeline execution |
Reads the event log from PostgreSQL and runs two PM4Py algorithms:
- DFG (Directly-Follows Graph) — counts every activity → activity transition. Becomes the Sankey diagram on the dashboard.
- Heuristic Miner — filters noise from the DFG, showing only dominant process paths.
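
To make the directly-follows idea concrete, here is a small standard-library sketch that counts transitions without PM4Py. It is an illustration only: `directly_follows` and `filter_edges` are hypothetical names, and the minimum-count filter is a simple frequency threshold (similar in spirit to the dashboard's minimum edge count slider), not the Heuristic Miner's dependency measure.

```python
from collections import Counter, defaultdict
from datetime import datetime

def directly_follows(events):
    """Count activity -> activity transitions inside each case."""
    traces = defaultdict(list)
    # Order by case, then by timestamp, so each trace is chronological.
    for case_id, activity, _ in sorted(events, key=lambda e: (e[0], e[2])):
        traces[case_id].append(activity)
    dfg = Counter()
    for activities in traces.values():
        dfg.update(zip(activities, activities[1:]))
    return dfg

def filter_edges(dfg, min_count):
    """Keep only edges seen at least min_count times (plain frequency filter)."""
    return {edge: count for edge, count in dfg.items() if count >= min_count}

def at(hour):
    return datetime(2024, 1, 1, hour)

events = [
    ("A", "ticket opened", at(9)),
    ("A", "ticket escalated", at(10)),
    ("A", "ticket resolved", at(12)),
    ("B", "ticket opened", at(9)),
    ("B", "ticket resolved", at(11)),
    ("C", "ticket opened", at(9)),
    ("C", "ticket resolved", at(15)),
]

dfg = directly_follows(events)
frequent = filter_edges(dfg, min_count=2)
print(dfg)
print(frequent)
```

Each `(source, target)` key with its count corresponds to one link in a Sankey diagram.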
Computes all headline metrics with optional filters by source, priority, and category:
| KPI | Formula |
|---|---|
| Avg cycle time | Mean of (last_event_timestamp − first_event_timestamp) per case |
| SLA breach rate | % of cases where resolution_time_hrs > threshold (default: 48h) |
| Escalation rate | % of cases where escalated = True |
| Resolution rate | % of cases that reached 'ticket resolved' activity |
| Throughput | Resolved cases per day |
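
The formulas in the table can be worked through on a handful of cases. The sketch below is a standard-library illustration, not the project's `kpis.py`: the dict keys and the function name `compute_kpis` are assumptions, and the throughput window (earliest start to latest end) is one possible definition that the source does not specify.

```python
from datetime import datetime

def compute_kpis(cases, sla_threshold_hrs=48.0):
    """Compute the five headline KPIs from a list of case-summary dicts."""
    total = len(cases)
    cycle_hrs = [(c["end"] - c["start"]).total_seconds() / 3600 for c in cases]
    resolved = sum(1 for c in cases if c["resolved"])
    escalated = sum(1 for c in cases if c["escalated"])
    breached = sum(
        1 for c in cases
        if c["resolution_time_hrs"] is not None
        and c["resolution_time_hrs"] > sla_threshold_hrs
    )
    # Assumption: the observation window runs from earliest start to latest end.
    window = max(c["end"] for c in cases) - min(c["start"] for c in cases)
    window_days = window.total_seconds() / 86400
    return {
        "avg_cycle_time_hrs": sum(cycle_hrs) / total,
        "sla_breach_rate_pct": 100 * breached / total,
        "escalation_rate_pct": 100 * escalated / total,
        "resolution_rate_pct": 100 * resolved / total,
        "throughput_per_day": resolved / window_days,
    }

cases = [
    {"start": datetime(2024, 1, 1), "end": datetime(2024, 1, 2),
     "resolution_time_hrs": 24.0, "escalated": False, "resolved": True},
    {"start": datetime(2024, 1, 1), "end": datetime(2024, 1, 4),
     "resolution_time_hrs": 72.0, "escalated": True, "resolved": True},
    {"start": datetime(2024, 1, 2), "end": datetime(2024, 1, 3),
     "resolution_time_hrs": 24.0, "escalated": False, "resolved": True},
    # Still open: the last event is the synthetic status event two hours in.
    {"start": datetime(2024, 1, 3), "end": datetime(2024, 1, 3, 2),
     "resolution_time_hrs": None, "escalated": False, "resolved": False},
]

kpis = compute_kpis(cases)
print(kpis)
```

With these four cases, one of four exceeds the 48-hour threshold, so the breach rate is 25%, and three cases are resolved over a three-day window, so throughput is one case per day.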
Four analyses in one module:
- Activity wait times — avg/P90/max gap before each activity starts
- Slow transitions — ranked table of activity → activity handoff durations
- Rework loops — cases that visit the same activity more than once
- Priority comparison — are Critical tickets faster than Low priority ones?
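
Two of these analyses, activity wait times and rework loops, can be sketched with the standard library. This is an illustrative sketch under assumptions: the trace format, the function names, and the nearest-rank percentile method are all hypothetical choices, and the repeated escalation in case C is made-up sample data to show a rework loop.

```python
import math
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import mean

def percentile(values, pct):
    """Nearest-rank percentile (one of several common definitions)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

def wait_time_stats(traces):
    """Avg / P90 / max minutes elapsed before each activity starts."""
    waits = defaultdict(list)
    for steps in traces.values():
        for (_, prev_ts), (activity, ts) in zip(steps, steps[1:]):
            waits[activity].append((ts - prev_ts).total_seconds() / 60)
    return {
        activity: {"avg": mean(w), "p90": percentile(w, 90), "max": max(w)}
        for activity, w in waits.items()
    }

def rework_cases(traces):
    """Case ids whose trace visits at least one activity more than once."""
    return [
        case_id for case_id, steps in traces.items()
        if len({activity for activity, _ in steps}) < len(steps)
    ]

t0 = datetime(2024, 1, 1, 9, 0)

def after(minutes):
    return t0 + timedelta(minutes=minutes)

# Each trace is a chronological list of (activity, timestamp) pairs.
traces = {
    "A": [("ticket opened", t0), ("ticket escalated", after(60)),
          ("ticket resolved", after(180))],
    "B": [("ticket opened", t0), ("ticket resolved", after(30))],
    "C": [("ticket opened", t0), ("ticket escalated", after(60)),
          ("ticket escalated", after(120)), ("ticket resolved", after(240))],
}

stats = wait_time_stats(traces)
loops = rework_cases(traces)
print(stats)
print(loops)
```

Sorting the resulting dictionary by `avg` in descending order gives a bottleneck ranking of the kind described above.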
Identifies every unique process path and its frequency:
- Top variants — ranked frequency table with cycle time per variant
- Ideal vs actual — detects the happy path and measures deviation rate
- Priority/category breakdown — which paths do Critical tickets take?
- Deviating cases — individual cases that didn't follow the ideal path
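
As a rough illustration of variant analysis, the sketch below groups cases by their activity sequence and measures how many deviate from the happy path. The source does not say how the happy path is detected; treating the most frequent variant as the happy path is an assumption made here, and the function names are hypothetical.

```python
from collections import Counter

def variant_counts(traces):
    """Map each unique activity sequence (variant) to its case count."""
    return Counter(tuple(activities) for activities in traces.values())

def deviation_report(traces, happy_path=None):
    """Return (happy_path, deviating_case_ids, deviation_rate_pct)."""
    if happy_path is None:
        # Assumption: the most frequent variant is the ideal path.
        happy_path = variant_counts(traces).most_common(1)[0][0]
    deviating = [
        case_id for case_id, activities in traces.items()
        if tuple(activities) != tuple(happy_path)
    ]
    return tuple(happy_path), deviating, 100 * len(deviating) / len(traces)

traces = {
    "A": ["ticket opened", "ticket resolved"],
    "B": ["ticket opened", "ticket resolved"],
    "C": ["ticket opened", "ticket escalated", "ticket resolved"],
    "D": ["ticket opened", "ticket resolved"],
}

counts = variant_counts(traces)
happy_path, deviating, rate = deviation_report(traces)
print(counts.most_common())
print(happy_path, deviating, rate)
```

Passing an explicit `happy_path` lets a business-defined ideal flow override the frequency-based guess.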
Nine interactive sections, all filtered by source / priority / category:
📈 KPI Cards → 5 headline metrics
🔄 Process Flow → Sankey diagram of actual flow
🐌 Bottlenecks → Horizontal bar chart of wait times
🧩 Variant Analysis → Ideal vs actual + variant table
📦 Volume Analysis → Priority donut + category bar
⏱ Time Analysis → Cycle time P50/P75/P90/P95 + daily trend
🗓 Heatmap → Events by hour-of-day and weekday
🐢 Slow Transitions → Table of slowest handoffs
⚠️ Deviating Cases → Cases that took unusual paths
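
The heatmap section boils down to counting events per weekday and hour. A minimal standard-library sketch of that aggregation follows; the function name and the short weekday labels are assumptions, and the real dashboard presumably aggregates in the database or a DataFrame before handing the grid to Plotly.

```python
from collections import Counter
from datetime import datetime

WEEKDAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

def hour_weekday_counts(timestamps):
    """Count events per (weekday, hour-of-day) cell."""
    return Counter((WEEKDAYS[ts.weekday()], ts.hour) for ts in timestamps)

timestamps = [
    datetime(2024, 1, 1, 9, 15),   # a Monday
    datetime(2024, 1, 1, 9, 45),
    datetime(2024, 1, 2, 14, 0),   # a Tuesday
]

heat = hour_weekday_counts(timestamps)
print(heat)
```

Cells that never occur are simply absent from the counter and read as zero, which is convenient when filling a 7 × 24 grid.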
Phase 1 (complete):

| Layer | Technology |
|---|---|
| Language | Python 3.9+ |
| Data processing | Pandas · NumPy |
| Process mining | PM4Py · NetworkX |
| Database | PostgreSQL 13+ |
| ORM / connection | SQLAlchemy · psycopg2 |
| Dashboard | Streamlit 1.35 |
| Visualisation | Plotly |
| API | FastAPI · Uvicorn |
| Config | python-dotenv · PyYAML |

Phase 2 (in progress):

| Layer | Technology |
|---|---|
| ML models | Scikit-learn · XGBoost · LightGBM |
| LLM | Gemini 2.5 Flash · NVIDIA Llama 3.3 70B NIM |
| Frontend | React.js |

Phase 3 (planned):

| Layer | Technology |
|---|---|
| Vector DB | ChromaDB |
| RAG framework | LangChain |
| Document parsing | PyPDF · python-docx |
| Streaming | Kafka / Webhooks |

Phase 1 uses a synthetic helpdesk ticket dataset (1,000,000 tickets) with the following schema:
| Column | Type | Description |
|---|---|---|
| Ticket_ID | string | Unique ticket identifier |
| Date_Created | datetime | When the ticket was opened |
| Date_Resolved | datetime | When the ticket was closed (null if open) |
| Category | string | Software · Hardware · Network · HR · Security · Access |
| Subcategory | string | Specific issue type |
| Priority | string | Critical · High · Medium · Low |
| Status | string | Open · In Progress · On Hold · Pending · Resolved |
| Assigned_Team | string | Team responsible for the ticket |
| Escalated | boolean | Whether the ticket was escalated |
| Resolution_Time_Hrs | float | Hours from open to close |
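
A validation pass over this schema might look like the sketch below. It is a standard-library illustration, not the project's `validator.py`: which columns count as required, the date format string, and the function name `validate` are all assumptions. Rows are plain dicts of strings, as `csv.DictReader` would produce.

```python
from datetime import datetime

# Assumptions: the required columns and the timestamp format are illustrative.
REQUIRED = ("Ticket_ID", "Date_Created", "Priority", "Status", "Escalated")
DATE_FORMAT = "%Y-%m-%d %H:%M:%S"

def validate(rows):
    """Return a list of problems; an empty list means the rows pass."""
    if not rows:
        return ["file contains no rows"]
    missing = [col for col in REQUIRED if col not in rows[0]]
    if missing:
        return ["missing required columns: " + ", ".join(missing)]
    problems = []
    for number, row in enumerate(rows, start=1):
        if not row["Ticket_ID"]:
            problems.append(f"row {number}: Ticket_ID is empty")
        try:
            datetime.strptime(row["Date_Created"], DATE_FORMAT)
        except (TypeError, ValueError):
            problems.append(f"row {number}: Date_Created is not a valid timestamp")
    return problems

good_rows = [{
    "Ticket_ID": "1001", "Date_Created": "2024-01-01 09:00:00",
    "Priority": "High", "Status": "Resolved", "Escalated": "True",
}]
bad_rows = [{
    "Ticket_ID": "", "Date_Created": "01/02/2024",
    "Priority": "Low", "Status": "Open", "Escalated": "False",
}]

print(validate(good_rows))
print(validate(bad_rows))
```

Collecting every problem before failing, rather than stopping at the first, makes it easier to fix a bad export in one pass.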
Other supported datasets:
- BPI Challenge 2012 — Loan application process (262K events)
- BPI Challenge 2017 — Credit application (1.2M events)
- BPI Challenge 2019 — Purchase order handling (1.5M events)
- Sepsis Cases — Hospital patient pathways (15K events)
Prerequisites:
- Python 3.9+
- PostgreSQL 13+ running locally
- Git
git clone https://github.com/yourusername/business-process-mining.git
cd business-process-mining

pip install pandas==2.2.2 numpy==1.26.4 sqlalchemy==2.0.30 psycopg2-binary==2.9.9 \
python-dotenv==1.0.1 pyyaml==6.0.1 openpyxl==3.1.2 pm4py==2.7.11 \
networkx==3.3 fastapi==0.111.0 uvicorn==0.30.1 pydantic==2.7.1 \
python-multipart==0.0.9 streamlit==1.35.0 plotly==5.22.0

cp .env.example .env
# Edit .env with your PostgreSQL credentials

DB_USER=postgres
DB_PASSWORD=your_password
DB_HOST=localhost
DB_PORT=5432
DB_NAME=process_mining_db

psql -U postgres -c "CREATE DATABASE process_mining_db;"
psql -U postgres -d process_mining_db -f database/schema.sql

# Copy your helpdesk CSV to:
data/raw/helpdesk_tickets.csv

python -m etl.pipeline

Expected output:
[ETL] Extracted 1,000,000 raw rows
[ETL] Validation passed
[ETL] Output events: 2,499,911
[ETL] Loaded 2,499,911 rows, 1,000,000 cases
python -m process_mining.discovery # verify process discovery
python -m analytics.kpis # verify KPI computation
python -m analytics.bottlenecks # verify bottleneck detection
python -m process_mining.variants # verify variant analysis

streamlit run dashboard/app.py
# Opens at http://localhost:8501

uvicorn api.main:app --reload --port 8000
# Swagger UI at http://localhost:8000/docs

The pipeline runs in four sequential steps:
Extract → reads raw file (CSV / XES / Excel / JSON)
↓
Validate → checks required columns, nulls, row count, date format
↓
Transform → renames columns, parses timestamps, explodes tickets
into events, computes wait_time_mins + cycle_time_mins
↓
Load → bulk-writes to PostgreSQL in 50,000-row chunks
using a staging table for the cases upsert (fast)
Re-running the pipeline on the same file is safe — existing data is not duplicated. The etl_runs table records every execution with row counts, case counts, duration, and status for auditability.
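
The chunked, re-runnable load can be illustrated with a staging table. The sketch below uses the standard library's `sqlite3` as a stand-in for PostgreSQL so that it runs anywhere; the table layout and the function names are assumptions. In PostgreSQL the final step would typically be an `INSERT ... ON CONFLICT ... DO UPDATE` rather than SQLite's `INSERT OR REPLACE`.

```python
import sqlite3

def chunks(rows, size):
    """Yield successive slices of at most `size` rows."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]

def load_cases(conn, rows, chunk_size=50_000):
    """Load case rows via a staging table; safe to re-run on the same data."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS cases "
        "(case_id TEXT PRIMARY KEY, cycle_time_mins REAL)"
    )
    conn.execute(
        "CREATE TEMP TABLE IF NOT EXISTS cases_staging "
        "(case_id TEXT, cycle_time_mins REAL)"
    )
    conn.execute("DELETE FROM cases_staging")
    # Write to the staging table chunk by chunk.
    for chunk in chunks(rows, chunk_size):
        conn.executemany("INSERT INTO cases_staging VALUES (?, ?)", chunk)
    # Upsert: rows with an existing case_id replace the old row.
    conn.execute(
        "INSERT OR REPLACE INTO cases "
        "SELECT case_id, cycle_time_mins FROM cases_staging"
    )
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM cases").fetchone()[0]

conn = sqlite3.connect(":memory:")
rows = [("1001", 1440.0), ("1002", 120.0), ("1003", 60.0)]

first_run = load_cases(conn, rows, chunk_size=2)
second_run = load_cases(conn, rows, chunk_size=2)
print(first_run, second_run)
```

Because the primary key drives the upsert, the second run leaves the row count unchanged, which is the property that makes re-running the pipeline safe.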
The dashboard reads entirely from PostgreSQL; it never touches a raw file. Every section recomputes when the sidebar filters change, and results are cached per filter combination using @st.cache_data, so returning to a combination that was already computed is served from the cache.
Sidebar filters:
- Dataset (source)
- Priority: All · Critical · High · Medium · Low
- Category: All · Software · Hardware · Network · HR
- Top N variants (slider)
- Min edge count for process flow chart (slider)
Once the FastAPI server is running, all data is available as REST endpoints:
| Method | Endpoint | Description |
|---|---|---|
| GET | /api/kpis | All KPIs with optional filters |
| GET | /api/variants | Top process variants |
| GET | /api/graph | DFG edges for process flow chart |
| GET | /api/cases | Paginated case list |
| GET | /api/bottlenecks | Bottleneck activity ranking |
| POST | /api/etl/run | Trigger ETL pipeline run |
Interactive documentation at http://localhost:8000/docs
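
A client might build filtered request URLs along these lines. This is a hedged sketch: the query parameter names `priority` and `category` are assumptions based on the filters described earlier, so check the Swagger UI for the names the API really accepts. `build_url` is a hypothetical helper.

```python
from urllib.parse import urlencode

BASE_URL = "http://localhost:8000"

def build_url(endpoint, **filters):
    """Build an endpoint URL, skipping filters that are None."""
    query = urlencode({k: v for k, v in filters.items() if v is not None})
    return f"{BASE_URL}{endpoint}" + (f"?{query}" if query else "")

kpi_url = build_url("/api/kpis", priority="Critical", category="Network")
variants_url = build_url("/api/variants", priority=None)

print(kpi_url)
print(variants_url)
```

With the server running, the resulting URL can be fetched with `urllib.request.urlopen` and the response body decoded with `json.load`.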
Phase 1 (complete):
- ETL pipeline (CSV · XES · PostgreSQL)
- Process discovery engine (DFG · Heuristic Miner)
- KPI engine (cycle time · SLA · escalation · throughput)
- Bottleneck detection (wait times · transitions · rework)
- Variant analysis (ideal vs actual · priority breakdown)
- Streamlit dashboard (9 interactive sections)
- FastAPI backend (REST endpoints)

Phase 2 (in progress):
- Delay prediction model (XGBoost)
- SLA breach prediction (LightGBM)
- AI recommendation engine (Gemini 2.5 Flash)
- LLM root cause analysis
- Executive summary generation
- React.js frontend

Phase 3 (planned):
- Process Copilot (conversational Q&A)
- RAG knowledge assistant (SOPs · policies · ChromaDB)
- Real-time monitoring (webhooks · Kafka)
- Multi-agent process optimizer
- Digital twin simulation

- Real-time process monitoring — live KPI alerts when SLA thresholds are crossed
- Digital twin simulation — what-if scenario modelling before implementing changes
- Multi-agent optimization — autonomous agents that identify and propose process improvements
- Enterprise connectors — direct integration with SAP, Salesforce, ServiceNow, Jira
- Multi-tenant SaaS — support multiple organisations with isolated data and dashboards
MIT License — see LICENSE for details.
Built as a portfolio project demonstrating expertise in Business Analysis, Process Mining, Data Analytics, Machine Learning, GenAI, and Enterprise Software Development.