Author: Smriti Goyal (github.com/SmritiGoyal)
End-to-end production-style pipeline that predicts repair turn-around time (RTAT) for a Fortune 500 appliance manufacturer's US service network, then translates predictions into segment-level resource allocation recommendations across four operational levers.
Repair turn-around time prediction sits at the intersection of operations research and applied machine learning, and it's harder than it first appears for three reasons:
- Operational variability is structural, not noise. Repair durations depend on engineer availability, parts logistics, channel routing rules, and customer responsiveness — five mostly-independent systems that each have their own failure modes. A naive average over the cohort hides 40-day swings between adjacent segments. The model needs to capture interactions across operational dimensions, not just main effects.
- Leakage hides in process timing, not just in target columns. Some features are populated at intake (safe), some during repair (only safe if the model is being scored mid-repair), some at close (post-hoc, unsafe for any forward-looking prediction). The pipeline introduces an explicit timing classification — INTAKE / TRAINING / POST_INTAKE / MEDIUM / POST_CLOSE — to keep that discipline mechanical rather than mental. The five-test automated audit catches many subtle violations human review misses — though not all: a fold-level encoder leak slipped past it and was caught by a separate fold-scoping ablation (see Key Decision 4).
- Predictions only matter if they translate to interventions. A 4.58-day MAE is operationally meaningless without saying which 28 Market_Category × Channel segments to act on and which of four operational levers to pull. The downstream lever decomposition is the deliverable; the model is the input.
This project addresses all three within a reproducible single-machine pipeline.
The pipeline processes 2.19M repair records across 4 source years, joins three operational tables (master records, parts ledger, callback records), engineers 41 leakage-audited features, trains and compares 15 candidate models (7 classifiers + 8 regressors), and produces a prioritized action plan for the worst-performing service segments.
Validated on a strict, locked 2026 out-of-time holdout using a select-then-refit protocol — select on 2023-2024 → 2025, refit on all 2023-2025, score 2026 exactly once:
| Metric | Value | Note |
|---|---|---|
| Holdout AUC at T=5 | 0.807 | LightGBM (XGBoost 0.812 — tied); 41 features, 2026 holdout |
| Holdout F1 at T=5 | 0.729 | At decision threshold 0.50 |
| Holdout MAE | 4.58 days | 32% improvement over the 6.77-day mean baseline |
| Holdout R² | 0.36 | |
| Holdout cohort size | 70,250 repairs | 2026 (Jan-Apr) — locked, scored once |
| Training cohort (selection) | 1,060,649 repairs | 2023-2024 |
| Training cohort (final refit) | 1,570,579 repairs | 2023-2025 |
| Validation cohort | 509,930 repairs | 2025 |
The model is selected on the strict 2023-2024 → 2025 split (T=5 validation AUC ~0.77), then refit on all 2023-2025 and scored once on the 2026 holdout. The holdout AUC sits slightly above the selection-fold validation because the final model trains on an extra year of data; generalization is evidenced by consistent performance across all four thresholds and by the NPS external check, not by a single tight validation-to-holdout gap.
These numbers are reproducible from this code on the source data. The lever decomposition shows the top operational priorities split across four levers — 43% engineer deployment, 29% parts logistics, 18% channel process, 11% repair complexity — concentrated in 7 Market_Category × Channel segments that appear in the top 10 across every operating threshold (T=3, 5, 7, 10).
The choices below are the ones that drove the result. Each came from a measured failure of the alternative or an explicit constraint in the data.
Train on 2023-2024, validate on 2025, hold out 2026 — used exactly once at the end. A random split would let the model see future engineer assignments, future parts logistics shifts, and future seasonal patterns — all of which leak in subtle ways the validation log loss wouldn't catch.
The discipline is enforced by fold-scoped encoder fitting and a select-then-refit protocol: every training-class aggregate is fit on the 2023-2024 fold for model selection, then refit on all 2023-2025 for the final model, which scores the 2026 holdout once.
This matters because an earlier version of the pipeline fit those aggregates — engineer historical mean (the model's strongest feature), tier/channel/division/month means, city/state target encoding, and the segment delivery median — on the full 2023-2025 cohort and then split into train/validation without refitting. That let 2025 information reach the validation rows, inflating the validation metrics and the apparent validation-to-holdout gap. The fix scopes every aggregate to data available before the rows it serves; the holdout numbers in this README are leak-free.
Every feature gets one of five labels: INTAKE, TRAINING, POST_INTAKE, MEDIUM, POST_CLOSE. Only INTAKE and TRAINING-class features are deployment-safe for a model scored at repair intake. The MEDIUM-class features (is_ter_repair, is_sealed_repair, eng_channel_risk) are flagged for operational confirmation rather than silently included or excluded — the methodology surfaces the assumption rather than burying it. This taxonomy is enforced by the automated audit, so timing discipline doesn't depend on remembering it during feature engineering.
The 41 features split into two model-specific subsets. CORE (27 features) has 0% missingness and is safe for linear models after median fill. EXTENDED (37 features) adds DMS-dependent features for tree models that handle nulls natively — those features have ~75% missingness because only ~28% of repairs route parts through DMS. This isn't a stylistic choice: linear models on EXTENDED produce worse holdout performance because median fill on 75%-null features destroys signal; tree models on CORE leave money on the table. The split is informed by both validation metrics and operational deployability.
Human review catches obvious leakage (target columns, hash of identifier) and misses subtle leakage (target encoding without proper folding, future-data proxies, train-vs-holdout distribution shift). The audit runs five mechanical tests on every feature: correlation with target above class-conditional levels, perfect-predictor flag, target-encoding folding check, future-data proxy detection, train-vs-holdout stability. Of 41 features, 33 pass clean; 8 are flagged for human review with explicit reasons.
What the audit does not catch — and a leak it missed. The audit compares training against the 2026 holdout, so it is blind to a leak confined to the validation fold. The fold-level encoder leak described in Decision 1 passed all five tests — the leaked aggregates were "train-only" relative to 2026, so the train-vs-holdout comparison saw nothing wrong, even while 2025 was bleeding into validation. It was caught by a separate fold-scoping ablation, not by the audit. A defensible v2 audit needs a sixth test: refit every encoder per fold and assert the validation metric is stable.
Baseline → linear (Ridge, Lasso) → tree (depth-6) → ensemble (RF) → boosting (XGBoost, LightGBM), with the lift at each stage justifying the production choice rather than asserting it. On validation, performance improves monotonically from the mean baseline (MAE 6.77 days) down through the tree models, which cluster tightly (RF 4.93, XGBoost 4.92, LightGBM 4.95). On the final 2026 holdout, LightGBM and XGBoost land within ~0.005 AUC of each other at every threshold (T=5: 0.807 vs 0.812) — effectively tied. LightGBM is retained as primary for regression stability, ~25% faster training, and native categorical handling; XGBoost is kept as a validated equal-performance alternative. (Hyperparameters were not re-tuned after the leak fix, so single-fit LightGBM underfits at default iterations in the validation bake-off — see "What I would do differently.")
A model that predicts which repairs will run late is useless without saying what to do about it. The downstream prioritization layer scores every Market_Category × Channel segment (≥500 repairs, 28 segments total) against four operational levers — engineer deployment, parts logistics, channel process, repair complexity — using a deterministic scoring rule per lever. Each segment gets a primary and secondary lever assignment. This is the business deliverable; the model is the input.
NPS responses (Promoter / Passive / Detractor) were explicitly excluded from the training feature set. Including them would have been target leakage — NPS scores correlate strongly with repair duration. Using them post-hoc to validate the model's business relevance yields a 5.2-point promoter rate gap between the lowest-risk and highest-risk predicted buckets (computed out-of-sample), confirming the model addresses a real customer-experience problem rather than just an internal operational metric.
rtat-optimization/
├── src/
│ ├── ingestion.py Step 1-4: Excel → integrated parquet
│ ├── eda.py Step 5: first-pass EDA + hypothesis list
│ ├── feature_engineering.py Step 6: 41 model features + leakage review
│ ├── modeling.py Step 7: 15 candidate models + comparison
│ ├── prioritization.py Step 8: priority matrix + 4-lever decomposition
│ ├── leakage_audit.py Step 9: 5-test feature audit
│
├── data/
│ ├── raw/ (gitignored — client data)
│ └── README.md Schema documentation
│
├── outputs/ (gitignored — regenerable)
│ ├── interim/ Ingestion artifacts
│ ├── eda/ Stats tables + charts
│ ├── features/ Feature parquets + documentation
│ ├── models/ Trained models + comparison tables
│ ├── prioritization/ Priority matrix + lever decomposition
│ └── README.md Artifact catalog
│
├── docs/
│ ├── methodology.md Extended technical writeup
│ └── deck.md Stakeholder presentation (markdown)
│
├── .gitignore
├── requirements.txt
├── LICENSE MIT
└── README.md (this file)
Each of the six pipeline files is independently runnable. Each consumes its predecessor's parquet/CSV artifacts and produces its own. No shared state, no orchestrator file — the file structure is the orchestration.
| Step | File | Input | Output | Approx runtime |
|---|---|---|---|---|
| 1-2 | ingestion.py § 5-6 |
Raw Excel | Inventory + key validation CSVs | 15-30 min |
| 3 | ingestion.py § 4 |
Master Excel | Cohort summary CSVs | 5 min |
| 4A-C | ingestion.py § 5-7 |
All three Excel sources | master_train.parquet, master_holdout.parquet |
15-25 min |
| 5 | eda.py |
master_train.parquet |
11 CSV tables + 10 PNG charts + hypothesis list | 1 min |
| 6 | feature_engineering.py |
Master parquets | feature_train.parquet, feature_holdout.parquet (+ a 2023-2025-fit feature_train_final.parquet / feature_holdout_final.parquet for the final refit) + 4 doc CSVs |
<1 min |
| 7 | modeling.py |
Feature parquets | model pickles + result CSVs (incl. final refit holdout) | 3-5 min |
| 8 | prioritization.py |
Models + features | 7 priority CSVs | <1 min |
| 9 | leakage_audit.py |
Features + models | leakage_audit.csv |
<1 min |
The dominant cost is Step 4B (parts ledger streaming) at 15-25 minutes — Excel I/O on ~3M rows. Everything after ingestion runs against parquets and completes in a few minutes combined.
git clone https://github.com/SmritiGoyal/rtat-optimization.git
cd rtat-optimization
python -m venv .venv
source .venv/bin/activate # macOS/Linux
.venv\Scripts\activate # Windows
pip install -r requirements.txtThis pipeline was developed against client-confidential field service data and the raw data is not included in this repository. To run end-to-end you would need access to source files matching the documented schema. See data/README.md for the full schema specification, including the seven service channels, the cohort filter rules, and the reclaim flag columns.
Run the stages in order:
python ingestion.py # Steps 1-4: produces master_train.parquet, master_holdout.parquet
python eda.py # Step 5: first-pass EDA, hypothesis list
python feature_engineering.py # Step 6: 41 features (fold-scoped + final refit), leakage review
python modeling.py # Step 7: classifiers + regressors, threshold sweep, final holdout
python prioritization.py # Step 8: priority matrix, 4-lever decomposition
python leakage_audit.py # Step 9: 5-test feature auditEach file is independently re-runnable. Re-running any stage overwrites that stage's outputs with deterministic identical results given the same inputs.
| Model | Val MAE (days) | Val RMSE | Val R² |
|---|---|---|---|
| 1. Mean baseline | 6.77 | 10.98 | -0.001 |
| 2. Segment mean (tier × channel) | 5.97 | 10.19 | 0.138 |
| 3. Ridge (L2) | 5.39 | 9.47 | 0.255 |
| 4. Lasso (L1) | 5.37 | 9.47 | 0.256 |
| 5. Decision Tree (depth=6) | 5.15 | 9.36 | 0.272 |
| 6. Random Forest | 4.93 | 9.13 | 0.309 |
| 7. XGBoost | 4.92 | 9.06 | 0.319 |
| 8. LightGBM | 4.95 | 9.18 | 0.301 |
Lasso zeroed 2 of 27 CORE features at α=0.01 — modest regularization. On validation regression the tree models cluster tightly and XGBoost (4.92) edges LightGBM (4.95) within noise. The final LightGBM model, refit on all 2023-2025, reaches 4.58 MAE on the 2026 holdout (RMSE 8.36, R² 0.36).
| Model | Val AUC | Val F1 | Val Precision | Val Recall |
|---|---|---|---|---|
| 1. Majority class | — | 0.000 | 0.000 | 0.000 |
| 2. Logistic Regression (L2) | 0.761 | 0.683 | 0.649 | 0.720 |
| 3. Decision Tree (d=3, interpretable) | 0.738 | 0.697 | 0.621 | 0.793 |
| 4. Decision Tree (d=6) | 0.762 | 0.689 | 0.646 | 0.738 |
| 5. Random Forest | 0.788 | 0.713 | 0.650 | 0.789 |
| 6. XGBoost | 0.796 | 0.713 | 0.677 | 0.754 |
| 7. LightGBM | 0.759 | 0.686 | 0.669 | 0.705 |
On this single-fit validation bake-off XGBoost leads; LightGBM underfits at default iterations (best_iter ~29) because the hyperparameters were not re-tuned after the leak fix. Per-threshold and on the final holdout the two are tied (see below), and LightGBM is retained as primary for regression stability and faster training.
| Threshold | Selection val AUC (2023-24 model) | Final holdout AUC | Holdout F1 | Notes |
|---|---|---|---|---|
| T=3 (strict) | 0.749 | 0.787 | 0.232 | 27.9% positive, is_unbalance=True |
| T=5 (primary) | 0.759 | 0.807 | 0.729 | 46.8% positive, balanced |
| T=7 (relaxed) | 0.802 | 0.827 | 0.842 | 62.3% positive |
| T=10 (loose) | 0.826 | 0.858 | 0.903 | 75.4% positive, is_unbalance=True |
Holdout regression MAE: 4.58 days (RMSE 8.36, R² 0.36). The holdout AUC sits above the selection-fold validation at every threshold because the final model is refit on the extra 2025 year before scoring 2026 — the consistent across-threshold pattern, not a single tight gap, is the generalization evidence. XGBoost is within ~0.005 AUC of LightGBM at every threshold (T=5 holdout 0.812).
Of 41 features in the production model, the automated 5-test audit returns:
| Verdict | Count | Examples |
|---|---|---|
| ✓ CLEAN | 33 | All TRAINING and INTAKE-class features |
| ⚠ CONFIRM WITH OPS | 3 | is_ter_repair, is_sealed_repair, eng_channel_risk — MEDIUM timing class |
| ⚠ REVIEW | 2 | parts_has_arrival_flag, parts_delivery_tier — POST_CLOSE class |
| ⚠ STABILITY FLAG | 3 | month_of_year, quarter, plus one feature where 2026 holdout (partial year) shifts distribution from 2023-2025 training |
The three "STABILITY FLAG" features are flagged for human review, not for removal — the train-vs-holdout shift is by design (the holdout is a temporal slice covering Jan-Apr 2026, while training spans full calendar years). The two "REVIEW" features are documented as having uncertain assignment timing; the methodology explicitly uses seg_delivery_days_hist (a training-cohort segment median) as the deployment-safe substitute. See docs/methodology.md for the full leakage discipline.
These verdicts are unchanged by the encoder fix — and that is precisely the audit's blind spot: it compares training against the 2026 holdout and cannot detect a leak confined to the 2025 validation fold (see Key Decision 4). Test 5 (engineer historical proxy stability across train and holdout) returned a 0.74-day gap, well within the 1.0-day tolerance, confirming the engineer encoder is stable between training and the 2026 holdout — though, by construction, this test does not detect a validation-fold leak.
The downstream business deliverable. Every Market_Category × Channel segment with ≥500 repairs (28 segments total) is scored against four operational levers and assigned a primary and secondary lever:
| Lever | Segments | Share | Scoring rule |
|---|---|---|---|
| Engineer deployment | 12 | 43% | Q4 engineer rate vs population (×1.30 strong, ×1.10 weak) |
| Parts logistics | 8 | 29% | Parts rate × 1.15 + delivery days × 1.30 |
| Channel process | 5 | 18% | max(0, channel_risk_ordinal - 3) |
| Repair complexity | 3 | 11% | Sealed rate × 1.30 + reclaim rate × 1.30 |
Top recommendation (Urban × DMS) and the full ranking are in outputs/prioritization/final_recommendation.csv.
NPS responses (Promoter / Passive / Detractor) were explicitly excluded from features to avoid target leakage. Using them only for post-hoc validation — 2025 NPS responders scored out-of-sample by the 2023-2024 selection model (which never saw 2025) — predicted-risk bucket tracks NPS sentiment in the right direction:
| Predicted risk | Promoter rate | Detractor rate |
|---|---|---|
| Very low (0-20%) | 62.6% | 25.0% |
| Very high (80-100%) | 57.4% | 28.8% |
| Gap | 5.2pp | 3.8pp |
Predicted lateness drives a 5.2-point promoter gap and 3.8-point detractor gap — confirming RTAT optimization addresses a real customer-experience problem, not just an operational metric. The signal cannot be tuned from inside the pipeline because NPS is never a feature. (An earlier build reported a wider gap; that version was inflated by the encoder leak since corrected. Full breakdown across all five buckets: outputs/prioritization/nps_validation.csv, 42,451 responders.)
docs/methodology.md— full methodology writeup: cohort filter, target encoding with smoothing (k=30), engineer historical mean computation discipline, two-track CORE/EXTENDED feature design, hyperparameter selection, leakage audit philosophy and its blind spotdocs/deck.md— stakeholder presentation (markdown source)data/README.md— full schema specification with channel glossary and cohort filter rulesoutputs/README.md— artifact catalog: every output file by stage with description
The 41 features split into two model-specific subsets:
- CORE (27 features): 0% missing, safe for linear models after median fill. Excludes DMS-dependent features (parts line count, order quantity, multi-line flag, etc.) which have ~75% missingness because only ~28% of repairs route parts through DMS.
- EXTENDED (37 features in MODEL_FEATURES): adds DMS-dependent features for tree models that handle missing values natively. Excludes
parts_order_to_arrival_days_safe(training-EDA only — the deployment-safe substitute isseg_delivery_days_hist).
This isn't a stylistic choice. Linear models on the EXTENDED set produce worse holdout performance because median fill on 75%-null features destroys signal; tree models on the CORE set leave money on the table. The design choice is informed by both validation metrics and operational deployability.
Honest reflections after building this:
- Re-tune hyperparameters after the leak fix. The current LightGBM/XGBoost settings were chosen before the fold-scoped encoder fix; single-fit LightGBM now underfits at default iterations. A fresh Optuna sweep on the clean features would likely recover a few points and is the highest-value next step.
- A sixth audit test for fold-level leaks. The 5-test audit missed a validation-fold encoder leak because it only compares training against the 2026 holdout. Adding a per-fold encoder-refit stability check would catch the class of leak that actually occurred here.
- Stronger time-series cross-validation. The current setup splits 2023-2024 → train, 2025 → val, 2026 → holdout. A rolling expanding-window CV across all training quarters would give a tighter estimate of expected drift.
- Operational confirmation of MEDIUM-timing features. The three features flagged by the audit (
is_ter_repair,is_sealed_repair,eng_channel_risk) currently rely on documented (but not operationally confirmed) intake-time availability. A 1-hour meeting with the ops team would either ratify the assumption or surface a real leakage. - Model monitoring plan. This codebase trains and validates a model. A production deployment would need a drift detector on
engineer_hist_mean_rtatdistributions, weekly KS tests on incoming features, and a reclaim-rate monitor (since reclaims feed back into next month's training data). - Per-segment models. The current model is monolithic. For the top 5-10 priority segments, a specialized per-segment model might outperform — at the cost of operational complexity.
- Cost-weighted thresholds. T=5 is the chosen operating point on classifier-balance grounds. A real deployment would weight false-positives (over-allocated resources) against false-negatives (missed delays causing NPS hits) with actual business costs.
MIT — see LICENSE.