Data Science - Exploration and Classification Project
Binary cybersecurity incident classification using Microsoft's GUIDE dataset. Predicts BinaryIncidentGrade (0=Non-TP, 1=TP) from hierarchical security evidence.
| Model | Accuracy | Precision | Recall | F1 Score | ROC AUC |
|---|---|---|---|---|---|
| XGBoost | 0.8019 | 0.8274 | 0.7888 | 0.8076 | 0.9061 |
| Random Forest v2 | 0.7919 | 0.8252 | 0.7681 | 0.7956 | 0.8992 |
| Decision Tree | 0.7868 | 0.8077 | 0.7819 | 0.7946 | 0.8854 |
| MLP (Sklearn) | 0.7679 | 0.8101 | 0.7311 | 0.7686 | 0.8720 |
Dataset Stats: 9.5M evidence records → 1.6M alerts → 1M incidents | 441 MITRE ATT&CK techniques | 33 entity types
```
notebook/                                  # All analysis & modeling code (PRIMARY LOCATION)
├── guide_utils.py                         # Core preprocessing utilities
├── 1-Advanced_EDA.ipynb                   # Initial exploration & stats
├── 2-FeatureEngineering.ipynb             # 23 aggregated features & Smoothed Risk
├── 3-FeatureEngineering_Pipeline.ipynb    # Pipeline with anti-leakage split
├── 4-Model_Training_and_Comparison.ipynb  # Tuning (CV) & real-world eval
├── exploration/
│   ├── analisi_mitre_preprocessing.ipynb  # MITRE one-hot encoding
│   └── Initial_EDA.ipynb
└── tests/                                 # Model-specific development & tests
models/                                    # Trained models & metrics (Jan 2026 Revision)
docs/                                      # Documentation
├── methodology.md                         # Design decisions & rationale
├── CHANGELOG.md                           # Critical revision details
└── Classificazione.pdf                    # Technical document
```
- Install requirements: `pip install -r requirements.txt`
- Run the notebooks in `notebook/` in order (1 to 4).
- The final model comparison and evaluation is in `4-Model_Training_and_Comparison.ipynb`.
Hierarchical aggregation (Evidence → Incident level):
- Target: `max` (an incident is TP if any of its evidence is TP).
- Aggregations: `nunique` for alerts/countries, `count` for evidences, `mean`/`max` for risk scores.
- Temporal: `Duration_seconds`, `Hour_mean`, `IsWeekend`.
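The aggregation above can be sketched with pandas named aggregations; the column names (`IncidentId`, `AlertId`, `RiskScore`, `IsTP`) are illustrative stand-ins, not the exact GUIDE schema:

```python
import pandas as pd

# Toy evidence-level frame; real GUIDE columns differ.
evidence = pd.DataFrame({
    "IncidentId": [1, 1, 2, 2, 2],
    "AlertId":    [10, 11, 20, 20, 21],
    "Country":    ["IT", "IT", "US", "DE", "US"],
    "RiskScore":  [0.2, 0.9, 0.1, 0.3, 0.2],
    "IsTP":       [0, 1, 0, 0, 0],
})

# Collapse evidence rows to one row per incident
incidents = evidence.groupby("IncidentId").agg(
    BinaryIncidentGrade=("IsTP", "max"),       # TP if any evidence is TP
    Alert_nunique=("AlertId", "nunique"),
    Country_nunique=("Country", "nunique"),
    Evidence_count=("IsTP", "count"),
    Risk_mean=("RiskScore", "mean"),
    Risk_max=("RiskScore", "max"),
).reset_index()
```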
Advanced Encoding:
- Smoothed Risk Score: Bayesian target encoding on `AlertTitle` with smoothing (α=5) to prevent overfitting on rare categories.
- Frequency Encoding: applied to high-cardinality categorical features (`GeoLoc`, `EntityType`).
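A minimal sketch of both encodings; the helper names are hypothetical and only the α=5 smoothing comes from the description above:

```python
import pandas as pd

def smoothed_risk_score(df, cat_col, target_col, alpha=5):
    """Bayesian target encoding: shrink each category's mean target toward the global mean."""
    global_mean = df[target_col].mean()
    stats = df.groupby(cat_col)[target_col].agg(["mean", "count"])
    smoothed = (stats["count"] * stats["mean"] + alpha * global_mean) / (stats["count"] + alpha)
    return df[cat_col].map(smoothed)

def frequency_encode(df, col):
    """Replace each category with its relative frequency in the data."""
    return df[col].map(df[col].value_counts(normalize=True))

# Toy usage
df = pd.DataFrame({"AlertTitle": ["a", "a", "b"], "y": [1, 0, 1]})
df["SmoothedRisk"] = smoothed_risk_score(df, "AlertTitle", "y", alpha=5)
df["AlertTitle_freq"] = frequency_encode(df, "AlertTitle")
```

Note that in the real pipeline both encoders must be fit on the training split only and then mapped onto validation/test, otherwise the target leaks.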
MITRE processing:
- Parse semicolon-separated techniques.
- One-hot encode the top 20 techniques (by frequency).
- Add a `MitreCount` column with the total techniques per incident.
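The parsing step can be sketched as follows (the helper name and the exact `MitreTechniques` string format are assumptions based on the description):

```python
import pandas as pd

def parse_mitre(df, top_n=20):
    # Split semicolon-separated technique strings into lists (NaN -> empty list)
    techniques = df["MitreTechniques"].fillna("").str.split(";").apply(
        lambda ts: [t.strip() for t in ts if t.strip()]
    )
    df = df.copy()
    df["MitreCount"] = techniques.str.len()
    # One-hot encode only the top-N most frequent techniques
    top = techniques.explode().value_counts().head(top_n).index
    for tech in top:
        df[tech] = techniques.apply(lambda ts, t=tech: int(t in ts))
    return df

# Toy usage
out = parse_mitre(pd.DataFrame({"MitreTechniques": ["T1078;T1566", "T1078", None]}))
```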
| Function | Purpose |
|---|---|
| `load_guide_dataset(path, sample_frac=0.1)` | Load with memory-efficient sampling |
| `full_preprocessing_pipeline(path)` | Raw CSV → modeling-ready X, y |
| `extract_temporal_features(df)` | Hour, DayOfWeek, IsWeekend, Duration |
| `parse_mitre_techniques(df)` | Count & indicator columns for MITRE codes |
| `create_aggregated_features(df)` | Evidence→Incident aggregations |
| `prepare_for_modeling(df, target_col)` | Split X/y, drop IDs, stratified split |
XGBoost:
- Parameters: `max_depth=10`, `learning_rate=0.1`, `n_estimators=300`, `subsample=0.9`
- Evaluation: 5-fold stratified CV (best CV F1: 0.8679)
- Top features (from `feature_importance.csv`):
  - `SmoothedRisk_avg` (0.30 gain)
  - `T1078_sum` (0.12 gain)
  - `EvidenceRole_Related_sum` (0.06 gain)
  - `GeoLoc_freq_avg` (0.05 gain)
  - `EntityType_freq_mean` (0.04 gain)
- Random Forest v2: `n_estimators=150`, `max_depth=20`, `min_samples_split=5`. F1 Score (TP): 0.7956.
- Decision Tree: `max_depth=15`, `min_samples_split=20`. F1 Score (TP): 0.7946.
MLP:
- Sklearn: 2 hidden layers (128-64), ReLU, early stopping. F1 Score (TP): 0.7686.
- PyTorch: 3 layers (128-64-32), Dropout, BatchNorm (F1: 0.7651).
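The scikit-learn variant can be sketched as a configuration matching the description above (a sketch on synthetic data, not the exact notebook code):

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for the incident-level features
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

mlp = MLPClassifier(
    hidden_layer_sizes=(128, 64),  # two hidden layers, as reported
    activation="relu",
    early_stopping=True,           # stop when the held-out validation score plateaus
    random_state=0,
)
mlp.fit(X, y)
```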
- Anti-Leakage Split: data is split at the incident level, ensuring all evidences of a single incident stay within the same fold (train or val/test).
- Hyperparameter Tuning: all models tuned using stratified 5-fold cross-validation via `RandomizedSearchCV`.
- Missing Values:
  - `MitreTechniques` → top 20 OHE + count.
  - `SuspicionLevel` → `SuspicionLevel_IsMissing` indicator.
  - Otherwise, median/mode imputation or constant indicators.
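The anti-leakage split can be sketched with a group-aware splitter such as `GroupShuffleSplit`, keyed on an assumed incident-ID column (an illustration of the idea, not the pipeline's code):

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Toy evidence-level data: 6 rows belonging to 3 incidents
X = np.arange(12).reshape(6, 2)
y = np.array([0, 1, 0, 0, 1, 1])
incident_id = np.array([1, 1, 2, 2, 3, 3])

# Split by incident, so no incident's evidence is divided across folds
splitter = GroupShuffleSplit(n_splits=1, test_size=0.34, random_state=42)
train_idx, test_idx = next(splitter.split(X, y, groups=incident_id))

# No incident appears on both sides of the split
assert set(incident_id[train_idx]).isdisjoint(incident_id[test_idx])
```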
- `Classificazione.pdf`: technical document about our work
- `methodology.md`: feature motivation and methodology adopted
Primary Metrics: F1 Score (TP) and ROC AUC.

```python
from sklearn.metrics import classification_report, roc_auc_score

print(classification_report(y_test, y_pred, target_names=['Non-TP', 'TP']))
print(f"ROC AUC: {roc_auc_score(y_test, y_pred_proba)}")
```

Performance Analysis (XGBoost v2):
- ROC AUC: 0.9061 (excellent discrimination between threats and noise)
- Recall: 0.79 (detects ~79% of actual threats)
- Precision: 0.83 (when it alerts, it is correct 83% of the time)
When adding features:
- Aggregate to incident level in `2-FeatureEngineering.ipynb`.
- Update `guide_utils.create_aggregated_features()`.
- Re-run the rebalancing and model training notebooks.
- Compare metrics in `4-Model_Training_and_Comparison.ipynb`.
Dataset: Microsoft GUIDE (Guided Response Dataset) - Public research dataset
