DataScience-Golddiggers/Dora-the-Data-Explorer

# Dora the Data Explorer 🔍

Data Science - Exploration and Classification Project

Built with Python, PyTorch, Hugging Face, Jupyter, and CUDA; developed in PyCharm on Windows and macOS.


Binary cybersecurity incident classification using Microsoft's GUIDE dataset. Predicts BinaryIncidentGrade (0=Non-TP, 1=TP) from hierarchical security evidence.

🎯 Performance

| Model | Accuracy | Precision | Recall | F1 Score | ROC AUC |
|---|---|---|---|---|---|
| XGBoost | 0.8019 | 0.8274 | 0.7888 | 0.8076 | 0.9061 |
| Random Forest v2 | 0.7919 | 0.8252 | 0.7681 | 0.7956 | 0.8992 |
| Decision Tree | 0.7868 | 0.8077 | 0.7819 | 0.7946 | 0.8854 |
| MLP (Sklearn) | 0.7679 | 0.8101 | 0.7311 | 0.7686 | 0.8720 |

Dataset stats: 9.5M evidence records → 1.6M alerts → 1M incidents | 441 MITRE ATT&CK techniques | 33 entity types

πŸ“ Project Structure

```
notebook/          # All analysis & modeling code (PRIMARY LOCATION)
├── guide_utils.py                        # Core preprocessing utilities
├── 1-Advanced_EDA.ipynb                  # Initial exploration & stats
├── 2-FeatureEngineering.ipynb            # 23 aggregated features & smoothed risk
├── 3-FeatureEngineering_Pipeline.ipynb   # Pipeline with anti-leakage split
├── 4-Model_Training_and_Comparison.ipynb # Tuning (CV) & real-world eval
├── exploration/
│   ├── analisi_mitre_preprocessing.ipynb # MITRE one-hot encoding
│   └── Initial_EDA.ipynb
└── tests/                                # Model-specific development & tests

models/            # Trained models & metrics (Jan 2026 revision)

docs/              # Documentation
├── methodology.md       # Design decisions & rationale
├── CHANGELOG.md         # Critical revision details
└── Classificazione.pdf  # Technical document
```

πŸš€ Quick Start

  1. Install requirements: pip install -r requirements.txt
  2. Run notebooks in notebook/ in order (1 to 4).
  3. The final model comparison and evaluation is in 4-Model_Training_and_Comparison.ipynb.

πŸ“Š Feature Engineering

Hierarchical aggregation (Evidence β†’ Incident level):

  • Target: max (Incident is TP if any evidence is TP)
  • Aggregations: nunique for Alerts/Countries, count for evidences, mean/max for Risk scores.
  • Temporal: Duration_seconds, Hour_mean, IsWeekend.
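
The aggregation above can be sketched with pandas named aggregation. Column names such as `IncidentId`, `AlertId`, and `RiskScore` are illustrative stand-ins, not the exact GUIDE schema; the real logic lives in `create_aggregated_features()`:

```python
import pandas as pd

# Toy evidence-level frame; real GUIDE columns differ in name and number.
evidence = pd.DataFrame({
    "IncidentId": [1, 1, 1, 2, 2],
    "AlertId":    [10, 10, 11, 20, 21],
    "Country":    ["IT", "IT", "DE", "US", "US"],
    "IsTP":       [0, 1, 0, 0, 0],
    "RiskScore":  [0.2, 0.9, 0.5, 0.1, 0.3],
})

# Collapse evidence rows to one row per incident.
incidents = evidence.groupby("IncidentId").agg(
    Target=("IsTP", "max"),              # incident is TP if any evidence is TP
    AlertCount=("AlertId", "nunique"),
    CountryCount=("Country", "nunique"),
    EvidenceCount=("IsTP", "count"),
    Risk_mean=("RiskScore", "mean"),
    Risk_max=("RiskScore", "max"),
).reset_index()
```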

Advanced encoding:

- Smoothed Risk Score: Bayesian target encoding on `AlertTitle` with smoothing (α=5) to prevent overfitting on rare categories.
- Frequency encoding: applied to high-cardinality categorical features (`GeoLoc`, `EntityType`).
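
A minimal sketch of the smoothed target encoding, assuming the standard shrinkage formula (category mean pulled toward the global mean with weight α); the helper name and toy columns are hypothetical:

```python
import pandas as pd

def smoothed_risk(df, cat_col, target_col, alpha=5.0):
    """Bayesian target encoding: shrink per-category target means toward
    the global mean; rare categories get pulled harder toward it."""
    global_mean = df[target_col].mean()
    stats = df.groupby(cat_col)[target_col].agg(["mean", "count"])
    smoothed = (stats["count"] * stats["mean"] + alpha * global_mean) / (stats["count"] + alpha)
    return df[cat_col].map(smoothed)

train = pd.DataFrame({
    "AlertTitle": ["Phish", "Phish", "Phish", "BruteForce"],
    "Target":     [1, 1, 0, 1],
})
train["SmoothedRisk"] = smoothed_risk(train, "AlertTitle", "Target")
```

To avoid target leakage, the encoding must be fit on the training split only and then mapped onto validation/test categories.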

MITRE processing:

- Parse semicolon-separated techniques.
- One-hot encode the top 20 techniques by frequency.
- Add a `MitreCount` column with the total techniques per incident.
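
The three MITRE steps can be sketched as follows (toy data; the project keeps the top 20 techniques, only the top 2 are kept here):

```python
import pandas as pd

df = pd.DataFrame({"MitreTechniques": ["T1078;T1110", "T1078", None]})

# 1. Parse semicolon-separated technique lists (missing -> empty list).
parsed = df["MitreTechniques"].fillna("").str.split(";")
parsed = parsed.apply(lambda ts: [t for t in ts if t])

# 3. Total techniques per incident.
df["MitreCount"] = parsed.apply(len)

# 2. One-hot indicators for the top-k techniques by frequency.
top = parsed.explode().value_counts().head(2).index
for tech in top:
    df[tech] = parsed.apply(lambda ts: int(tech in ts))
```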

πŸ› οΈ Utilities (guide_utils.py)

Function Purpose
load_guide_dataset(path, sample_frac=0.1) Load with memory-efficient sampling
full_preprocessing_pipeline(path) Raw CSV β†’ modeling-ready X, y
extract_temporal_features(df) Hour, DayOfWeek, IsWeekend, Duration
parse_mitre_techniques(df) Count & indicator for MITRE codes
create_aggregated_features(df) Evidence→Incident aggregations
prepare_for_modeling(df, target_col) Split X/y, drop IDs, stratified split

πŸ“ˆ Model Specifications

### XGBoost

- Parameters: `max_depth=10`, `learning_rate=0.1`, `n_estimators=300`, `subsample=0.9`
- Evaluation: 5-fold stratified CV (best CV F1: 0.8679)
- Top features (from `feature_importance.csv`):
  1. `SmoothedRisk_avg` (0.30 gain)
  2. `T1078_sum` (0.12 gain)
  3. `EvidenceRole_Related_sum` (0.06 gain)
  4. `GeoLoc_freq_avg` (0.05 gain)
  5. `EntityType_freq_mean` (0.04 gain)
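
The evaluation protocol (stratified 5-fold CV) can be sketched as follows. A scikit-learn gradient-boosting model stands in so the snippet runs without `xgboost` installed; the reported XGBoost parameters are kept as a kwargs dict for when it is:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Reported XGBoost configuration (pass to xgboost.XGBClassifier when available).
xgb_params = dict(max_depth=10, learning_rate=0.1, n_estimators=300, subsample=0.9)

# Stand-in model so this sketch runs with scikit-learn alone.
model = GradientBoostingClassifier(max_depth=3, n_estimators=50, random_state=0)

# Synthetic stand-in data; the real pipeline feeds the aggregated GUIDE features.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="f1")  # one F1 per fold
```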

### Random Forest

- `n_estimators=150`, `max_depth=20`, `min_samples_split=5`
- F1 score (TP): 0.7956

### Decision Tree

- `max_depth=15`, `min_samples_split=20`
- F1 score (TP): 0.7946

### MLP

- Sklearn: 2 hidden layers (128-64), ReLU, early stopping. F1 score: 0.7686.
- PyTorch: 3 hidden layers (128-64-32), dropout, batch norm. F1 score: 0.7651.
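
A minimal sketch of the sklearn configuration on synthetic data (the real features and reported F1 scores come from the GUIDE pipeline, not this toy set):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for the aggregated incident features.
X, y = make_classification(n_samples=400, n_features=12, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Two hidden layers (128, 64), ReLU, with early stopping on a held-out slice.
mlp = MLPClassifier(hidden_layer_sizes=(128, 64), activation="relu",
                    early_stopping=True, random_state=0, max_iter=200)
mlp.fit(X_tr, y_tr)
acc = mlp.score(X_te, y_te)
```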

πŸŽ“ Key Insights

  1. Anti-Leakage Split: Data is split at the Incident level, ensuring all evidences of a single incident stay within the same fold (Train or Val/Test).
  2. Hyperparameter Tuning: All models tuned using Stratified 5-fold Cross-Validation via RandomizedSearchCV.
  3. Missing Values:
  • MitreTechniques β†’ Top 20 OHE + Count.
  • SuspicionLevel β†’ SuspicionLevel_IsMissing indicator.
  • Use median/mode imputation or constant indicators.
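
The anti-leakage split can be sketched with scikit-learn's `GroupShuffleSplit`, grouping evidence rows by incident id (toy arrays here; the project's actual split is in `3-FeatureEngineering_Pipeline.ipynb`):

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Evidence-level rows: the group key keeps every row of an incident together.
incident_ids = np.array([1, 1, 1, 2, 2, 3, 3, 4, 5, 5])
X = np.arange(len(incident_ids)).reshape(-1, 1)
y = np.array([0, 1, 0, 0, 0, 1, 1, 0, 1, 1])

gss = GroupShuffleSplit(n_splits=1, test_size=0.4, random_state=0)
train_idx, test_idx = next(gss.split(X, y, groups=incident_ids))

# No incident appears on both sides of the split.
overlap = set(incident_ids[train_idx]) & set(incident_ids[test_idx])
```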

πŸ“– Documentation

πŸ” Evaluation

Primary Metric: F1 Score (TP) and ROC AUC

from sklearn.metrics import classification_report, roc_auc_score

print(classification_report(y_test, y_pred, target_names=['Non-TP', 'TP']))
print(f"ROC AUC: {roc_auc_score(y_test, y_pred_proba)}")

Performance analysis (XGBoost v2):

- ROC AUC: 0.9061 (strong discrimination between threats and noise)
- Recall: 0.79 (detects ~79% of actual threats)
- Precision: 0.83 (when it alerts, it is correct 83% of the time)

🀝 Contributing

When adding features:

  1. Aggregate to Incident level in 2-FeatureEngineering.ipynb
  2. Update guide_utils.create_aggregated_features()
  3. Re-run balancement and model training notebooks
  4. Compare metrics in 9-ModelComparison.ipynb

πŸ“„ License

Dataset: Microsoft GUIDE (Guided Response Dataset) - Public research dataset
