DataScience-Golddiggers/Dora-the-Data-Explorer

# Dora the Data Explorer 🔍

Data Science - Exploration and Classification Project

Built with Python, PyTorch, Hugging Face, Jupyter, and CUDA; developed in PyCharm on Windows and macOS.


Binary cybersecurity incident classification using Microsoft's GUIDE dataset. Predicts BinaryIncidentGrade (0=Non-TP, 1=TP) from hierarchical security evidence.

🎯 Performance

| Model | Accuracy | Precision | Recall | F1 Score | ROC AUC |
|---|---|---|---|---|---|
| XGBoost | 0.8019 | 0.8274 | 0.7888 | 0.8076 | 0.9061 |
| Random Forest v2 | 0.7919 | 0.8252 | 0.7681 | 0.7956 | 0.8992 |
| Decision Tree | 0.7868 | 0.8077 | 0.7819 | 0.7946 | 0.8854 |
| MLP (Sklearn) | 0.7679 | 0.8101 | 0.7311 | 0.7686 | 0.8720 |

Dataset stats: 9.5M evidence records → 1.6M alerts → 1M incidents | 441 MITRE ATT&CK techniques | 33 entity types

πŸ“ Project Structure

```
notebook/          # All analysis & modeling code (PRIMARY LOCATION)
├── guide_utils.py                        # Core preprocessing utilities
├── 1-Advanced_EDA.ipynb                  # Initial exploration & stats
├── 2-FeatureEngineering.ipynb            # 23 aggregated features & smoothed risk
├── 3-FeatureEngineering_Pipeline.ipynb   # Pipeline with anti-leakage split
├── 4-Model_Training_and_Comparison.ipynb # Tuning (CV) & real-world eval
├── exploration/
│   ├── analisi_mitre_preprocessing.ipynb # MITRE one-hot encoding
│   └── Initial_EDA.ipynb
└── tests/                                # Model-specific development & tests

models/            # Trained models & metrics (Jan 2026 revision)

docs/              # Documentation
├── methodology.md       # Design decisions & rationale
├── CHANGELOG.md         # Critical revision details
└── Classificazione.pdf  # Technical document
```

πŸš€ Quick Start

  1. Install requirements: pip install -r requirements.txt
  2. Run notebooks in notebook/ in order (1 to 4).
  3. The final model comparison and evaluation is in 4-Model_Training_and_Comparison.ipynb.

πŸ“Š Feature Engineering

Hierarchical aggregation (Evidence β†’ Incident level):

  • Target: max (Incident is TP if any evidence is TP)
  • Aggregations: nunique for Alerts/Countries, count for evidences, mean/max for Risk scores.
  • Temporal: Duration_seconds, Hour_mean, IsWeekend.
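
The aggregation above can be sketched with pandas named aggregation. Column names such as `IncidentId`, `AlertId`, and `RiskScore` are illustrative stand-ins, not the exact GUIDE schema; the real logic lives in `create_aggregated_features()`:

```python
import pandas as pd

# Toy evidence-level frame; real GUIDE columns differ in name and number.
evidence = pd.DataFrame({
    "IncidentId": [1, 1, 1, 2, 2],
    "AlertId":    [10, 10, 11, 20, 21],
    "Country":    ["IT", "IT", "DE", "US", "US"],
    "IsTP":       [0, 1, 0, 0, 0],
    "RiskScore":  [0.2, 0.9, 0.5, 0.1, 0.3],
})

# Collapse evidence rows to one row per incident.
incidents = evidence.groupby("IncidentId").agg(
    Target=("IsTP", "max"),              # incident is TP if any evidence is TP
    AlertCount=("AlertId", "nunique"),
    CountryCount=("Country", "nunique"),
    EvidenceCount=("IsTP", "count"),
    Risk_mean=("RiskScore", "mean"),
    Risk_max=("RiskScore", "max"),
).reset_index()
```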

Advanced encoding:

- Smoothed Risk Score: Bayesian target encoding on `AlertTitle` with smoothing (α=5) to prevent overfitting on rare categories.
- Frequency encoding: applied to high-cardinality categorical features (`GeoLoc`, `EntityType`).
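
A minimal sketch of the smoothed target encoding, assuming the standard shrinkage formula (category mean pulled toward the global mean with weight α); the helper name and toy columns are hypothetical:

```python
import pandas as pd

def smoothed_risk(df, cat_col, target_col, alpha=5.0):
    """Bayesian target encoding: shrink per-category target means toward
    the global mean; rare categories get pulled harder toward it."""
    global_mean = df[target_col].mean()
    stats = df.groupby(cat_col)[target_col].agg(["mean", "count"])
    smoothed = (stats["count"] * stats["mean"] + alpha * global_mean) / (stats["count"] + alpha)
    return df[cat_col].map(smoothed)

train = pd.DataFrame({
    "AlertTitle": ["Phish", "Phish", "Phish", "BruteForce"],
    "Target":     [1, 1, 0, 1],
})
train["SmoothedRisk"] = smoothed_risk(train, "AlertTitle", "Target")
```

To avoid target leakage, the encoding must be fit on the training split only and then mapped onto validation/test categories.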

MITRE processing:

- Parse semicolon-separated techniques.
- One-hot encode the top 20 techniques by frequency.
- Add a `MitreCount` column with the total techniques per incident.
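
The three MITRE steps can be sketched as follows (toy data; the project keeps the top 20 techniques, only the top 2 are kept here):

```python
import pandas as pd

df = pd.DataFrame({"MitreTechniques": ["T1078;T1110", "T1078", None]})

# 1. Parse semicolon-separated technique lists (missing -> empty list).
parsed = df["MitreTechniques"].fillna("").str.split(";")
parsed = parsed.apply(lambda ts: [t for t in ts if t])

# 3. Total techniques per incident.
df["MitreCount"] = parsed.apply(len)

# 2. One-hot indicators for the top-k techniques by frequency.
top = parsed.explode().value_counts().head(2).index
for tech in top:
    df[tech] = parsed.apply(lambda ts: int(tech in ts))
```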

πŸ› οΈ Utilities (guide_utils.py)

Function Purpose
load_guide_dataset(path, sample_frac=0.1) Load with memory-efficient sampling
full_preprocessing_pipeline(path) Raw CSV β†’ modeling-ready X, y
extract_temporal_features(df) Hour, DayOfWeek, IsWeekend, Duration
parse_mitre_techniques(df) Count & indicator for MITRE codes
create_aggregated_features(df) Evidence→Incident aggregations
prepare_for_modeling(df, target_col) Split X/y, drop IDs, stratified split

πŸ“ˆ Model Specifications

### XGBoost

- Parameters: `max_depth=10`, `learning_rate=0.1`, `n_estimators=300`, `subsample=0.9`
- Evaluation: 5-fold stratified CV (best CV F1: 0.8679)
- Top features (from `feature_importance.csv`):
  1. `SmoothedRisk_avg` (0.30 gain)
  2. `T1078_sum` (0.12 gain)
  3. `EvidenceRole_Related_sum` (0.06 gain)
  4. `GeoLoc_freq_avg` (0.05 gain)
  5. `EntityType_freq_mean` (0.04 gain)
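
The evaluation protocol (stratified 5-fold CV) can be sketched as follows. A scikit-learn gradient-boosting model stands in so the snippet runs without `xgboost` installed; the reported XGBoost parameters are kept as a kwargs dict for when it is:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Reported XGBoost configuration (pass to xgboost.XGBClassifier when available).
xgb_params = dict(max_depth=10, learning_rate=0.1, n_estimators=300, subsample=0.9)

# Stand-in model so this sketch runs with scikit-learn alone.
model = GradientBoostingClassifier(max_depth=3, n_estimators=50, random_state=0)

# Synthetic stand-in data; the real pipeline feeds the aggregated GUIDE features.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="f1")  # one F1 per fold
```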

### Random Forest

- `n_estimators=150`, `max_depth=20`, `min_samples_split=5`
- F1 score (TP): 0.7956

### Decision Tree

- `max_depth=15`, `min_samples_split=20`
- F1 score (TP): 0.7946

### MLP

- Sklearn: 2 hidden layers (128-64), ReLU, early stopping. F1 score: 0.7686.
- PyTorch: 3 hidden layers (128-64-32), dropout, batch norm. F1 score: 0.7651.
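
A minimal sketch of the sklearn configuration on synthetic data (the real features and reported F1 scores come from the GUIDE pipeline, not this toy set):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for the aggregated incident features.
X, y = make_classification(n_samples=400, n_features=12, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Two hidden layers (128, 64), ReLU, with early stopping on a held-out slice.
mlp = MLPClassifier(hidden_layer_sizes=(128, 64), activation="relu",
                    early_stopping=True, random_state=0, max_iter=200)
mlp.fit(X_tr, y_tr)
acc = mlp.score(X_te, y_te)
```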

πŸŽ“ Key Insights

  1. Anti-Leakage Split: Data is split at the Incident level, ensuring all evidences of a single incident stay within the same fold (Train or Val/Test).
  2. Hyperparameter Tuning: All models tuned using Stratified 5-fold Cross-Validation via RandomizedSearchCV.
  3. Missing Values:
  • MitreTechniques β†’ Top 20 OHE + Count.
  • SuspicionLevel β†’ SuspicionLevel_IsMissing indicator.
  • Use median/mode imputation or constant indicators.
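
The anti-leakage split can be sketched with scikit-learn's `GroupShuffleSplit`, grouping evidence rows by incident id (toy arrays here; the project's actual split is in `3-FeatureEngineering_Pipeline.ipynb`):

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Evidence-level rows: the group key keeps every row of an incident together.
incident_ids = np.array([1, 1, 1, 2, 2, 3, 3, 4, 5, 5])
X = np.arange(len(incident_ids)).reshape(-1, 1)
y = np.array([0, 1, 0, 0, 0, 1, 1, 0, 1, 1])

gss = GroupShuffleSplit(n_splits=1, test_size=0.4, random_state=0)
train_idx, test_idx = next(gss.split(X, y, groups=incident_ids))

# No incident appears on both sides of the split.
overlap = set(incident_ids[train_idx]) & set(incident_ids[test_idx])
```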

πŸ“– Documentation

πŸ” Evaluation

Primary Metric: F1 Score (TP) and ROC AUC

from sklearn.metrics import classification_report, roc_auc_score

print(classification_report(y_test, y_pred, target_names=['Non-TP', 'TP']))
print(f"ROC AUC: {roc_auc_score(y_test, y_pred_proba)}")

Performance analysis (XGBoost v2):

- ROC AUC: 0.9061 (strong discrimination between threats and noise)
- Recall: 0.79 (detects ~79% of actual threats)
- Precision: 0.83 (when it alerts, it is correct 83% of the time)

🀝 Contributing

When adding features:

  1. Aggregate to Incident level in 2-FeatureEngineering.ipynb
  2. Update guide_utils.create_aggregated_features()
  3. Re-run balancement and model training notebooks
  4. Compare metrics in 9-ModelComparison.ipynb

πŸ“„ License

Dataset: Microsoft GUIDE (Guided Response Dataset) - Public research dataset
