Static malware detection for Windows PE files using machine learning. No execution, no sandbox. Just the file.
It parses a PE file, extracts a 2381-dimensional feature vector using the EMBER v2 feature set, and runs it through a LightGBM classifier to predict whether the file is malware or benign. The vector is made up of the following feature groups:
| Category | Dimensions |
|---|---|
| Byte Histogram | 256 |
| Byte Entropy | 256 |
| String Features | 104 |
| General File Info | 10 |
| Header Features | 62 |
| Section Features | 255 |
| Import Features | 1280 |
| Export Features | 128 |
| Data Directories | 30 |
| Total | 2381 |
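
To make the layout concrete, the sketch below records the group sizes from the table and computes one of the simpler groups, a normalized byte histogram, in plain Python. It is an illustration only: `GROUP_DIMS` and `byte_histogram` are hypothetical names, and the actual extraction in this project is done by EMBER, whose implementation may differ in detail.

```python
from collections import Counter

# Feature group sizes, copied from the table above.
GROUP_DIMS = {
    "byte_histogram": 256,
    "byte_entropy": 256,
    "strings": 104,
    "general": 10,
    "header": 62,
    "section": 255,
    "imports": 1280,
    "exports": 128,
    "data_directories": 30,
}


def byte_histogram(data: bytes) -> list[float]:
    """Return the relative frequency of each byte value 0..255 (illustrative)."""
    total = len(data)
    if total == 0:
        # An empty input has no bytes to count, so every bin stays at zero.
        return [0.0] * 256
    counts = Counter(data)  # iterating over bytes yields ints in 0..255
    return [counts.get(value, 0) / total for value in range(256)]


if __name__ == "__main__":
    # The group sizes should add up to the full vector length.
    assert sum(GROUP_DIMS.values()) == 2381
    hist = byte_histogram(b"MZ\x90\x00\x03\x00\x00\x00")
    print(len(hist), hist[0x00])
```

The other groups follow the same idea of turning one aspect of the file into a fixed number of values, which is why the concatenated vector always has the same length regardless of file size.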
Tested on 200,000 samples (100,000 benign, 100,000 malware):
| Metric | Score |
|---|---|
| Accuracy | 0.9596 |
| Precision | 0.9540 |
| Recall | 0.9659 |
| F1 Score | 0.9599 |
| AUC-ROC | 0.9920 |
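
As a quick consistency check, F1 is the harmonic mean of precision and recall, so it can be recomputed from the two reported values. The helper names below are illustrative, and the confusion counts in the usage lines are toy numbers, not results from this project.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict[str, float]:
    """Compute accuracy, precision, recall, and F1 from confusion counts."""
    precision = tp / (tp + fp)  # share of flagged files that are malware
    recall = tp / (tp + fn)     # share of malware files that were flagged
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "f1": f1_score(precision, recall),
    }


if __name__ == "__main__":
    # Recompute F1 from the precision and recall reported in the table.
    print(round(f1_score(0.9540, 0.9659), 4))  # 0.9599
    # Toy confusion counts, for illustration only.
    print(classification_metrics(tp=8, fp=2, tn=9, fn=1))
```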
git clone https://github.com/Nubaise/mlw.git
cd mlw
py -3.10 -m venv venv
venv\Scripts\activate
pip install -r requirements.txt
Download EMBER 2018 from https://github.com/elastic/ember and extract into data/ember2018/.
Vectorize the raw features before training:
python -c "import ember; ember.create_vectorized_features('data/ember2018')"
Train:
python -m src.train
Evaluate:
python -m src.evaluate
Scan a file:
python -m src.predict path/to/file.exe
Built with Python 3.10, LightGBM, EMBER v2, LIEF, scikit-learn, and NumPy.
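
For orientation, a scan step like the `python -m src.predict` command above could be built on EMBER's `predict_sample` helper and a saved LightGBM model. This is a rough sketch, not the contents of `src.predict`: the model path, the 0.5 threshold, and the function names `score_file` and `verdict` are assumptions made for illustration.

```python
import sys
from pathlib import Path

MODEL_PATH = "models/model.txt"  # hypothetical location of the trained model
THRESHOLD = 0.5                  # assumed decision threshold, not from the project


def verdict(score: float, threshold: float = THRESHOLD) -> str:
    """Map a model score in [0, 1] to a label."""
    return "malware" if score >= threshold else "benign"


def score_file(pe_path: str, model_path: str = MODEL_PATH) -> float:
    """Score one PE file with a saved LightGBM model (sketch)."""
    # Imported lazily so verdict() stays usable without these packages installed.
    import ember
    import lightgbm as lgb

    model = lgb.Booster(model_file=model_path)
    file_data = Path(pe_path).read_bytes()
    # predict_sample extracts the EMBER feature vector and returns the model score.
    return float(ember.predict_sample(model, file_data, feature_version=2))


if __name__ == "__main__" and len(sys.argv) == 2:
    score = score_file(sys.argv[1])
    print(f"{sys.argv[1]}: {verdict(score)} (score={score:.4f})")
```

Keeping the threshold separate from the scoring function makes it easy to trade precision against recall without retraining.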