LLM Pruner

Optimize LLM depth with surgical precision. Pruneren combines state-of-the-art analysis, novel healing algorithms, and rich visualization to make models smaller and faster without sacrificing intelligence.

Features

Multiple Pruning Strategies:
- Ablation Pruning: Sensitivity-based layer removal
- AVSS: Variance analysis for redundancy detection
- PruneMe: Angular distance-based pruning (inspired by Arcee-AI's PruneMe)
- Smart Pruner: Iterative optimization with automatic healing. SmartPruner validates each cut as it is made and restores layers whose removal costs too much, aiming for the largest parameter reduction that stays within the allowed score drop.
Rich Evaluation:
- Semantic similarity scoring
- Multi-task holistic benchmarking
- Baseline vs pruned comparisons
Beautiful Visualizations:
- Layer importance heatmaps
- Activation flow analysis
- Token-level semantic consistency
- Benchmark comparison charts
Simple API: Prune models in just a few lines of code!

Quick Start

Installation

# From source
git clone https://github.com/yourusername/llm_pruner.git
cd llm_pruner
pip install -e .

# Or just install requirements
pip install -r requirements.txt

Two Ways to Use

1. Command Line (Easiest!)

llm-pruner \
  --model meta-llama/Llama-3.2-1B \
  --strategy smart \
  --eval-dataset eren23/pruner_eval \
  --output ./pruned_model \
  --visualize

See full CLI docs in USAGE.md.

2. Python API (Most Flexible)

Prune a Model in 30 Seconds

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from llm_pruner import SmartPruner, load_dataset_from_hub

# Load model
MODEL_ID = "meta-llama/Llama-3.2-1B"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    device_map="cuda",
    torch_dtype=torch.float16
).eval()

# Load evaluation data
eval_data = load_dataset_from_hub("eren23/pruner_eval")

# Prune!
pruner = SmartPruner(model, tokenizer)
scores = pruner.compute_layer_scores(eval_data)
candidates = pruner.select_layers(scores, threshold=-0.02)
optimized = pruner.optimize_selection(candidates, eval_data, tolerance=0.015)

# Save pruned model
pruned_model, _ = pruner.prune_and_save(optimized, "./pruned_model", MODEL_ID)

See quickstart.py for a complete minimal example.

Usage Examples

Example 1: Ablation Pruning

from llm_pruner import AblationPruner, plot_layer_importance

# Initialize
pruner = AblationPruner(model, tokenizer)

# Compute scores (measures impact of removing each layer)
scores = pruner.compute_layer_scores(eval_data)

# Select layers with minimal impact
layers_to_remove = pruner.select_layers(scores, threshold=-0.02)

# Visualize
plot_layer_importance(scores, save_path="importance.png")

# Export for MergeKit
pruner.export_mergekit_yaml(layers_to_remove, "config.yaml", MODEL_ID)
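
The idea behind ablation scoring can be sketched without a real model: score the model once as a baseline, then re-score it with each layer skipped and record the change. The sketch below is illustrative only. `ablation_scores`, `select_by_threshold`, and the toy `toy_evaluate` function are hypothetical stand-ins, not the `AblationPruner` internals, and reading `threshold` as a lower bound on the score change is an assumption.

```python
from typing import Callable, Dict, FrozenSet, List


def ablation_scores(
    num_layers: int,
    evaluate: Callable[[FrozenSet[int]], float],
) -> Dict[int, float]:
    """Score change (ablated - baseline) for each layer removed on its own."""
    baseline = evaluate(frozenset())
    return {
        layer: evaluate(frozenset({layer})) - baseline
        for layer in range(num_layers)
    }


def select_by_threshold(scores: Dict[int, float], threshold: float) -> List[int]:
    """Layers whose removal changes the score by at least `threshold`."""
    return sorted(layer for layer, delta in scores.items() if delta >= threshold)


# Toy evaluator with made-up numbers: each removed layer costs a fixed amount.
TOY_COST = {0: 0.30, 1: 0.01, 2: 0.00, 3: 0.25}


def toy_evaluate(removed: FrozenSet[int]) -> float:
    return 1.0 - sum(TOY_COST[layer] for layer in removed)


toy_scores = ablation_scores(4, toy_evaluate)
print(select_by_threshold(toy_scores, threshold=-0.02))
```

With these toy costs only layers 1 and 2 pass the `-0.02` threshold, because removing layer 0 or 3 drops the score far more than 0.02.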

Example 2: AVSS (Variance-Based)

from llm_pruner import AVSSPruner

# Find layers with lowest variance (least informative)
pruner = AVSSPruner(model, tokenizer)
scores = pruner.compute_layer_scores(eval_data)
layers = pruner.select_layers(scores, ratio=0.25)  # Remove 25% of layers
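
As a rough illustration of variance-based screening, the sketch below computes the variance of each layer's activations and ranks layers from lowest to highest. It is a simplification under stated assumptions: `variance_scores` and the toy activations are hypothetical, plain population variance is used, and the real `AVSSPruner` may combine variance with other statistics.

```python
from statistics import pvariance
from typing import Dict, List, Sequence


def variance_scores(activations: Dict[int, Sequence[float]]) -> Dict[int, float]:
    """Population variance of each layer's (flattened) activations."""
    return {layer: pvariance(values) for layer, values in activations.items()}


def rank_by_variance(scores: Dict[int, float]) -> List[int]:
    """Layer indices ordered from lowest to highest variance."""
    return sorted(scores, key=lambda layer: scores[layer])


# Made-up activations for a 4-layer toy model.
toy_activations = {
    0: [0.0, 2.0, -2.0, 4.0],
    1: [1.0, 1.1, 0.9, 1.0],
    2: [0.0, 1.0, -1.0, 0.5],
    3: [3.0, -3.0, 3.0, -3.0],
}
toy_variances = variance_scores(toy_activations)
print(rank_by_variance(toy_variances))
```

Layer 1 barely varies in this toy data, so it comes first in the ranking and would be the first removal candidate.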

Example 3: PruneMe (Angular Distance)

from llm_pruner import PruneMePruner

# Find layers with smallest input-output distance (most redundant)
pruner = PruneMePruner(model, tokenizer)
scores = pruner.compute_layer_scores(eval_data)
layers = pruner.select_layers(scores, ratio=0.25)
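
For reference, the angular distance used in Gromov et al. (2024) is the arccosine of the cosine similarity divided by π, which maps identical directions to 0 and opposite directions to 1. The sketch below computes it for two plain vectors. Whether `PruneMePruner` applies it to exactly these inputs (for example last-token hidden states before and after a block) is an assumption, and `angular_distance` is a hypothetical helper name.

```python
import math
from typing import Sequence


def angular_distance(x: Sequence[float], y: Sequence[float]) -> float:
    """Angular distance in [0, 1]: arccos(cosine similarity) / pi."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    cosine = dot / (norm_x * norm_y)
    # Clamp to guard against rounding slightly outside [-1, 1].
    cosine = max(-1.0, min(1.0, cosine))
    return math.acos(cosine) / math.pi


# Same direction, orthogonal, and opposite vectors.
print(angular_distance([1.0, 0.0], [2.0, 0.0]))
print(angular_distance([1.0, 0.0], [0.0, 3.0]))
print(angular_distance([1.0, 0.0], [-1.0, 0.0]))
```

A block whose output is almost parallel to its input scores near 0, which is why a small distance is read as redundancy.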

Example 4: Smart Pruning with Optimization

from llm_pruner import SmartPruner

# Start with ablation candidates, then optimize iteratively
smart_pruner = SmartPruner(model, tokenizer)

# Initial candidates
scores = smart_pruner.compute_layer_scores(eval_data)
candidates = smart_pruner.select_layers(scores, threshold=-0.02)

# Iteratively test and "heal" if performance drops too much
optimized = smart_pruner.optimize_selection(
    candidates, 
    eval_data, 
    tolerance=0.015  # Max 1.5% score drop
)
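
One way to picture the test-and-heal loop is a greedy pass with rollback: try each candidate cut, keep it if the total score drop stays within the tolerance, and otherwise restore the layer. This is a minimal sketch, not the actual `optimize_selection` implementation. `greedy_prune_with_rollback` and the toy evaluator are hypothetical, and treating `tolerance` as an absolute score drop is an assumption.

```python
from typing import Callable, FrozenSet, Iterable, List


def greedy_prune_with_rollback(
    candidates: Iterable[int],
    evaluate: Callable[[FrozenSet[int]], float],
    tolerance: float,
) -> List[int]:
    """Accept each candidate only if the total score drop stays within tolerance."""
    baseline = evaluate(frozenset())
    removed: List[int] = []
    for layer in candidates:
        trial = frozenset(removed + [layer])
        if baseline - evaluate(trial) <= tolerance:
            removed.append(layer)  # cut validated, keep it
        # Otherwise the layer is "healed": the cut is simply not kept.
    return removed


# Toy evaluator with made-up per-layer costs.
TOY_COST = {2: 0.005, 5: 0.020, 7: 0.004}


def toy_evaluate(removed: FrozenSet[int]) -> float:
    return 1.0 - sum(TOY_COST[layer] for layer in removed)


print(greedy_prune_with_rollback([2, 5, 7], toy_evaluate, tolerance=0.015))
```

Here layer 5 is rejected because removing it alongside layer 2 would exceed the tolerance, while layers 2 and 7 together stay under it.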

Example 5: Comprehensive Evaluation

from llm_pruner import ModelEvaluator, HolisticEvaluator

# Simple evaluation
evaluator = ModelEvaluator(model, tokenizer)
score = evaluator.evaluate(eval_data)

# Multi-task benchmark
holistic = HolisticEvaluator(model, tokenizer)
test_data = holistic.load_diverse_data(n_per_task=100)
results = holistic.run_benchmark(test_data)
# Results: {"GSM8k": 0.73, "HellaSwag": 0.68, ...}
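
To compare a baseline run against a pruned run, a small helper over two result dictionaries is often enough. The sketch below assumes both runs return a flat task-to-score mapping like the one shown above; `score_deltas` is a hypothetical helper and the numbers are made up for illustration.

```python
from typing import Dict


def score_deltas(
    baseline: Dict[str, float], pruned: Dict[str, float]
) -> Dict[str, float]:
    """Per-task change (pruned - baseline) for tasks present in both results."""
    return {
        task: round(pruned[task] - baseline[task], 4)
        for task in baseline
        if task in pruned
    }


# Made-up numbers in the same shape as the results dict above.
toy_orig = {"GSM8k": 0.73, "HellaSwag": 0.68}
toy_pruned = {"GSM8k": 0.70, "HellaSwag": 0.69}
print(score_deltas(toy_orig, toy_pruned))
```

Negative values mark tasks that got worse after pruning, which makes regressions on a single benchmark easy to spot even when the average barely moves.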

Example 6: Visualizations

from llm_pruner import (
    plot_layer_importance,
    plot_activation_comparison,
    plot_semantic_comparison,
    plot_holistic_benchmark
)

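# The calls below assume `scores`, `pruned_layers`, `original_model`, `pruned_model`,
# `results_orig`, and `results_pruned` were produced by earlier pruning and evaluation steps.

# Layer importance heatmap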
plot_layer_importance(scores, "Layer Importance")

# Compare activations before/after pruning
plot_activation_comparison(model, tokenizer, "Test prompt", pruned_layers)

# Token-level semantic consistency
plot_semantic_comparison(original_model, pruned_model, tokenizer, "Test prompt")

# Benchmark comparison
plot_holistic_benchmark(results_orig, results_pruned, n_samples=100)

Complete Example

See example.py for a comprehensive demo covering:

All pruning strategies
Evaluation comparisons
All visualization types
Saving and loading pruned models

Evaluation Data

The toolkit supports multiple evaluation datasets:

Load from scratch:

from llm_pruner import load_eval_dataset

eval_data = load_eval_dataset(n=120)
# Samples from: GSM8k, ARC-Easy, SciQ, TruthfulQA, OpenBookQA, TriviaQA

Upload to HuggingFace:

from llm_pruner import upload_dataset_to_hub

upload_dataset_to_hub(eval_data, "username/my-eval-dataset")

Load from HuggingFace:

from llm_pruner import load_dataset_from_hub

eval_data = load_dataset_from_hub("username/my-eval-dataset")

Advanced Features

Manual Pruning

# Get exact control
pruned_model, tokenizer = pruner.prune_and_save(
    layers_to_remove=[5, 10, 15, 20],
    output_dir="./my_pruned_model",
    model_id=MODEL_ID
)

Custom Evaluation

from llm_pruner import ModelEvaluator

# Use your own data
my_eval_data = [
    {"prompt": "Question 1", "reference": "Answer 1"},
    {"prompt": "Question 2", "reference": "Answer 2"},
]

evaluator = ModelEvaluator(model, tokenizer)
score = evaluator.evaluate(my_eval_data)

MergeKit Integration

All pruners can export MergeKit-compatible YAML configs:

pruner.export_mergekit_yaml(
    layers_to_remove=[3, 7, 11],
    yaml_path="prune_config.yaml",
    source_model=MODEL_ID
)

Then use with MergeKit:

mergekit-yaml prune_config.yaml ./output --copy-tokenizer --cuda
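
The exported file is not reproduced here, but the underlying idea is to turn a list of removed layers into the contiguous ranges that survive, then list those ranges as slices for a passthrough merge. The sketch below is an assumption-laden illustration: `kept_ranges` and `passthrough_yaml` are hypothetical helpers, the layout follows MergeKit's passthrough configuration as commonly written (half-open `layer_range` values), and the output of `export_mergekit_yaml` may differ.

```python
from typing import Iterable, List, Optional, Tuple


def kept_ranges(
    num_layers: int, layers_to_remove: Iterable[int]
) -> List[Tuple[int, int]]:
    """Contiguous half-open [start, end) ranges of layers that survive pruning."""
    removed = set(layers_to_remove)
    ranges: List[Tuple[int, int]] = []
    start: Optional[int] = None
    for layer in range(num_layers):
        if layer in removed:
            if start is not None:
                ranges.append((start, layer))
                start = None
        elif start is None:
            start = layer
    if start is not None:
        ranges.append((start, num_layers))
    return ranges


def passthrough_yaml(
    model_id: str, ranges: List[Tuple[int, int]], dtype: str = "float16"
) -> str:
    """Render a passthrough-style config as text (no YAML library needed)."""
    lines = ["slices:"]
    for start, end in ranges:
        lines += [
            "  - sources:",
            f"      - model: {model_id}",
            f"        layer_range: [{start}, {end}]",
        ]
    lines += ["merge_method: passthrough", f"dtype: {dtype}"]
    return "\n".join(lines)


print(passthrough_yaml("meta-llama/Llama-3.2-1B", kept_ranges(16, [3, 7, 11])))
```

Removing layers 3, 7, and 11 from a 16-layer model leaves four slices, one for each run of consecutive kept layers.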

Performance Tips

Start Small: Test with a small eval set first (n=20-50)
Use Smart Pruner: It searches for a removal set that stays within your tolerance, so you do not have to tune the layer list by hand
Protect Layers: First 2 and last 2 layers are always kept
Tolerance Tuning: Start with 0.01-0.02 tolerance for SmartPruner
Batch Size: Adjust based on your GPU memory
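
The layer-protection tip can be pictured as a filter applied before selection: rank layers by score, drop the first and last two from consideration, then take the requested fraction. This is a sketch under assumptions, not the library's code. `select_with_protection` is a hypothetical helper, layers are assumed to be indexed from 0, and lower scores are assumed to mean "more removable".

```python
from typing import Dict, List


def select_with_protection(
    scores: Dict[int, float], ratio: float, protect: int = 2
) -> List[int]:
    """Pick the lowest-scoring `ratio` of layers, never touching the ends."""
    num_layers = len(scores)
    protected = set(range(protect)) | set(range(num_layers - protect, num_layers))
    ranked = sorted(scores, key=lambda layer: scores[layer])
    eligible = [layer for layer in ranked if layer not in protected]
    count = int(num_layers * ratio)
    return sorted(eligible[:count])


# Made-up scores for an 8-layer toy model.
toy_scores = {0: 0.10, 1: 0.90, 2: 0.50, 3: 0.20, 4: 0.80, 5: 0.30, 6: 0.05, 7: 0.70}
print(select_with_protection(toy_scores, ratio=0.25))
```

Layers 6 and 0 have the lowest toy scores but sit in the protected ends, so the selection falls through to layers 3 and 5.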

Pruning Strategies Compared

Strategy	Speed	Accuracy	Use Case
Ablation	⭐⭐	⭐⭐⭐⭐⭐	Best quality, slower
AVSS	⭐⭐⭐⭐⭐	⭐⭐⭐	Fast screening
PruneMe	Fast	High	Balanced
SmartPruner	Slow	Highest	Best results

Example Results

Here's a real example of pruning Llama-3.2-1B (80 evaluation samples):

Model: meta-llama/Llama-3.2-1B (1,235,814,400 parameters)

EXAMPLE 1: Ablation Pruning
- Baseline Score: 0.2036
- Suggested layers to remove: [6, 8]

EXAMPLE 2: AVSS Pruning
- AVSS suggests removing: []  (model too small/uniform)

EXAMPLE 3: PruneMe (Angular Distance)
- PruneMe suggests removing: [11, 12, 13]

EXAMPLE 4: Smart Pruning (Optimization)
- Candidates: [6, 8]
- Tolerance: 1.5% drop allowed
- Baseline Score: 0.2036
- Optimized removal list: [6]
- Final Score: 0.2499 (Baseline: 0.2036)

EXAMPLE 5: Apply Pruning
- Baseline params: 1,235,814,400
- Pruned params:   1,174,992,896
- Reduction:       4.9%
- Baseline score:  0.2036
- Pruned score:    0.2499
- Delta:           +0.0464  (IMPROVEMENT!)

Total processing time: ~7 minutes on T4 GPU

The Smart Pruner removed layer 6, giving a 4.9% parameter reduction together with a +0.046 absolute score gain on the evaluation set (0.2036 to 0.2499, roughly 23% relative). This suggests that some layers can actually hurt performance on a given evaluation set. That said, in many other pruning tests performance decreased; the run above is heavily cherry-picked and is only meant to demonstrate the potential :D
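
The headline numbers can be checked with a few lines of arithmetic using the figures reported above. Note that the score delta is an absolute difference on the score scale, so it is worth separating it from the relative change. The rounded scores give a delta of 0.0463; the reported +0.0464 presumably comes from unrounded values.

```python
# Figures copied from the example results above.
baseline_params = 1_235_814_400
pruned_params = 1_174_992_896
baseline_score = 0.2036
pruned_score = 0.2499

param_reduction_pct = 100 * (baseline_params - pruned_params) / baseline_params
absolute_gain = pruned_score - baseline_score
relative_gain_pct = 100 * absolute_gain / baseline_score

print(f"parameter reduction: {param_reduction_pct:.1f}%")  # 4.9%
print(f"absolute score gain: {absolute_gain:+.4f}")        # +0.0463
print(f"relative score gain: {relative_gain_pct:.1f}%")    # 22.7%
```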

Contributing

Contributions welcome! Areas for improvement:

Additional pruning strategies
More evaluation benchmarks
Architecture support (currently supports Llama-style and GPT-style models)
Quantization integration

Citation

If you use this toolkit in your research, please cite:

@software{llm_pruner,
  title = {LLM Pruner: A Universal Toolkit for Layer Pruning},
  author = {Eren Akbulut},
  year = {2024},
  url = {https://github.com/yourusername/llm_pruner}
}

Acknowledgments

This toolkit is built upon and inspired by several important works in the field:

Papers

"The Unreasonable Ineffectiveness of the Deeper Layers" - Gromov et al. (2024)
- arXiv:2403.17887
- Demonstrates that LLMs can have a substantial fraction of their layers removed with minimal performance loss
- Foundation for our layer redundancy analysis approaches
"Shorter is Better: Depth Pruning for Large Language Models" - Bo et al. (2024)
- arXiv:2411.02117
- Provides insights into depth-based pruning strategies

Open Source Projects

MergeKit by Arcee AI
- Used for model merging and layer manipulation
- Our tool exports MergeKit-compatible configurations
PruneMe by Arcee AI
- Inspiration for angular distance-based pruning approach
- Block similarity computation methods

Techniques

Ablation studies in neural networks
AVSS (Activation Variance Scoring System)
Angular distance measurement for layer similarity
Iterative optimization with performance healing

License

MIT License - see LICENSE file for details.

Support

Issues: GitHub Issues
Discussions: GitHub Discussions

Made for the open source community
