aupadhyay/attnres-exp


Attention Residuals + Value Residuals

Playing around with Attention Residuals on NanoGPT. Implements a few variants, including adaptive block boundaries and value residual learning.

Setup

uv sync
uv run python -m pytest tests/ -v

Training

# launch on Modal (detached, runs on A100)
modal run --detach modal_train.py --variant baseline # baseline (standard PreNorm residuals)
modal run --detach modal_train.py --variant full_attnres # full AttnRes
modal run --detach modal_train.py --variant block_attnres # block AttnRes
modal run --detach modal_train.py --variant adaptive_attnres # adaptive block boundaries
modal run --detach modal_train.py --variant value_residual # block AttnRes + value residual learning
modal run --detach modal_train.py --variant value_residual_only # value residual learning only
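For orientation, here is a rough, dependency-free sketch of the ideas behind these variants. It is not the code in this repo. It assumes that AttnRes replaces the plain sum over earlier layer outputs with a softmax-weighted sum, scored by a learned per-layer query vector against normalized layer outputs, and that value residual learning mixes the current layer's attention values with the first layer's values through a learned scalar lambda. All function names below are hypothetical.

```python
import math


def rms_norm(vec, eps=1e-6):
    """Scale a vector to roughly unit root-mean-square (no learned gain)."""
    rms = math.sqrt(sum(x * x for x in vec) / len(vec) + eps)
    return [x / rms for x in vec]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def baseline_residual(layer_outputs):
    """Standard residual stream: unweighted sum of all earlier outputs."""
    dim = len(layer_outputs[0])
    return [sum(out[i] for out in layer_outputs) for i in range(dim)]


def depth_attention_mix(layer_outputs, query):
    """Assumed AttnRes-style mix: softmax over depth, then a weighted sum.

    `query` stands in for a learned per-layer vector (an assumption here).
    """
    scores = [
        sum(q * k for q, k in zip(query, rms_norm(out)))
        for out in layer_outputs
    ]
    weights = softmax(scores)
    dim = len(layer_outputs[0])
    mixed = [
        sum(w * out[i] for w, out in zip(weights, layer_outputs))
        for i in range(dim)
    ]
    return mixed, weights


def mix_value_residual(v_layer, v_first, lam):
    """Assumed value residual: convex mix of current and first-layer values."""
    return [lam * a + (1.0 - lam) * b for a, b in zip(v_layer, v_first)]
```

With an all-zero query every score is equal, so the depth weights are uniform and the mix reduces to the mean of the earlier outputs. In this sketch, a block variant would pass per-block summaries instead of every layer output.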

Visuals

The analyze/ directory has scripts to generate:

  • Depth attention heatmaps
  • Training dynamics (gradient norms, activation magnitudes)
  • Loss curve comparisons
  • Adaptive boundary gate plots
  • Value residual lambda-by-layer charts
  • Query vector PCA/cosine similarity plots
  • Per-token routing visualizations (these weren't very useful)
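As a hint at what the query cosine similarity plot measures, here is a minimal sketch that builds a layer-by-layer cosine similarity matrix from a list of query vectors. It is not taken from the analyze/ scripts. The function names are hypothetical, and it assumes one query vector per layer is available as a plain list of floats.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def cosine_matrix(queries):
    """Pairwise cosine similarity; entry [i][j] compares layers i and j."""
    return [[cosine_similarity(a, b) for b in queries] for a in queries]
```

The resulting square matrix can be passed to any heatmap plotter. Values near 1 off the diagonal would suggest that two layers learned similar queries.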
