LUT-LLM is the first FPGA accelerator that deploys 1B+ language models with memory-based computation, leveraging vector quantization. LUT-LLM features:
- Activation-weight Co-quantization: smaller lookup tables with accuracy comparable to standard scalar quantization schemes.
- Bandwidth-aware Parallel Centroid Search: balances the resource consumption of parallel search against the latency of pipeline propagation during decoding.
- Efficient 2D table lookup: extracts rows first and then copies them, which reduces fanout and keeps the on-chip capacity required per operation low at runtime.
- Temporal-Spatial Hybrid Execution: LUT-LLM executes LUTLinear and the other engines sequentially, and keeps dataflow execution inside each engine.
Make sure your system has Vitis/Vivado 2024.2 and Gurobi installed. You may need to source `settings.sh` for Vitis and Vivado to set up the path. Also make sure you have the synthesis and implementation licenses for part number xcv80-lsva4737-2MHP-e-S (AMD V80 FPGA).
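Before building, it can help to confirm that the required tools are actually reachable from the current shell. The following is a minimal sketch, not part of this repository: `check_tools` is a hypothetical helper, and which binaries to check (for example `vivado`, `v++`, `gurobi_cl`) is an assumption about your installation.

```shell
# Sketch: report which of the given commands are missing from PATH.
# check_tools is a hypothetical helper, not a script shipped with LUT-LLM.
check_tools() {
  local tool missing=0
  for tool in "$@"; do
    # command -v succeeds only if the shell can resolve the name
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing: $tool" >&2
      missing=1
    fi
  done
  # 0 if every tool was found, 1 otherwise
  return "$missing"
}
```

For instance, `check_tools vivado v++ gurobi_cl` prints one `missing: ...` line per tool that cannot be found and returns a non-zero status if any is missing. The same helper can be reused after installing TAPA with `check_tools tapa`.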
- Install TAPA: Download this folder, untar it into your home directory, and add it to the `PATH` variable in your `~/.bashrc`:

      tar -xf tapa.tar
      export PATH="$PATH:$HOME/.rapidstream-tapa/usr/bin"

  If you got `tapa.tar.gz`:

      tar -xzvf tapa.tar.gz
      export PATH="$PATH:$HOME/.rapidstream-tapa/usr/bin"

  You can replace `$HOME` with the absolute path of the directory that contains `.rapidstream-tapa`.
- Generate the host executables for prefill and decode:

      cd qwen_block
      make csim
      make csim_decode

  Note: If you encounter the error `tapa: no such file or directory`, you can add the `export PATH` command to the makefile.

  To change the input length, simply change `const int L` to the desired value (either 32 or 128) in the `*_tb.cpp` files.
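The input-length change can also be scripted rather than edited by hand. This is a sketch under assumptions: `set_input_len` is a hypothetical helper, and it assumes the declaration sits on one line in the form `const int L = <number>;`, which may not match the actual testbench sources.

```shell
# Sketch: set `const int L` in a testbench source to 32 or 128.
# Assumes a one-line declaration such as `const int L = 32;`.
# set_input_len is a hypothetical helper, not part of this repository.
set_input_len() {
  local file=$1 len=$2
  case "$len" in
    32|128) ;;  # the only input lengths mentioned in this README
    *) echo "input length must be 32 or 128" >&2; return 1 ;;
  esac
  # Rewrite only the number after `const int L =`; a .bak copy of the file is kept.
  sed -E -i.bak "s/(const[[:space:]]+int[[:space:]]+L[[:space:]]*=[[:space:]]*)[0-9]+/\1${len}/" "$file"
}
```

A call such as `set_input_len qwen_block_tb.cpp 128` (the file name is an assumption based on the `*_tb.cpp` pattern) would then be followed by `make csim` to rebuild the host executable.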
- Run C-simulation:

      ./qwen_block
      ./qwen_block_decode

- Run HLS:

      make hls

- Run RTL simulation:

      ./qwen_block --bitstream=qwen_block.xo -xosim_save_waveform -xosim_work_dir=waveform/
      ./qwen_block_decode --bitstream=qwen_block.xo -xosim_save_waveform -xosim_work_dir=waveform/

  This can take several hours, so use tmux to run it in the background.
  Note: If you are using the VASTLab cluster with the guest account, we can provide the `.wdb` waveform database directly so that you can skip the RTL simulation.
- Open the waveform in `waveform/output/run/vivado/tapa-fast-cosim.sim/` with Vivado and get the cycle count for each of prefill and decode. To do so, choose `Flow > Open Static Simulation`, then add the `ap_start` and `ap_done` signals in the `dut` module to the wave window. Calculate the latency between these two signals and divide it by 4.

- Run the e2e latency calculator to validate. Use the cycle counts from the previous step as the arguments for the script:

      make e2e_latency
      ./e2e_latency <prefill_cycle> <decode_cycle> <input_len> <output_len>

| Platform | [32, 16] | [32, 64] | [32, 256] | [128, 16] | [128, 64] | [128, 256] |
|---|---|---|---|---|---|---|
| A100 80GB (BF16) | 241.4 | 935.0 | 3895.8 | 323.7 | 1080.2 | 4086.2 |
| A100 80GB (INT8) | 96.0 | 404.6 | 1683.0 | 128.7 | 467.3 | 1765.1 |
| A100 80GB (INT4) | 88.3 | 377.1 | 1460.5 | 118.1 | 435.4 | 1531.6 |
| MI210 (BF16) | 268.0 | 1134.0 | 4336.2 | 356.7 | 1307.7 | 4546.0 |
| MI210 (INT8) | 268.1 | 1136.4 | 4394.0 | 371.9 | 1323.5 | 4618.6 |
| LUT-LLM | 105.9 | 351.1 | 1331.8 | 284.7 | 529.9 | 1510.7 |

Column headers are `[batch size, sequence length]`.

GPU latencies are measured using `vllm bench latency`.
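To reproduce the LUT-LLM numbers, the cycle counts passed to `e2e_latency` come from the `ap_start` and `ap_done` timestamps read in the waveform step. The divide-by-4 rule from that step can be scripted as below. This is a sketch only: `cycles_between` is a hypothetical helper, and it assumes both timestamps are integers in the same time unit as displayed in the Vivado wave window.

```shell
# Sketch: turn ap_start / ap_done timestamps into a cycle count
# using the divide-by-4 rule from the waveform step.
# cycles_between is a hypothetical helper, not part of this repository.
cycles_between() {
  local start=$1 done_t=$2
  if [ "$done_t" -lt "$start" ]; then
    echo "ap_done must not precede ap_start" >&2
    return 1
  fi
  # Integer division; any remainder is discarded
  echo $(( (done_t - start) / 4 ))
}
```

The result of `cycles_between <ap_start_time> <ap_done_time>` for the prefill and decode runs can then be used as `<prefill_cycle>` and `<decode_cycle>` when invoking `./e2e_latency`.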
- `ccu`: the bandwidth-aware centroid search unit (BPCSU)
- `ffn`: the feedforward layer
- `gqa`: the grouped-query attention
- `imm`: the 2D table lookup engine
- `lut-dla`: the LUTLinear engine
- `rope`, `rms_norm`, `silu`: non-linear operations (RoPE, RMSNorm, SiLU)
- `qwen_block`: the Qwen 3 1.7B model
  - `e2e_latency.cpp`: latency calculator
  - `example.pwr`: power report
  - `qwen_v80.pdi`: bitstream for V80
  - `timing.rpt`: post-routing timing report
  - `qwen_block_tb`, `qwen_block_decode_tb`: host programs for prefill and decode
- `qwen_lut_model`: performance modeling scripts
- `custom_design`: scripts to generate the design connected to HBM
- `rapidstream_script`: scripts to use RapidStream for floorplanning


