
LUT-LLM: Efficient Language Model Inference with Memory-based Computations on FPGAs


LUT-LLM is the first FPGA accelerator that deploys a 1B+ language model with memory-based computation, leveraging vector quantization. LUT-LLM features:

  • Activation-Weight Co-quantization: shrinks the lookup tables while keeping accuracy comparable to standard scalar quantization schemes.
  • Bandwidth-aware Parallel Centroid Search: trades off the resource consumption of parallel search against the pipeline-propagation latency during decoding.
  • Efficient 2D Table Lookup: extracts rows and then copies them to reduce fanout, requiring little on-chip capacity per operation at runtime (a minimal sketch of the lookup idea follows this list).
  • Temporal-Spatial Hybrid Execution: LUT-LLM executes sequentially between the LUTLinear engine and the other engines, while keeping dataflow execution inside each engine.
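The sketch below illustrates the general idea of memory-based computation with vector quantization: an activation sub-vector is mapped to its nearest centroid, and the dot products between every centroid and a weight sub-vector are precomputed into a lookup table, so the runtime multiply-accumulate becomes a table read. This is only an illustration; all sizes, names, and values are assumptions and it is not the actual LUTLinear kernel.

// Minimal sketch of LUT-based (memory-based) computation with vector
// quantization. All sizes, names, and values are illustrative assumptions,
// not the actual LUT-LLM kernels.
#include <array>
#include <cstdio>

constexpr int DIM = 4;      // sub-vector (codeword) dimension
constexpr int N_CENT = 16;  // number of activation centroids

// Nearest-centroid search (the role played by the centroid search unit).
int nearest_centroid(const float* x,
                     const std::array<std::array<float, DIM>, N_CENT>& cent) {
  int best = 0;
  float best_dist = 1e30f;
  for (int c = 0; c < N_CENT; ++c) {
    float dist = 0.f;
    for (int k = 0; k < DIM; ++k) {
      float diff = x[k] - cent[c][k];
      dist += diff * diff;
    }
    if (dist < best_dist) { best_dist = dist; best = c; }
  }
  return best;
}

int main() {
  // Activation centroids (codebook); in the real design these come from
  // activation-weight co-quantization.
  std::array<std::array<float, DIM>, N_CENT> cent{};
  for (int c = 0; c < N_CENT; ++c)
    for (int k = 0; k < DIM; ++k)
      cent[c][k] = 0.1f * c - 0.05f * k;

  // One weight sub-vector, kept dense here for clarity.
  const float w[DIM] = {0.5f, -1.0f, 0.25f, 2.0f};

  // Offline: precompute the table of <centroid, weight> dot products so the
  // runtime multiply-accumulate becomes a table read.
  float lut[N_CENT];
  for (int c = 0; c < N_CENT; ++c) {
    float acc = 0.f;
    for (int k = 0; k < DIM; ++k) acc += cent[c][k] * w[k];
    lut[c] = acc;
  }

  // Runtime: quantize the incoming activation sub-vector to its nearest
  // centroid, then look up the (approximate) dot product.
  const float x[DIM] = {0.3f, 0.1f, -0.2f, 0.4f};
  const int idx = nearest_centroid(x, cent);
  std::printf("approximate dot product = %f\n", lut[idx]);
  return 0;
}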

Artifact Evaluation

Make sure your system has Vitis/Vivado 2024.2 and Gurobi installed. You may need to run settings.sh to set up the paths for Vitis and Vivado. Please make sure you have a synthesis and implementation license for part number xcv80-lsva4737-2MHP-e-S (the AMD V80 FPGA).

  1. Install TAPA: download and untar this archive into your home directory, then add the PATH variable to your ~/.bashrc:
tar -xf tapa.tar
export PATH="$PATH:$HOME/.rapidstream-tapa/usr/bin"

If you have tapa.tar.gz instead:

tar -xzvf tapa.tar.gz
export PATH="$PATH:$HOME/.rapidstream-tapa/usr/bin"

You can replace $HOME with the absolute path to the directory containing .rapidstream-tapa.

  2. Generate the host executables for prefill and decode.
cd qwen_block
make csim
make csim_decode

Note

If you encounter the error tapa: no such file or directory, add the export PATH command to the makefile.

To change the input length, simply change const int L to the desired value (either 32 or 128) in the *_tb.cpp files.
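For reference, the line to edit looks roughly like the excerpt below; the exact declaration and its location in the testbenches may differ, so treat this as a hypothetical snippet.

// Hypothetical excerpt from qwen_block_tb.cpp / qwen_block_decode_tb.cpp:
const int L = 32;  // input length; set to 32 or 128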

  3. Run C simulation
./qwen_block
./qwen_block_decode
  4. Run HLS
make hls
  5. Run RTL simulation
./qwen_block --bitstream=qwen_block.xo -xosim_save_waveform -xosim_work_dir=waveform/
./qwen_block_decode --bitstream=qwen_block.xo -xosim_save_waveform -xosim_work_dir=waveform/

This can take several hours, so use tmux to run it in the background.

Note

If you are using the VASTLab cluster with the guest account, we can provide the wdb waveform files directly to save the time of running RTL simulation.

  6. Open the waveform in waveform/output/run/vivado/tapa-fast-cosim.sim/ with Vivado and get the cycle count for each run. Use Flow > Open Static Simulation, then add the ap_start and ap_done signals from the dut module to the wave window. Measure the latency between these two signals and divide it by 4 to obtain the cycle count (a small helper sketch follows).
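If it helps, the small helper below turns the two timestamps read from the waveform into the cycle count expected by the calculator; it simply applies the divide-by-4 rule from this step. The timestamp unit is whatever the wave window reports (an assumption here), and this helper is not part of the repository.

// Illustrative helper (not part of the repository): converts the ap_start
// and ap_done timestamps read from the waveform into a cycle count by
// applying the divide-by-4 rule described above.
#include <cstdio>
#include <cstdlib>

int main(int argc, char** argv) {
  if (argc != 3) {
    std::fprintf(stderr, "usage: %s <ap_start_time> <ap_done_time>\n", argv[0]);
    return 1;
  }
  const double t_start = std::atof(argv[1]);
  const double t_done  = std::atof(argv[2]);
  std::printf("cycle count: %.0f\n", (t_done - t_start) / 4.0);
  return 0;
}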

  7. Run the e2e latency calculator to validate. Use the cycle counts from the previous step as arguments to the script:

make e2e_latency
./e2e_latency <prefill_cycle> <decode_cycle> <input_len> <output_len>

Inference Latency (ms)

| Platform          | [32, 16] | [32, 64] | [32, 256] | [128, 16] | [128, 64] | [128, 256] |
|-------------------|----------|----------|-----------|-----------|-----------|------------|
| A100 80GB (BF16)  | 241.4    | 935.0    | 3895.8    | 323.7     | 1080.2    | 4086.2     |
| A100 80GB (INT8)  | 96.0     | 404.6    | 1683.0    | 128.7     | 467.3     | 1765.1     |
| A100 80GB (INT4)  | 88.3     | 377.1    | 1460.5    | 118.1     | 435.4     | 1531.6     |
| MI210 (BF16)      | 268.0    | 1134.0   | 4336.2    | 356.7     | 1307.7    | 4546.0     |
| MI210 (INT8)      | 268.1    | 1136.4   | 4394.0    | 371.9     | 1323.5    | 4618.6     |
| LUT-LLM           | 105.9    | 351.1    | 1331.8    | 284.7     | 529.9     | 1510.7     |

Column headers are [input length, output length].

GPU latencies are measured using vllm bench latency.

Project Structure

  • ccu: the bandwidth-aware centroid search unit (BPCSU)
  • ffn: the feedforward layer
  • gqa: the grouped-query attention
  • imm: the 2D table lookup engine
  • lut-dla: the LUTLinear engine
  • rope, rms_norm, silu: non-linear operations (RoPE, RMSNorm, SiLU)
  • qwen_block: the Qwen 3 1.7B model
    • e2e_latency.cpp: latency calculator
    • example.pwr: power report
    • qwen_v80.pdi: bitstream for V80
    • timing.rpt: post-routing timing report
    • qwen_block_tb, qwen_block_decode_tb: host code for prefill and decode
  • qwen_lut_model: performance modeling scripts
  • custom_design: scripts to generate design connected with HBM
  • rapidstream_script: scripts to use RapidStream for floorplanning

About

[FCCM 2026] Official repository for LUT-LLM: Efficient Language Model Inference with Memory-based Computation on FPGAs
