# Inference Engine for GPT2

This repository demonstrates an inference engine built in CUDA for the GPT2 series of models.

## Build the Python Module

While the engine can be called directly from C++ (see src/cpp/main), the project is primarily designed to be used via Python bindings.
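As a rough sketch of what calling the engine from Python might look like (pseudocode — the module and function names below are assumptions for illustration, not the project's actual API; see the python directory for the real interface):

```
# hypothetical names -- consult the python directory for the real API
import inference_engine                        # compiled CUDA binding module
model = inference_engine.load("gpt2")          # load GPT2 weights
text = model.generate("Hello", use_kv_cache=True)
```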

### Setup

The uv package manager is used for Python development. Before compiling the CUDA code, first create a Python environment; it is required for the bindings.

```sh
cd python
uv sync
```

For more details, see the python directory.

### Build

Once the Python environment is set up, return to the root directory and run the following commands to compile the project.

```sh
mkdir build
cd build
cmake ..
make -j
```

## Results

*GPT2: standard generation vs. KV-cache-enabled generation*

*GPT2-XL: standard generation vs. KV-cache-enabled generation*

As the plots show, this engine achieves performance competitive with Hugging Face Transformers in naive generation and generates faster when KV caching is enabled.
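The speedup from KV caching comes from not re-projecting every past token's keys and values at each decoding step. The toy NumPy single-head attention below (an illustrative sketch, not the engine's CUDA code) produces identical outputs both ways while counting how many key/value projections each strategy performs:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8          # model / head dimension
steps = 16     # number of decoding steps
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

def attend(q, K, V):
    # scaled dot-product attention for a single query vector
    w = np.exp(q @ K.T / np.sqrt(d))
    return (w / w.sum()) @ V

# embeddings of a fixed "prompt + generated" token sequence
x = rng.standard_normal((steps, d))

# --- naive: recompute K and V for all past tokens at every step ---
naive_proj = 0
naive_out = []
for t in range(1, steps + 1):
    K = x[:t] @ Wk
    V = x[:t] @ Wv
    naive_proj += 2 * t                 # t key + t value projections
    naive_out.append(attend(x[t - 1] @ Wq, K, V))

# --- KV cache: project only the newest token, append to the cache ---
cached_proj = 0
K_cache = np.empty((0, d))
V_cache = np.empty((0, d))
cached_out = []
for t in range(1, steps + 1):
    K_cache = np.vstack([K_cache, x[t - 1] @ Wk])
    V_cache = np.vstack([V_cache, x[t - 1] @ Wv])
    cached_proj += 2                    # one key + one value projection
    cached_out.append(attend(x[t - 1] @ Wq, K_cache, V_cache))

assert np.allclose(naive_out, cached_out)   # identical outputs
print(naive_proj, cached_proj)              # 272 vs 32 projections
```

The projection count grows quadratically with sequence length in the naive loop but only linearly with a cache, which is why the gap widens for longer generations.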

## Third-Party Notices

This project uses the JSON for Modern C++ library (nlohmann/json), which is licensed under the MIT License. Copyright (c) 2013-2026 Niels Lohmann.
