This repository demonstrates an inference engine built in CUDA for the GPT2 series of models.
While the engine can be called directly from C++ (see src/cpp/main),
the project is primarily designed to be used via its Python bindings.
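To give a feel for the intended workflow, here is a minimal sketch of driving the engine from Python. The module name `gpt2_engine` and the `Engine`/`generate` names are hypothetical placeholders, not the project's actual API; see the python directory for the real bindings.

```python
# Hypothetical usage sketch: `gpt2_engine`, Engine, and generate are
# assumed names, not the real binding API shipped in python/.
import gpt2_engine

# Assumed constructor: load GPT2 weights and enable the KV cache.
engine = gpt2_engine.Engine(model="gpt2", use_kv_cache=True)

# Assumed generation call: decode a short continuation of the prompt.
print(engine.generate("The quick brown fox", max_new_tokens=32))
```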
Python development uses the uv package manager. Before compiling the CUDA code, first create a Python environment, which is required for the bindings.
```
cd python
uv sync
```

For more details, see the python directory.
Once the Python environment is set up, return to the root directory and run the following commands to compile the project.
```
mkdir build
cd build
cmake ..
make -j
```

*Figure: GPT2 standard generation vs. KV-cache enabled generation.*

*Figure: GPT2-XL standard generation vs. KV-cache enabled generation.*
As the plots show, this engine achieves performance competitive with Hugging Face Transformers in naive generation and generates faster when KV caching is enabled: the attention keys and values of previously generated tokens are cached, so each decoding step runs a forward pass only for the newest token instead of re-processing the full sequence.
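A comparison of this kind can be reproduced against the Hugging Face baseline with the sketch below; the model size, prompt, and token count are illustrative choices, not the exact configuration behind the plots.

```python
# Sketch of a naive vs. KV-cache baseline using Hugging Face Transformers.
import time
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").to(device).eval()

inputs = tokenizer("The quick brown fox", return_tensors="pt").to(device)

for use_cache in (False, True):  # naive generation, then KV-cache enabled
    start = time.perf_counter()
    with torch.no_grad():
        model.generate(**inputs, max_new_tokens=64, use_cache=use_cache,
                       do_sample=False, pad_token_id=tokenizer.eos_token_id)
    print(f"use_cache={use_cache}: {time.perf_counter() - start:.2f}s")
```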
This project uses the JSON for Modern C++ library (nlohmann/json), which is licensed under the MIT License. Copyright (c) 2013-2026 Niels Lohmann.