This repository demonstrates an inference engine built in CUDA for the GPT2 series of models.
While the engine can be called directly from C++ (see src/cpp/main),
the project is primarily designed to be used via its Python bindings.
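To give a feel for the intended workflow, here is a minimal sketch of driving the engine from Python. The module name `gpt2_engine` and the `Engine`/`generate` names are hypothetical placeholders, not the project's actual API; see the python directory for the real bindings.

```python
# Hypothetical usage sketch: `gpt2_engine`, Engine, and generate are
# assumed names, not the real binding API shipped in python/.
import gpt2_engine

# Assumed constructor: load GPT2 weights and enable the KV cache.
engine = gpt2_engine.Engine(model="gpt2", use_kv_cache=True)

# Assumed generation call: decode a short continuation of the prompt.
print(engine.generate("The quick brown fox", max_new_tokens=32))
```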
Python development uses the uv package manager. Before compiling the CUDA code, first create a Python environment, which is required for the bindings.
```
cd python
uv sync
```

For more details, see the python directory.
Once the Python environment is set up, return to the root directory and run the following commands to compile the project.
```
mkdir build
cd build
cmake ..
make -j
```

*Figure: GPT2 standard generation vs. KV-cache enabled generation.*

*Figure: GPT2-XL standard generation vs. KV-cache enabled generation.*
As the plots show, this engine achieves performance competitive with Hugging Face Transformers in naive generation and generates faster when KV caching is enabled: the attention keys and values of previously generated tokens are cached, so each decoding step runs a forward pass only for the newest token instead of re-processing the full sequence.
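A comparison of this kind can be reproduced against the Hugging Face baseline with the sketch below; the model size, prompt, and token count are illustrative choices, not the exact configuration behind the plots.

```python
# Sketch of a naive vs. KV-cache baseline using Hugging Face Transformers.
import time
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").to(device).eval()

inputs = tokenizer("The quick brown fox", return_tensors="pt").to(device)

for use_cache in (False, True):  # naive generation, then KV-cache enabled
    start = time.perf_counter()
    with torch.no_grad():
        model.generate(**inputs, max_new_tokens=64, use_cache=use_cache,
                       do_sample=False, pad_token_id=tokenizer.eos_token_id)
    print(f"use_cache={use_cache}: {time.perf_counter() - start:.2f}s")
```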
This project uses the JSON for Modern C++ library (nlohmann/json), which is licensed under the MIT License. Copyright (c) 2013-2026 Niels Lohmann.