Inference Speed Test

A Python tool for testing LLM inference speed with parallel requests. Supports OpenAI-compatible APIs like Ollama.

IMPORTANT: This tool was entirely generated by AI. USE AT YOUR OWN RISK

Features

Parallel testing: Run multiple concurrent generation threads
TPS tracking: Calculate TPS per request (tokens / request_duration)
Cumulative stats: min/max/avg TPS across all completed requests
Per-thread stats: Individual statistics for each worker thread
Graceful shutdown: Soft stop (wait for completions) or hard stop (immediate)
Ollama compatible: Works out of the box with local Ollama instances

Installation

pip install -r requirements.txt

Quick Start

Basic test (1 thread, 60 seconds)

python -m src.inference_speed_test

Test with 4 parallel threads

python -m src.inference_speed_test --parallel 4 --duration 120

Usage

python -m src.inference_speed_test [OPTIONS]

Options:
  --base-url URL        Base URL of the API server (default: http://localhost:11434)
  --endpoint {response,chat}
                        API endpoint to use (default: chat)
  --parallel N          Number of parallel generation threads (default: 1)
  --prompt TEXT         Prompt to send to the model
  --duration SECONDS    Test duration in seconds (default: 60)
  --model MODEL         Model name to use (default: auto-detect)
  -h, --help            Show help message

How TPS is Calculated

Per-thread TPS

thread_tps = thread_total_tokens / (last_request_end - first_request_start)

Total TPS

total_tps = sum of all thread_tps values

Global Stats

Min: Minimum of all thread minimum TPS values
Avg: Sum of all thread average TPS values
Max: Maximum of all thread maximum TPS values

Example Output

[T+10.0s] Completed: 3 | TPS - min: 12.45, avg: 18.32, max: 23.10 | Total: 45.20
[T+20.0s] Completed: 7 | TPS - min: 11.20, avg: 17.85, max: 24.50 | Total: 87.40

================================================================================
FINAL RESULTS
================================================================================
Total time: 60.50s
Total requests: 21
Total tokens: 15234

Total TPS (sum of per-thread TPS): 87.40

Global TPS stats:
  Min of thread mins: 11.20
  Sum of thread avgs: 72.10
  Max of thread maxs: 25.30

--------------------------------------------------------------------------------
Per-thread statistics:
--------------------------------------------------------------------------------
Thread   Reqs     Tokens     Time(s)    TPS        Min        Avg        Max       
--------------------------------------------------------------------------------
0        6        4521       58.20      22.50      12.50      18.20      22.10     
1        5        3890       59.10      19.80      11.20      16.50      20.30     
2        5        3412       57.50      21.40      13.10      19.20      25.30     
3        5        3411       58.80      18.70      12.80      18.20      23.40     
================================================================================

Examples

Test different endpoints

# Chat completions endpoint (default)
python -m src.inference_speed_test --endpoint chat

# Legacy completions endpoint
python -m src.inference_speed_test --endpoint response

Custom prompt

python -m src.inference_speed_test --prompt "Write a detailed analysis of climate change"

Specify model

python -m src.inference_speed_test --model llama3.3 --parallel 4

Custom server

python -m src.inference_speed_test --base-url http://192.168.1.100:11434

Full example

python -m src.inference_speed_test \
  --base-url http://localhost:11434 \
  --endpoint chat \
  --parallel 8 \
  --duration 300 \
  --model qwen2.5-coder:14b

Controls

During the test:

Ctrl+C once: Soft stop - waits for current requests to complete (30s timeout)
Ctrl+C twice: Hard stop - terminates immediately, displaying results

Output

During Test (every 10 seconds)

Elapsed time
Number of completed responses
Cumulative TPS statistics (min/avg/max across all requests)

Final Results

Total time and completions
Total tokens generated
Overall TPS (min/avg/max)
Per-thread breakdown with:
- Request count
- Total tokens
- TPS (min/avg/max) for that thread

Default Prompt

The default prompt is designed to generate long responses:

Write a comprehensive essay about the history and impact of artificial intelligence. Cover the early beginnings from the 1950s, through the expert systems of the 1980s, the machine learning revolution of the 2000s, to the modern era of large language models. Discuss key figures like Turing, McCarthy, Hinton, and Bengio. Explain the technological breakthroughs, the winters and springs of AI development, and the societal impacts including both benefits and concerns. Be thorough and detailed in your response, providing specific examples and dates.

Requirements

Python 3.8+
requests

License

MIT License - see LICENSE file

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
src		src
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Inference Speed Test

Features

Installation

Quick Start

Basic test (1 thread, 60 seconds)

Test with 4 parallel threads

Usage

How TPS is Calculated

Per-thread TPS

Total TPS

Global Stats

Example Output

Examples

Test different endpoints

Custom prompt

Specify model

Custom server

Full example

Controls

Output

During Test (every 10 seconds)

Final Results

Default Prompt

Requirements

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Inference Speed Test

Features

Installation

Quick Start

Basic test (1 thread, 60 seconds)

Test with 4 parallel threads

Usage

How TPS is Calculated

Per-thread TPS

Total TPS

Global Stats

Example Output

Examples

Test different endpoints

Custom prompt

Specify model

Custom server

Full example

Controls

Output

During Test (every 10 seconds)

Final Results

Default Prompt

Requirements

License

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages