Skip to content

EPSILON0-dev/inference-benchmark

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

2 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Inference Speed Test

A Python tool for testing LLM inference speed with parallel requests. Supports OpenAI-compatible APIs like Ollama.

IMPORTANT: This tool was entirely generated by AI. USE AT YOUR OWN RISK

Features

  • Parallel testing: Run multiple concurrent generation threads
  • TPS tracking: Calculate TPS per request (tokens / request_duration)
  • Cumulative stats: min/max/avg TPS across all completed requests
  • Per-thread stats: Individual statistics for each worker thread
  • Graceful shutdown: Soft stop (wait for completions) or hard stop (immediate)
  • Ollama compatible: Works out of the box with local Ollama instances

Installation

pip install -r requirements.txt

Quick Start

Basic test (1 thread, 60 seconds)

python -m src.inference_speed_test

Test with 4 parallel threads

python -m src.inference_speed_test --parallel 4 --duration 120

Usage

python -m src.inference_speed_test [OPTIONS]

Options:
  --base-url URL        Base URL of the API server (default: http://localhost:11434)
  --endpoint {response,chat}
                        API endpoint to use (default: chat)
  --parallel N          Number of parallel generation threads (default: 1)
  --prompt TEXT         Prompt to send to the model
  --duration SECONDS    Test duration in seconds (default: 60)
  --model MODEL         Model name to use (default: auto-detect)
  -h, --help            Show help message

How TPS is Calculated

Per-thread TPS

thread_tps = thread_total_tokens / (last_request_end - first_request_start)

Total TPS

total_tps = sum of all thread_tps values

Global Stats

  • Min: Minimum of all thread minimum TPS values
  • Avg: Sum of all thread average TPS values
  • Max: Maximum of all thread maximum TPS values

Example Output

[T+10.0s] Completed: 3 | TPS - min: 12.45, avg: 18.32, max: 23.10 | Total: 45.20
[T+20.0s] Completed: 7 | TPS - min: 11.20, avg: 17.85, max: 24.50 | Total: 87.40

================================================================================
FINAL RESULTS
================================================================================
Total time: 60.50s
Total requests: 21
Total tokens: 15234

Total TPS (sum of per-thread TPS): 87.40

Global TPS stats:
  Min of thread mins: 11.20
  Sum of thread avgs: 72.10
  Max of thread maxs: 25.30

--------------------------------------------------------------------------------
Per-thread statistics:
--------------------------------------------------------------------------------
Thread   Reqs     Tokens     Time(s)    TPS        Min        Avg        Max       
--------------------------------------------------------------------------------
0        6        4521       58.20      22.50      12.50      18.20      22.10     
1        5        3890       59.10      19.80      11.20      16.50      20.30     
2        5        3412       57.50      21.40      13.10      19.20      25.30     
3        5        3411       58.80      18.70      12.80      18.20      23.40     
================================================================================

Examples

Test different endpoints

# Chat completions endpoint (default)
python -m src.inference_speed_test --endpoint chat

# Legacy completions endpoint
python -m src.inference_speed_test --endpoint response

Custom prompt

python -m src.inference_speed_test --prompt "Write a detailed analysis of climate change"

Specify model

python -m src.inference_speed_test --model llama3.3 --parallel 4

Custom server

python -m src.inference_speed_test --base-url http://192.168.1.100:11434

Full example

python -m src.inference_speed_test \
  --base-url http://localhost:11434 \
  --endpoint chat \
  --parallel 8 \
  --duration 300 \
  --model qwen2.5-coder:14b

Controls

During the test:

  • Ctrl+C once: Soft stop - waits for current requests to complete (30s timeout)
  • Ctrl+C twice: Hard stop - terminates immediately, displaying results

Output

During Test (every 10 seconds)

  • Elapsed time
  • Number of completed responses
  • Cumulative TPS statistics (min/avg/max across all requests)

Final Results

  • Total time and completions
  • Total tokens generated
  • Overall TPS (min/avg/max)
  • Per-thread breakdown with:
    • Request count
    • Total tokens
    • TPS (min/avg/max) for that thread

Default Prompt

The default prompt is designed to generate long responses:

Write a comprehensive essay about the history and impact of artificial intelligence. Cover the early beginnings from the 1950s, through the expert systems of the 1980s, the machine learning revolution of the 2000s, to the modern era of large language models. Discuss key figures like Turing, McCarthy, Hinton, and Bengio. Explain the technological breakthroughs, the winters and springs of AI development, and the societal impacts including both benefits and concerns. Be thorough and detailed in your response, providing specific examples and dates.

Requirements

  • Python 3.8+
  • requests

License

MIT License - see LICENSE file

About

Simple tool for measuring inference engine performance under multi-user load

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages