A Python tool for testing LLM inference speed with parallel requests. Supports OpenAI-compatible APIs like Ollama.
IMPORTANT: This tool was entirely generated by AI. USE AT YOUR OWN RISK
- Parallel testing: Run multiple concurrent generation threads
- TPS tracking: Calculate TPS per request (tokens / request_duration)
- Cumulative stats: min/max/avg TPS across all completed requests
- Per-thread stats: Individual statistics for each worker thread
- Graceful shutdown: Soft stop (wait for completions) or hard stop (immediate)
- Ollama compatible: Works out of the box with local Ollama instances
pip install -r requirements.txtpython -m src.inference_speed_testpython -m src.inference_speed_test --parallel 4 --duration 120python -m src.inference_speed_test [OPTIONS]
Options:
--base-url URL Base URL of the API server (default: http://localhost:11434)
--endpoint {response,chat}
API endpoint to use (default: chat)
--parallel N Number of parallel generation threads (default: 1)
--prompt TEXT Prompt to send to the model
--duration SECONDS Test duration in seconds (default: 60)
--model MODEL Model name to use (default: auto-detect)
-h, --help Show help message
thread_tps = thread_total_tokens / (last_request_end - first_request_start)
total_tps = sum of all thread_tps values
- Min: Minimum of all thread minimum TPS values
- Avg: Sum of all thread average TPS values
- Max: Maximum of all thread maximum TPS values
[T+10.0s] Completed: 3 | TPS - min: 12.45, avg: 18.32, max: 23.10 | Total: 45.20
[T+20.0s] Completed: 7 | TPS - min: 11.20, avg: 17.85, max: 24.50 | Total: 87.40
================================================================================
FINAL RESULTS
================================================================================
Total time: 60.50s
Total requests: 21
Total tokens: 15234
Total TPS (sum of per-thread TPS): 87.40
Global TPS stats:
Min of thread mins: 11.20
Sum of thread avgs: 72.10
Max of thread maxs: 25.30
--------------------------------------------------------------------------------
Per-thread statistics:
--------------------------------------------------------------------------------
Thread Reqs Tokens Time(s) TPS Min Avg Max
--------------------------------------------------------------------------------
0 6 4521 58.20 22.50 12.50 18.20 22.10
1 5 3890 59.10 19.80 11.20 16.50 20.30
2 5 3412 57.50 21.40 13.10 19.20 25.30
3 5 3411 58.80 18.70 12.80 18.20 23.40
================================================================================
# Chat completions endpoint (default)
python -m src.inference_speed_test --endpoint chat
# Legacy completions endpoint
python -m src.inference_speed_test --endpoint responsepython -m src.inference_speed_test --prompt "Write a detailed analysis of climate change"python -m src.inference_speed_test --model llama3.3 --parallel 4python -m src.inference_speed_test --base-url http://192.168.1.100:11434python -m src.inference_speed_test \
--base-url http://localhost:11434 \
--endpoint chat \
--parallel 8 \
--duration 300 \
--model qwen2.5-coder:14bDuring the test:
- Ctrl+C once: Soft stop - waits for current requests to complete (30s timeout)
- Ctrl+C twice: Hard stop - terminates immediately, displaying results
- Elapsed time
- Number of completed responses
- Cumulative TPS statistics (min/avg/max across all requests)
- Total time and completions
- Total tokens generated
- Overall TPS (min/avg/max)
- Per-thread breakdown with:
- Request count
- Total tokens
- TPS (min/avg/max) for that thread
The default prompt is designed to generate long responses:
Write a comprehensive essay about the history and impact of artificial intelligence. Cover the early beginnings from the 1950s, through the expert systems of the 1980s, the machine learning revolution of the 2000s, to the modern era of large language models. Discuss key figures like Turing, McCarthy, Hinton, and Bengio. Explain the technological breakthroughs, the winters and springs of AI development, and the societal impacts including both benefits and concerns. Be thorough and detailed in your response, providing specific examples and dates.
- Python 3.8+
- requests
MIT License - see LICENSE file