This is a FastAPI-based LLM service for running text generation (inference) with latency tracking, rate limiting, and Prometheus metrics.
Setup and run:

```bash
python -m venv .venv
source .venv/bin/activate   # on Windows: .venv\Scripts\activate
pip install -r requirements.txt
uvicorn app.main:app --host 0.0.0.0 --port 8000 --reload
```

Send an inference request:

```bash
curl -X POST http://localhost:8000/v1/infer \
  -H "Authorization: Bearer demo-key-1" \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Write a haiku about speed"}'
```

Example response:

```json
{
  "response": "Swift winds race the moon...",
  "model_inference_time": 1.23,
  "system_latency": 0.04,
  "total_time": 1.27
}
```

Run the benchmark tool to measure latency and throughput:

```bash
python benchmarks/bench.py
```

Output:

```text
(.venv) PS C:\Users\CHANDRA\Documents\Github\llm-service\llm-service> python benchmarks/bench.py
200 6.8003338999988046
200 6.215892899999744
200 6.170206800001324
200 7.1004733000008855
200 7.064282899998943
200 6.215892899999744
200 6.170206800001324
200 7.1004733000008855
200 6.215892899999744
200 6.170206800001324
200 7.1004733000008855
200 7.064282899998943
200 1.9224737000040477
200 3.3218639999977313
200 6.501609099999769
200 6.8003338999988046
200 6.728283099997498
avg 5.577741190000234
p95 7.1004733000008855
```

Prometheus metrics are available at:
```
GET /metrics
```

Example:

```text
# HELP python_gc_objects_collected_total Objects collected during gc
# TYPE python_gc_objects_collected_total counter
python_gc_objects_collected_total{generation="0"} 10367.0
python_gc_objects_collected_total{generation="1"} 1728.0
python_gc_objects_collected_total{generation="2"} 65.0
# HELP python_gc_objects_uncollectable_total Uncollectable objects found during GC
# TYPE python_gc_objects_uncollectable_total counter
python_gc_objects_uncollectable_total{generation="0"} 0.0
python_gc_objects_uncollectable_total{generation="1"} 0.0
python_gc_objects_uncollectable_total{generation="2"} 0.0
# HELP python_gc_collections_total Number of times this generation was collected
# TYPE python_gc_collections_total counter
python_gc_collections_total{generation="0"} 511.0
python_gc_collections_total{generation="1"} 46.0
python_gc_collections_total{generation="2"} 4.0
# HELP python_info Python platform information
# TYPE python_info gauge
python_info{implementation="CPython",major="3",minor="12",patchlevel="8",version="3.12.8"} 1.0
# HELP request_count_total Total Inference number of requests
# TYPE request_count_total counter
request_count_total{status="success"} 30.0
# HELP request_count_created Total Inference number of requests
# TYPE request_count_created gauge
request_count_created{status="success"} 1.7630548292791355e+09
# HELP llm_request_duration_seconds Request duration in seconds
# TYPE llm_request_duration_seconds histogram
llm_request_duration_seconds_bucket{le="0.005"} 0.0
llm_request_duration_seconds_bucket{le="0.01"} 0.0
llm_request_duration_seconds_bucket{le="0.025"} 0.0
llm_request_duration_seconds_bucket{le="0.05"} 0.0
llm_request_duration_seconds_bucket{le="0.075"} 0.0
llm_request_duration_seconds_bucket{le="0.1"} 0.0
llm_request_duration_seconds_bucket{le="0.25"} 0.0
llm_request_duration_seconds_bucket{le="0.5"} 0.0
llm_request_duration_seconds_bucket{le="0.75"} 0.0
llm_request_duration_seconds_bucket{le="1.0"} 0.0
llm_request_duration_seconds_bucket{le="2.5"} 5.0
llm_request_duration_seconds_bucket{le="5.0"} 17.0
llm_request_duration_seconds_bucket{le="7.5"} 26.0
llm_request_duration_seconds_bucket{le="10.0"} 27.0
llm_request_duration_seconds_bucket{le="+Inf"} 30.0
llm_request_duration_seconds_count 30.0
llm_request_duration_seconds_sum 152.02004249997117
# HELP llm_request_duration_seconds_created Request duration in seconds
# TYPE llm_request_duration_seconds_created gauge
llm_request_duration_seconds_created 1.7630548086823726e+09
# HELP llm_model_infer_duration_seconds Model inference duration in seconds
# TYPE llm_model_infer_duration_seconds histogram
llm_model_infer_duration_seconds_bucket{le="0.005"} 0.0
llm_model_infer_duration_seconds_bucket{le="0.01"} 0.0
llm_model_infer_duration_seconds_bucket{le="0.025"} 0.0
llm_model_infer_duration_seconds_bucket{le="0.05"} 0.0
llm_model_infer_duration_seconds_bucket{le="0.075"} 0.0
llm_model_infer_duration_seconds_bucket{le="0.1"} 0.0
llm_model_infer_duration_seconds_bucket{le="0.25"} 0.0
llm_model_infer_duration_seconds_bucket{le="0.5"} 0.0
llm_model_infer_duration_seconds_bucket{le="0.75"} 0.0
llm_model_infer_duration_seconds_bucket{le="1.0"} 0.0
llm_model_infer_duration_seconds_bucket{le="2.5"} 5.0
llm_model_infer_duration_seconds_bucket{le="5.0"} 17.0
llm_model_infer_duration_seconds_bucket{le="7.5"} 26.0
llm_model_infer_duration_seconds_bucket{le="10.0"} 27.0
llm_model_infer_duration_seconds_bucket{le="+Inf"} 30.0
llm_model_infer_duration_seconds_count 30.0
llm_model_infer_duration_seconds_sum 152.0094532999865
# HELP llm_model_infer_duration_seconds_created Model inference duration in seconds
# TYPE llm_model_infer_duration_seconds_created gauge
llm_model_infer_duration_seconds_created 1.7630548086823726e+09
# HELP llm_system_overhead_duration_seconds Non-model system latency per request
# TYPE llm_system_overhead_duration_seconds histogram
llm_system_overhead_duration_seconds_bucket{le="0.005"} 25.0
llm_system_overhead_duration_seconds_bucket{le="0.01"} 30.0
llm_system_overhead_duration_seconds_bucket{le="0.025"} 30.0
llm_system_overhead_duration_seconds_bucket{le="0.05"} 30.0
llm_system_overhead_duration_seconds_bucket{le="0.075"} 31.0
llm_system_overhead_duration_seconds_bucket{le="0.1"} 31.0
llm_system_overhead_duration_seconds_bucket{le="0.25"} 31.0
llm_system_overhead_duration_seconds_bucket{le="0.5"} 31.0
llm_system_overhead_duration_seconds_bucket{le="0.75"} 31.0
llm_system_overhead_duration_seconds_bucket{le="1.0"} 31.0
llm_system_overhead_duration_seconds_bucket{le="2.5"} 31.0
llm_system_overhead_duration_seconds_bucket{le="5.0"} 31.0
llm_system_overhead_duration_seconds_bucket{le="7.5"} 31.0
llm_system_overhead_duration_seconds_bucket{le="10.0"} 31.0
llm_system_overhead_duration_seconds_bucket{le="+Inf"} 31.0
llm_system_overhead_duration_seconds_count 31.0
llm_system_overhead_duration_seconds_sum 0.1829078000082518
# HELP llm_system_overhead_duration_seconds_created Non-model system latency per request
# TYPE llm_system_overhead_duration_seconds_created gauge
llm_system_overhead_duration_seconds_created 1.7630548086823726e+09
```

Latency stats (system-only) are available at:
```
GET /latency
```

Example:

```json
{
  "avg": 5.90025161316941,
  "p95": 7.74999999703141,
  "count": 31
}
```

Project structure:

```text
llm-service/
├── app/
│   ├── main.py          # FastAPI app, routes, middleware
│   ├── rate_limit.py    # Token bucket rate limiter
│   ├── metrics.py       # Prometheus metric collectors
│   └── model.py         # Model inference logic
├── benchmarks/
│   └── bench.py         # Benchmark script
├── requirements.txt
└── README.md
```

Endpoints:

| Endpoint | Method | Description |
|---|---|---|
| /v1/infer | POST | Run inference on a prompt |
| /v1/chat/completions | POST | OpenAI-compatible chat completions endpoint |
| /latency | GET | Fetch system latency stats |
| /metrics | GET | Prometheus metrics (optional) |
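app/rate_limit.py is described above as a token bucket rate limiter. As a rough illustration of the general technique (not the repo's actual code; class and field names here are made up): each API key gets a bucket that refills at a steady rate up to a burst capacity, and a request is allowed only if a token can be taken.

```python
import time


class TokenBucket:
    """Illustrative token bucket: refills at `rate` tokens/sec up to `capacity`."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity          # start full, so bursts are allowed
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill in proportion to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


# One bucket per API key, e.g. burst of 5 and 1 request/sec sustained.
buckets = {"demo-key-1": TokenBucket(rate=1.0, capacity=5.0)}
```

A request that finds its bucket empty would typically be rejected with HTTP 429.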

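The avg and p95 figures printed by bench.py and returned by /latency can be computed from a list of per-request timings. A minimal sketch using the nearest-rank percentile (the service's exact interpolation may differ):

```python
import math


def latency_stats(samples: list[float]) -> dict:
    """Average and nearest-rank p95 over a list of latencies in seconds."""
    if not samples:
        return {"avg": 0.0, "p95": 0.0, "count": 0}
    ordered = sorted(samples)
    # Nearest-rank: p95 is the value at position ceil(0.95 * n), 1-indexed.
    rank = math.ceil(0.95 * len(ordered))
    return {
        "avg": sum(ordered) / len(ordered),
        "p95": ordered[rank - 1],
        "count": len(ordered),
    }
```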

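One way to sanity-check the histograms in the /metrics dump: the mean duration is the `_sum` series divided by the `_count` series (here, 152.02 / 30 ≈ 5.07 s per request). A small sketch that extracts that ratio from a Prometheus text exposition; the parser below is illustrative, not part of this repo:

```python
def histogram_mean(exposition: str, metric: str) -> float:
    """Mean of a Prometheus histogram: <metric>_sum / <metric>_count."""
    values = {}
    for line in exposition.splitlines():
        if line.startswith("#"):
            continue  # skip HELP/TYPE comment lines
        name, _, value = line.partition(" ")
        if name in (f"{metric}_sum", f"{metric}_count"):
            values[name] = float(value)
    return values[f"{metric}_sum"] / values[f"{metric}_count"]
```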
