
Secure and Fast Local LLM Inference Service

This is a FastAPI-based LLM service for running text-generation inference, with API-key authentication, latency tracking, rate limiting, and Prometheus metrics.


Setup

1. Create a virtual environment

python -m venv .venv
source .venv/bin/activate # on Windows: .venv\Scripts\activate

2. Install dependencies

pip install -r requirements.txt

3. Run the API server

uvicorn app.main:app --host 0.0.0.0 --port 8000 --reload

Example Request

curl -X POST http://localhost:8000/v1/infer \
    -H "Authorization: Bearer demo-key-1" \
    -H "Content-Type: application/json" \
    -d '{"prompt": "Write a haiku about speed"}'

Response

{
    "response": "Swift winds race the moon...",
    "model_inference_time": 1.23,
    "system_latency": 0.04,
    "total_time": 1.27
}
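
The same call from Python (a minimal sketch using the requests library; the URL, API key, and response field names mirror the example above):

import requests

resp = requests.post(
    "http://localhost:8000/v1/infer",
    headers={"Authorization": "Bearer demo-key-1"},
    json={"prompt": "Write a haiku about speed"},
    timeout=30,
)
resp.raise_for_status()
data = resp.json()
print(data["response"])
print("model:", data["model_inference_time"], "s,",
      "system:", data["system_latency"], "s")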

Logging and Metrics

(Screenshots: structured request logging and the metrics view.)

API Configuration

(Screenshot: API configuration view.)

Benchmarking

Run the benchmark tool to measure latency and throughput:

python benchmarks/bench.py

Output:

(.venv) PS C:\Users\CHANDRA\Documents\Github\llm-service\llm-service> python benchmarks/bench.py
200 6.8003338999988046
200 6.215892899999744
200 6.170206800001324
200 7.1004733000008855
200 7.064282899998943
200 6.215892899999744
200 6.170206800001324
200 7.1004733000008855
200 6.215892899999744
200 6.170206800001324
200 7.1004733000008855
200 7.064282899998943
200 1.9224737000040477
200 3.3218639999977313
200 6.501609099999769
200 6.8003338999988046
200 6.728283099997498
avg 5.577741190000234
p95 7.1004733000008855
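
For reference, a minimal benchmark loop in the same spirit as benchmarks/bench.py (a hypothetical sketch, not the actual script; it prints the status code and wall-clock seconds for each request, then the average and p95):

import time
import statistics
import requests

URL = "http://localhost:8000/v1/infer"
HEADERS = {"Authorization": "Bearer demo-key-1"}

latencies = []
for _ in range(17):  # 17 requests, as in the sample output above
    start = time.perf_counter()
    r = requests.post(URL, headers=HEADERS,
                      json={"prompt": "Write a haiku about speed"})
    elapsed = time.perf_counter() - start
    print(r.status_code, elapsed)
    latencies.append(elapsed)

print("avg", statistics.mean(latencies))
print("p95", statistics.quantiles(latencies, n=20)[18])  # 95th percentile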

Monitoring

Prometheus metrics are available at:

GET /metrics

Example:

# HELP python_gc_objects_collected_total Objects collected during gc
# TYPE python_gc_objects_collected_total counter
python_gc_objects_collected_total{generation="0"} 10367.0
python_gc_objects_collected_total{generation="1"} 1728.0
python_gc_objects_collected_total{generation="2"} 65.0
# HELP python_gc_objects_uncollectable_total Uncollectable objects found during GC
# TYPE python_gc_objects_uncollectable_total counter
python_gc_objects_uncollectable_total{generation="0"} 0.0
python_gc_objects_uncollectable_total{generation="1"} 0.0
python_gc_objects_uncollectable_total{generation="2"} 0.0
# HELP python_gc_collections_total Number of times this generation was collected
# TYPE python_gc_collections_total counter
python_gc_collections_total{generation="0"} 511.0
python_gc_collections_total{generation="1"} 46.0
python_gc_collections_total{generation="2"} 4.0
# HELP python_info Python platform information
# TYPE python_info gauge
python_info{implementation="CPython",major="3",minor="12",patchlevel="8",version="3.12.8"} 1.0
# HELP request_count_total Total Inference number of requests
# TYPE request_count_total counter
request_count_total{status="success"} 30.0
# HELP request_count_created Total Inference number of requests
# TYPE request_count_created gauge
request_count_created{status="success"} 1.7630548292791355e+09
# HELP llm_request_duration_seconds Request duration in seconds
# TYPE llm_request_duration_seconds histogram
llm_request_duration_seconds_bucket{le="0.005"} 0.0
llm_request_duration_seconds_bucket{le="0.01"} 0.0
llm_request_duration_seconds_bucket{le="0.025"} 0.0
llm_request_duration_seconds_bucket{le="0.05"} 0.0
llm_request_duration_seconds_bucket{le="0.075"} 0.0
llm_request_duration_seconds_bucket{le="0.1"} 0.0
llm_request_duration_seconds_bucket{le="0.25"} 0.0
llm_request_duration_seconds_bucket{le="0.5"} 0.0
llm_request_duration_seconds_bucket{le="0.75"} 0.0
llm_request_duration_seconds_bucket{le="1.0"} 0.0
llm_request_duration_seconds_bucket{le="2.5"} 5.0
llm_request_duration_seconds_bucket{le="5.0"} 17.0
llm_request_duration_seconds_bucket{le="7.5"} 26.0
llm_request_duration_seconds_bucket{le="10.0"} 27.0
llm_request_duration_seconds_bucket{le="+Inf"} 30.0
llm_request_duration_seconds_count 30.0
llm_request_duration_seconds_sum 152.02004249997117
# HELP llm_request_duration_seconds_created Request duration in seconds
# TYPE llm_request_duration_seconds_created gauge
llm_request_duration_seconds_created 1.7630548086823726e+09
# HELP llm_model_infer_duration_seconds Model inference duration in seconds
# TYPE llm_model_infer_duration_seconds histogram
llm_model_infer_duration_seconds_bucket{le="0.005"} 0.0
llm_model_infer_duration_seconds_bucket{le="0.01"} 0.0
llm_model_infer_duration_seconds_bucket{le="0.025"} 0.0
llm_model_infer_duration_seconds_bucket{le="0.05"} 0.0
llm_model_infer_duration_seconds_bucket{le="0.075"} 0.0
llm_model_infer_duration_seconds_bucket{le="0.1"} 0.0
llm_model_infer_duration_seconds_bucket{le="0.25"} 0.0
llm_model_infer_duration_seconds_bucket{le="0.5"} 0.0
llm_model_infer_duration_seconds_bucket{le="0.75"} 0.0
llm_model_infer_duration_seconds_bucket{le="1.0"} 0.0
llm_model_infer_duration_seconds_bucket{le="2.5"} 5.0
llm_model_infer_duration_seconds_bucket{le="5.0"} 17.0
llm_model_infer_duration_seconds_bucket{le="7.5"} 26.0
llm_model_infer_duration_seconds_bucket{le="10.0"} 27.0
llm_model_infer_duration_seconds_bucket{le="+Inf"} 30.0
llm_model_infer_duration_seconds_count 30.0
llm_model_infer_duration_seconds_sum 152.0094532999865
# HELP llm_model_infer_duration_seconds_created Model inference duration in seconds
# TYPE llm_model_infer_duration_seconds_created gauge
llm_model_infer_duration_seconds_created 1.7630548086823726e+09
# HELP llm_system_overhead_duration_seconds Non-model system latency per request
# TYPE llm_system_overhead_duration_seconds histogram
llm_system_overhead_duration_seconds_bucket{le="0.005"} 25.0
llm_system_overhead_duration_seconds_bucket{le="0.01"} 30.0
llm_system_overhead_duration_seconds_bucket{le="0.025"} 30.0
llm_system_overhead_duration_seconds_bucket{le="0.05"} 30.0
llm_system_overhead_duration_seconds_bucket{le="0.075"} 31.0
llm_system_overhead_duration_seconds_bucket{le="0.1"} 31.0
llm_system_overhead_duration_seconds_bucket{le="0.25"} 31.0
llm_system_overhead_duration_seconds_bucket{le="0.5"} 31.0
llm_system_overhead_duration_seconds_bucket{le="0.75"} 31.0
llm_system_overhead_duration_seconds_bucket{le="1.0"} 31.0
llm_system_overhead_duration_seconds_bucket{le="2.5"} 31.0
llm_system_overhead_duration_seconds_bucket{le="5.0"} 31.0
llm_system_overhead_duration_seconds_bucket{le="7.5"} 31.0
llm_system_overhead_duration_seconds_bucket{le="10.0"} 31.0
llm_system_overhead_duration_seconds_bucket{le="+Inf"} 31.0
llm_system_overhead_duration_seconds_count 31.0
llm_system_overhead_duration_seconds_sum 0.1829078000082518
# HELP llm_system_overhead_duration_seconds_created Non-model system latency per request
# TYPE llm_system_overhead_duration_seconds_created gauge
llm_system_overhead_duration_seconds_created 1.7630548086823726e+09
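
A sketch of how these collectors could be declared with prometheus_client (metric names and help strings are taken from the exposition above; the actual app/metrics.py may differ):

from prometheus_client import Counter, Histogram, make_asgi_app

REQUEST_COUNT = Counter(
    "request_count", "Total Inference number of requests", ["status"]
)
REQUEST_DURATION = Histogram(
    "llm_request_duration_seconds", "Request duration in seconds"
)
MODEL_INFER_DURATION = Histogram(
    "llm_model_infer_duration_seconds", "Model inference duration in seconds"
)
SYSTEM_OVERHEAD = Histogram(
    "llm_system_overhead_duration_seconds",
    "Non-model system latency per request",
)

# Recording one successful request:
REQUEST_COUNT.labels(status="success").inc()
REQUEST_DURATION.observe(1.27)

# One way to expose GET /metrics from a FastAPI app:
# app.mount("/metrics", make_asgi_app())

Note that prometheus_client appends the _total suffix to counters and uses these default histogram buckets, which is why the exposition above shows request_count_total and the 0.005 to 10.0 bucket boundaries.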

Latency stats (system-only):

GET /latency

Example:

{
  "avg": 5.90025161316941,
  "p95": 7.74999999703141,
  "count": 31
}
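
A sketch of how the avg, p95, and count fields could be computed from recorded per-request durations (a hypothetical helper; the real logic lives somewhere in app/):

def latency_stats(samples: list[float]) -> dict:
    ordered = sorted(samples)
    # Nearest-rank p95: the value below which ~95% of samples fall.
    p95_index = min(len(ordered) - 1, round(0.95 * (len(ordered) - 1)))
    return {
        "avg": sum(ordered) / len(ordered),
        "p95": ordered[p95_index],
        "count": len(ordered),
    }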

Project Structure

llm-service/
├── app/
│   ├── main.py              # FastAPI app, routes, middleware
│   ├── rate_limit.py        # Token bucket rate limiter
│   ├── metrics.py           # Prometheus metric collectors
│   └── model.py             # Model inference logic
├── benchmarks/
│   └── bench.py             # Benchmark script
├── requirements.txt
└── README.md
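
The layout above lists rate_limit.py as a token-bucket rate limiter. A minimal sketch of that pattern (class name, capacity, and refill rate are illustrative assumptions, not the repo's actual code):

import time

class TokenBucket:
    def __init__(self, capacity: float = 10, refill_rate: float = 1.0):
        self.capacity = capacity        # maximum burst size
        self.tokens = capacity          # tokens currently available
        self.refill_rate = refill_rate  # tokens added per second
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

A per-client limiter would keep one bucket per API key and return HTTP 429 when allow() is False.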

API Endpoints

Endpoint              Method  Description
/v1/infer             POST    Run inference on a prompt
/v1/chat/completions  POST    OpenAI-compatible chat endpoint
/latency              GET     Fetch system latency stats
/metrics              GET     Prometheus metrics (optional)
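
Because /v1/chat/completions is OpenAI-compatible, standard clients can point at the local server. A sketch using the official openai Python SDK (the model name is a placeholder; use whatever the service expects):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="demo-key-1")
completion = client.chat.completions.create(
    model="local-model",  # placeholder model name
    messages=[{"role": "user", "content": "Write a haiku about speed"}],
)
print(completion.choices[0].message.content)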

About

A fast, secure local LLM inference service with API-key authentication, rate limiting, latency tracking, and structured logging. Runs open-source models locally, exposes both custom and OpenAI-compatible endpoints, and includes a dashboard for testing and monitoring. Docker-ready and easy to deploy.
