
Secure and Fast Local LLM Inference Service

This is a FastAPI-based LLM service for running text-generation inference, with API-key authentication, latency tracking, rate limiting, and Prometheus metrics.


Setup

1. Create a virtual environment

python -m venv .venv
source .venv/bin/activate # on Windows: .venv\Scripts\activate

2. Install dependencies

pip install -r requirements.txt

3. Run the API server

uvicorn app.main:app --host 0.0.0.0 --port 8000 --reload

Example Request

curl -X POST http://localhost:8000/v1/infer \
    -H "Authorization: Bearer demo-key-1" \
    -H "Content-Type: application/json" \
    -d '{"prompt": "Write a haiku about speed"}'

Response

{
    "response": "Swift winds race the moon...",
    "model_inference_time": 1.23,
    "system_latency": 0.04,
    "total_time": 1.27
}
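
The same call from Python (a minimal sketch using the requests library; the URL, API key, and response field names mirror the example above):

import requests

resp = requests.post(
    "http://localhost:8000/v1/infer",
    headers={"Authorization": "Bearer demo-key-1"},
    json={"prompt": "Write a haiku about speed"},
    timeout=30,
)
resp.raise_for_status()
data = resp.json()
print(data["response"])
print("model:", data["model_inference_time"], "s,",
      "system:", data["system_latency"], "s")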

Logging and Metrics

(Screenshots: structured request logging and the metrics view.)

API Configuration

(Screenshot: API configuration view.)

Benchmarking

Run the benchmark tool to measure latency and throughput:

python benchmarks/bench.py

Output:

(.venv) PS C:\Users\CHANDRA\Documents\Github\llm-service\llm-service> python benchmarks/bench.py
200 6.8003338999988046
200 6.215892899999744
200 6.170206800001324
200 7.1004733000008855
200 7.064282899998943
200 6.215892899999744
200 6.170206800001324
200 7.1004733000008855
200 6.215892899999744
200 6.170206800001324
200 7.1004733000008855
200 7.064282899998943
200 1.9224737000040477
200 3.3218639999977313
200 6.501609099999769
200 6.8003338999988046
200 6.728283099997498
avg 5.577741190000234
p95 7.1004733000008855
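
For reference, a minimal benchmark loop in the same spirit as benchmarks/bench.py (a hypothetical sketch, not the actual script; it prints the status code and wall-clock seconds for each request, then the average and p95):

import time
import statistics
import requests

URL = "http://localhost:8000/v1/infer"
HEADERS = {"Authorization": "Bearer demo-key-1"}

latencies = []
for _ in range(17):  # 17 requests, as in the sample output above
    start = time.perf_counter()
    r = requests.post(URL, headers=HEADERS,
                      json={"prompt": "Write a haiku about speed"})
    elapsed = time.perf_counter() - start
    print(r.status_code, elapsed)
    latencies.append(elapsed)

print("avg", statistics.mean(latencies))
print("p95", statistics.quantiles(latencies, n=20)[18])  # 95th percentile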

Monitoring

Prometheus metrics are available at:

GET /metrics

Example:

# HELP python_gc_objects_collected_total Objects collected during gc
# TYPE python_gc_objects_collected_total counter
python_gc_objects_collected_total{generation="0"} 10367.0
python_gc_objects_collected_total{generation="1"} 1728.0
python_gc_objects_collected_total{generation="2"} 65.0
# HELP python_gc_objects_uncollectable_total Uncollectable objects found during GC
# TYPE python_gc_objects_uncollectable_total counter
python_gc_objects_uncollectable_total{generation="0"} 0.0
python_gc_objects_uncollectable_total{generation="1"} 0.0
python_gc_objects_uncollectable_total{generation="2"} 0.0
# HELP python_gc_collections_total Number of times this generation was collected
# TYPE python_gc_collections_total counter
python_gc_collections_total{generation="0"} 511.0
python_gc_collections_total{generation="1"} 46.0
python_gc_collections_total{generation="2"} 4.0
# HELP python_info Python platform information
# TYPE python_info gauge
python_info{implementation="CPython",major="3",minor="12",patchlevel="8",version="3.12.8"} 1.0
# HELP request_count_total Total Inference number of requests
# TYPE request_count_total counter
request_count_total{status="success"} 30.0
# HELP request_count_created Total Inference number of requests
# TYPE request_count_created gauge
request_count_created{status="success"} 1.7630548292791355e+09
# HELP llm_request_duration_seconds Request duration in seconds
# TYPE llm_request_duration_seconds histogram
llm_request_duration_seconds_bucket{le="0.005"} 0.0
llm_request_duration_seconds_bucket{le="0.01"} 0.0
llm_request_duration_seconds_bucket{le="0.025"} 0.0
llm_request_duration_seconds_bucket{le="0.05"} 0.0
llm_request_duration_seconds_bucket{le="0.075"} 0.0
llm_request_duration_seconds_bucket{le="0.1"} 0.0
llm_request_duration_seconds_bucket{le="0.25"} 0.0
llm_request_duration_seconds_bucket{le="0.5"} 0.0
llm_request_duration_seconds_bucket{le="0.75"} 0.0
llm_request_duration_seconds_bucket{le="1.0"} 0.0
llm_request_duration_seconds_bucket{le="2.5"} 5.0
llm_request_duration_seconds_bucket{le="5.0"} 17.0
llm_request_duration_seconds_bucket{le="7.5"} 26.0
llm_request_duration_seconds_bucket{le="10.0"} 27.0
llm_request_duration_seconds_bucket{le="+Inf"} 30.0
llm_request_duration_seconds_count 30.0
llm_request_duration_seconds_sum 152.02004249997117
# HELP llm_request_duration_seconds_created Request duration in seconds
# TYPE llm_request_duration_seconds_created gauge
llm_request_duration_seconds_created 1.7630548086823726e+09
# HELP llm_model_infer_duration_seconds Model inference duration in seconds
# TYPE llm_model_infer_duration_seconds histogram
llm_model_infer_duration_seconds_bucket{le="0.005"} 0.0
llm_model_infer_duration_seconds_bucket{le="0.01"} 0.0
llm_model_infer_duration_seconds_bucket{le="0.025"} 0.0
llm_model_infer_duration_seconds_bucket{le="0.05"} 0.0
llm_model_infer_duration_seconds_bucket{le="0.075"} 0.0
llm_model_infer_duration_seconds_bucket{le="0.1"} 0.0
llm_model_infer_duration_seconds_bucket{le="0.25"} 0.0
llm_model_infer_duration_seconds_bucket{le="0.5"} 0.0
llm_model_infer_duration_seconds_bucket{le="0.75"} 0.0
llm_model_infer_duration_seconds_bucket{le="1.0"} 0.0
llm_model_infer_duration_seconds_bucket{le="2.5"} 5.0
llm_model_infer_duration_seconds_bucket{le="5.0"} 17.0
llm_model_infer_duration_seconds_bucket{le="7.5"} 26.0
llm_model_infer_duration_seconds_bucket{le="10.0"} 27.0
llm_model_infer_duration_seconds_bucket{le="+Inf"} 30.0
llm_model_infer_duration_seconds_count 30.0
llm_model_infer_duration_seconds_sum 152.0094532999865
# HELP llm_model_infer_duration_seconds_created Model inference duration in seconds
# TYPE llm_model_infer_duration_seconds_created gauge
llm_model_infer_duration_seconds_created 1.7630548086823726e+09
# HELP llm_system_overhead_duration_seconds Non-model system latency per request
# TYPE llm_system_overhead_duration_seconds histogram
llm_system_overhead_duration_seconds_bucket{le="0.005"} 25.0
llm_system_overhead_duration_seconds_bucket{le="0.01"} 30.0
llm_system_overhead_duration_seconds_bucket{le="0.025"} 30.0
llm_system_overhead_duration_seconds_bucket{le="0.05"} 30.0
llm_system_overhead_duration_seconds_bucket{le="0.075"} 31.0
llm_system_overhead_duration_seconds_bucket{le="0.1"} 31.0
llm_system_overhead_duration_seconds_bucket{le="0.25"} 31.0
llm_system_overhead_duration_seconds_bucket{le="0.5"} 31.0
llm_system_overhead_duration_seconds_bucket{le="0.75"} 31.0
llm_system_overhead_duration_seconds_bucket{le="1.0"} 31.0
llm_system_overhead_duration_seconds_bucket{le="2.5"} 31.0
llm_system_overhead_duration_seconds_bucket{le="5.0"} 31.0
llm_system_overhead_duration_seconds_bucket{le="7.5"} 31.0
llm_system_overhead_duration_seconds_bucket{le="10.0"} 31.0
llm_system_overhead_duration_seconds_bucket{le="+Inf"} 31.0
llm_system_overhead_duration_seconds_count 31.0
llm_system_overhead_duration_seconds_sum 0.1829078000082518
# HELP llm_system_overhead_duration_seconds_created Non-model system latency per request
# TYPE llm_system_overhead_duration_seconds_created gauge
llm_system_overhead_duration_seconds_created 1.7630548086823726e+09
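
A sketch of how these collectors could be declared with prometheus_client (metric names and help strings are taken from the exposition above; the actual app/metrics.py may differ):

from prometheus_client import Counter, Histogram, make_asgi_app

REQUEST_COUNT = Counter(
    "request_count", "Total Inference number of requests", ["status"]
)
REQUEST_DURATION = Histogram(
    "llm_request_duration_seconds", "Request duration in seconds"
)
MODEL_INFER_DURATION = Histogram(
    "llm_model_infer_duration_seconds", "Model inference duration in seconds"
)
SYSTEM_OVERHEAD = Histogram(
    "llm_system_overhead_duration_seconds",
    "Non-model system latency per request",
)

# Recording one successful request:
REQUEST_COUNT.labels(status="success").inc()
REQUEST_DURATION.observe(1.27)

# One way to expose GET /metrics from a FastAPI app:
# app.mount("/metrics", make_asgi_app())

Note that prometheus_client appends the _total suffix to counters and uses these default histogram buckets, which is why the exposition above shows request_count_total and the 0.005 to 10.0 bucket boundaries.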

Latency stats (system-only):

GET /latency

Example:

{
  "avg": 5.90025161316941,
  "p95": 7.74999999703141,
  "count": 31
}
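
A sketch of how the avg, p95, and count fields could be computed from recorded per-request durations (a hypothetical helper; the real logic lives somewhere in app/):

def latency_stats(samples: list[float]) -> dict:
    ordered = sorted(samples)
    # Nearest-rank p95: the value below which ~95% of samples fall.
    p95_index = min(len(ordered) - 1, round(0.95 * (len(ordered) - 1)))
    return {
        "avg": sum(ordered) / len(ordered),
        "p95": ordered[p95_index],
        "count": len(ordered),
    }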

Project Structure

llm-service/
├── app/
│   ├── main.py              # FastAPI app, routes, middleware
│   ├── rate_limit.py        # Token bucket rate limiter
│   ├── metrics.py           # Prometheus metric collectors
│   └── model.py             # Model inference logic
├── benchmarks/
│   └── bench.py             # Benchmark script
├── requirements.txt
└── README.md
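
The layout above lists rate_limit.py as a token-bucket rate limiter. A minimal sketch of that pattern (class name, capacity, and refill rate are illustrative assumptions, not the repo's actual code):

import time

class TokenBucket:
    def __init__(self, capacity: float = 10, refill_rate: float = 1.0):
        self.capacity = capacity        # maximum burst size
        self.tokens = capacity          # tokens currently available
        self.refill_rate = refill_rate  # tokens added per second
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

A per-client limiter would keep one bucket per API key and return HTTP 429 when allow() is False.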

API Endpoints

Endpoint              Method  Description
/v1/infer             POST    Run inference on a prompt
/v1/chat/completions  POST    OpenAI-compatible chat endpoint
/latency              GET     Fetch system latency stats
/metrics              GET     Prometheus metrics (optional)
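
Because /v1/chat/completions is OpenAI-compatible, standard clients can point at the local server. A sketch using the official openai Python SDK (the model name is a placeholder; use whatever the service expects):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="demo-key-1")
completion = client.chat.completions.create(
    model="local-model",  # placeholder model name
    messages=[{"role": "user", "content": "Write a haiku about speed"}],
)
print(completion.choices[0].message.content)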

About

A fast, secure local LLM inference service with API-key authentication, rate limiting, latency tracking, and structured logging. Runs open-source models locally, exposes both custom and OpenAI-compatible endpoints, and includes a dashboard for testing and monitoring. Docker-ready and easy to deploy.
